How to filter based on array value in PySpark?

In Spark 2.4 and later you can filter array elements using the filter function in the SQL API.

Here's an example in PySpark. It removes every element that is an empty string from each array:

from pyspark.sql.functions import expr

df = df.withColumn("ArrayColumn", expr("filter(ArrayColumn, x -> x != '')"))

For equality-based queries you can use array_contains:

df = sc.parallelize([(1, [1, 2, 3]), (2, [4, 5, 6])]).toDF(["k", "v"])

# With SQL (the DataFrame has to be registered as a view first)
df.createOrReplaceTempView("df")
sqlContext.sql("SELECT * FROM df WHERE array_contains(v, 1)")

# With DSL
from pyspark.sql.functions import array_contains
df.where(array_contains("v", 1))

If you want to use more complex predicates, you'll have to either explode the array or use a UDF, for example something like this:

from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf 

def exists(f):
    return udf(lambda xs: any(f(x) for x in xs), BooleanType())

df.where(exists(lambda x: x > 3)("v"))

In Spark 2.4 or later it is also possible to use higher-order functions:

from pyspark.sql.functions import expr

df.where(expr("""aggregate(
    transform(v, x -> x > 3),
    false,
    (x, y) -> x or y
)"""))

or

df.where(expr("""
    exists(v, x -> x > 3)
"""))

Python wrappers are available as of Spark 3.1 (SPARK-30681).