Difference between === null and isNull in Spark DataDrame

First and foremost don't use null in your Scala code unless you really have to for compatibility reasons.

Regarding your question it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, result is undefined for any value including NULL itself.

spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
|         null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
|               null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
|                                null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
|                          null|
+------------------------------+

The only valid methods to check for NULL are:

  • IS NULL:

    spark.sql("SELECT NULL IS NULL").show
    
    +--------------+
    |(NULL IS NULL)|
    +--------------+
    |          true|
    +--------------+
    
    spark.sql("SELECT TRUE IS NULL").show
    
    +--------------+
    |(true IS NULL)|
    +--------------+
    |         false|
    +--------------+
    
  • IS NOT NULL:

    spark.sql("SELECT NULL IS NOT NULL").show
    
    +------------------+
    |(NULL IS NOT NULL)|
    +------------------+
    |             false|
    +------------------+
    
    spark.sql("SELECT TRUE IS NOT NULL").show
    
    +------------------+
    |(true IS NOT NULL)|
    +------------------+
    |              true|
    +------------------+
    

implemented in DataFrame DSL as Column.isNull and Column.isNotNull respectively.

Note:

For NULL-safe comparisons use IS DISTINCT / IS NOT DISTINCT:

spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
|                           false|
+--------------------------------+

or not(_ <=> _) / <=>

spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|          false|
+---------------+

in SQL and DataFrame DSL respectively.

Related:

Including null values in an Apache Spark Join


Usually the best way to shed light onto unexpected results in Spark Dataframes is to look at the explain plan. Consider the following example:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Example extends App {

  val session = SparkSession.builder().master("local[*]").getOrCreate()
  case class Record(c1: String, c2: String)
  val data = List(Record("a", "b"), Record(null, "c"))
  val rdd = session.sparkContext.parallelize(data)
  import session.implicits._

  val df: DataFrame = rdd.toDF
  val filtered = df.filter(col("c1") === null)
  println(filtered.count()) // <-- outputs 0, not expected

  val filtered2 = df.filter(col("c1").isNull)
  println(filtered2.count())
  println(filtered2) // <- outputs 1, as expected

  filtered.explain(true)
  filtered2.explain(true)
}

The first explain plan shows:

== Physical Plan ==
*Filter (isnotnull(c1#2) && null)
+- Scan ExistingRDD[c1#2,c2#3]
== Parsed Logical Plan ==
'Filter isnull('c1)
+- LogicalRDD [c1#2, c2#3]

This filter clause looks nonsensical. The && to null ensures this can never resolve to true.

The second explain plan looks like:

== Physical Plan ==
*Filter isnull(c1#2)
+- Scan ExistingRDD[c1#2,c2#3]

Here the filter is what expect and want.