Differences between null and NaN in spark? How to deal with it?

null values represents "no value" or "nothing", it's not even an empty string or zero. It can be used to represent that nothing useful exists.

NaN stands for "Not a Number", it's usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.

One possible way to handle null values is to remove them with:

df.na.drop()

Or you can change them to an actual value (here I used 0) with:

df.na.fill(0)

Another way would be to select the rows where a specific column is null for further processing:

df.where(col("a").isNull())
df.where(col("a").isNotNull())

Rows with NaN can also be selected using the equivalent method:

from pyspark.sql.functions import isnan
df.where(isnan(col("a")))

You can diference your NaN values using the function isnan, like this example

>>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b"))
>>> df.select(isnan("a").alias("r1"), isnan(df.a).alias("r2")).collect()
[Row(r1=False, r2=False), Row(r1=True, r2=True)]

The difference is in the type of the object that generates the value. NaN (not a number) is an old fashion way to deal with the "None value for a number", you can think that you have all the numbers (-1-2...0,1,2...) and there is the need to have and extra value, for cases of errors (example, 1/0), I want that 1/0 gives me a number, but which number? well, like there is number for 1/0, they create a new value called NaN, that is also of type Number.

None is used for the void, absence of an element, is even more abstract, because inside the number type, you have, besides de NaN value, the None value. The None value is present in all the sets of values of all the types

Differences between null and NaN in spark? How to deal with it?

Tags:

Python

Null

Nan

Apache Spark

Pyspark

Related

Recent Posts