How can I check whether my RDD or dataframe is cached or not?

Since Spark 2.1.0 (Scala), this can be checked for a DataFrame as follows:

dataframe.storageLevel.useMemory
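
If you want to know whether the DataFrame is persisted at any level (memory, disk, or both) rather than only in memory, one option is to compare its storage level against StorageLevel.NONE. A minimal Scala sketch, assuming a Spark 2.1+ SparkSession; the app name and sample data are placeholders chosen for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("cache-check")
  .master("local[*]")      // local master just for this example
  .getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("value")

// StorageLevel.NONE means the DataFrame has not been persisted at all
df.storageLevel != StorageLevel.NONE   // false: not cached yet

df.cache()
df.storageLevel != StorageLevel.NONE   // true: marked for caching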

@Arnab,

Did you find the function in Python?
Here is an example for DataFrame DF:

DF.cache()
print(DF.is_cached)

Hope this helps.
Ram


You can call storageLevel.useMemory on the DataFrame and getStorageLevel.useMemory on the RDD to find out whether the dataset is in memory.

For the DataFrame, do this:

scala> val df = Seq(1, 2).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.storageLevel.useMemory
res0: Boolean = false

scala> df.cache()
res1: df.type = [value: int]

scala> df.storageLevel.useMemory
res2: Boolean = true

For the RDD do this:

scala> val rdd = sc.parallelize(Seq(1,2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd.getStorageLevel.useMemory
res9: Boolean = false

scala> rdd.cache()
res10: rdd.type = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd.getStorageLevel.useMemory
res11: Boolean = true
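
If the data is registered as a table or temporary view, you can also ask the catalog for its cache status via spark.catalog.isCached. A short sketch, continuing with the df defined above and assuming a Spark 2.x session; the view name my_view is just an example:

df.createOrReplaceTempView("my_view")
spark.catalog.isCached("my_view")     // false before caching

spark.catalog.cacheTable("my_view")
spark.catalog.isCached("my_view")     // true after cacheTable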
