Spark's Dataset unpersist behaviour

Answer for Spark 2.4:

There was a ticket about correctness of Dataset caching behaviour; see https://issues.apache.org/jira/browse/SPARK-24596

According to Maryann Xue's description, caching now works in the following manner:

  1. Drop tables and regular (persistent) views: regular mode
  2. Drop temporary views: non-cascading mode
  3. Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
  4. Call Dataset.unpersist(): non-cascading mode
  5. Call Catalog.uncacheTable(): follow the same convention as dropping tables/views, i.e. use non-cascading mode for temporary views and regular mode for the rest

Where "regular mode" means mdoe from the questions and @Avishek's answer and non-cascading mode means, that extension won't be unpersisted


This is expected behaviour of Spark caching. Spark doesn't want to keep invalid data in the cache, so it completely removes all cached plans that refer to the dataset.

This is done to make sure queries stay correct. In the example, you are creating the extension dataset from the cached dataset's data. If that data is unpersisted, the extension dataset can no longer rely on it.
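For comparison, a minimal sketch of this cascading ("regular mode") behaviour, assuming a pre-2.4 Spark version; the names are again illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cascading-unpersist-demo")
  .getOrCreate()
import spark.implicits._

val cached = spark.range(0, 100).toDF("id").cache()
cached.count()

// The extension's cached plan embeds the cached plan of `cached`.
val extension = cached.filter($"id" % 2 === 0).cache()
extension.count()

// Regular (cascading) mode: unpersisting `cached` also drops every
// cached plan that refers to it, so `extension` falls back to
// StorageLevel.NONE and is recomputed from source on the next action.
cached.unpersist()
println(s"extension storage level: ${extension.storageLevel}") // NONE on pre-2.4
```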

Here is the pull request for the fix they made; see also the related JIRA ticket.