SQL query in Spark/scala Size exceeds Integer.MAX_VALUE

No Spark shuffle block can be larger than 2GB (Integer.MAX_VALUE bytes) so you need more / smaller partitions.

You should adjust spark.default.parallelism and spark.sql.shuffle.partitions (default 200) such that the number of partitions can accommodate your data without reaching the 2GB limit (you could try aiming for 256MB / partition so for 200GB you get 800 partitions). Thousands of partitions is very common so don't be afraid to repartition to 1000 as suggested.

FYI, you may check the number of partitions for an RDD with something like rdd.getNumPartitions (i.e. d2.rdd.getNumPartitions)

There's a story to track the effort of addressing the various 2GB limits (been open for a while now): https://issues.apache.org/jira/browse/SPARK-6235

See http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications/25 for more info on this error.

SQL query in Spark/scala Size exceeds Integer.MAX_VALUE

Tags:

Sql

Amazon Ec2

Emr

Apache Spark

Related

Recent Posts