How to split parquet files into many partitions in Spark?

You should write your parquet files with a smaller block size. The default is 128 MB per block, but it is configurable by setting parquet.block.size in the writer.
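
For example, here is a minimal sketch in Scala of writing with a smaller block size. The session, the DataFrame df, and the output path are placeholders; parquet.block.size is picked up from the Hadoop configuration by ParquetOutputFormat, and on recent Spark versions it can usually also be passed as a write option.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("small-row-groups").getOrCreate()
    import spark.implicits._

    // Placeholder data; in practice df is whatever you need to write out.
    val df = (1 to 1000000).toDF("id")

    // 8 MB row groups instead of the 128 MB default => more blocks per file,
    // hence more potential read parallelism later.
    spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024 * 1024)

    df.write.parquet("/tmp/small-blocks")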

The source of ParquetOutputFormat is here, if you want to dig into the details.

The block size is the minimum amount of data you can read out of a parquet file that is logically readable (since parquet is columnar, you can't just split by line or anything similarly trivial), so you can't have more reading threads than input blocks.
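
If you want to check how many blocks (row groups) a written file ends up with, you can read its footer with the parquet-hadoop classes that ship with Spark. This is only a sketch: the part-file path is a placeholder and must point at an individual .parquet file, not the output directory, and readFooter is deprecated in newer parquet-mr releases in favour of ParquetFileReader.open.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.format.converter.ParquetMetadataConverter
    import org.apache.parquet.hadoop.ParquetFileReader

    // Read one part file's footer and count its row groups (blocks).
    val footer = ParquetFileReader.readFooter(
      new Configuration(),
      new Path("/tmp/small-blocks/part-00000.parquet"),   // placeholder part-file name
      ParquetMetadataConverter.NO_FILTER)

    // Each block is the smallest unit a reader can pick up independently.
    println(s"row groups: ${footer.getBlocks.size()}")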


The new way of doing it (Spark 2.x) is to set

spark.sql.files.maxPartitionBytes

Source: https://issues.apache.org/jira/browse/SPARK-17998 (the official documentation is not correct yet; it is missing the .sql)
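
A minimal sketch of the Spark 2.x approach (Scala; the path and the 16 MB cap are just illustrative, and the exact partition count also depends on settings such as spark.sql.files.openCostInBytes):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("many-read-partitions")
      // Cap each read partition at 16 MB (the default is 128 MB), so one large
      // parquet file is split across many tasks.
      .config("spark.sql.files.maxPartitionBytes", 16L * 1024 * 1024)
      .getOrCreate()

    val df = spark.read.parquet("/path/to/big.parquet")   // placeholder path
    println(df.rdd.getNumPartitions)                      // roughly fileSize / 16 MB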

In my experience, the Hadoop settings no longer have any effect.