How can I merge spark results files without repartition and copyMerge?

Unfortunately, there is no other option for getting a single output file directly from Spark. Instead of repartition(1) you can use coalesce(1), which skips the full shuffle, but with parameter 1 the end result is the same: all of your data ends up in a single partition that is processed and written by one task. That removes parallelism from the write and might cause an OOM error if your data is too big.
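
As a rough illustration of the coalesce(1) approach, here is a minimal sketch. It assumes a local Spark session and uses a small generated dataset and a temporary local directory; the object name `SingleFileOutput`, the helper `writeSingleFile`, and the paths are made up for this example, and in a real job you would point it at your own DataFrame and HDFS path.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SingleFileOutput {
  // Hypothetical helper: writes a demo dataset through coalesce(1)
  // and returns how many part files ended up in the output directory.
  def writeSingleFile(spark: SparkSession, outDir: String): Int = {
    // Demo data spread over 8 partitions (stand-in for a real result set).
    val df = spark.range(0, 1000).repartition(8)

    // Collapse to one partition so a single task writes a single part file.
    df.coalesce(1).write.mode("overwrite").csv(outDir)

    // Count data files only; checksum and _SUCCESS files are skipped.
    new File(outDir).listFiles().count(_.getName.startsWith("part-"))
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for running the sketch outside a cluster.
    val spark = SparkSession.builder()
      .appName("single-file-output")
      .master("local[2]")
      .getOrCreate()
    try {
      val outDir = Files.createTempDirectory("spark-out").resolve("result").toString
      println(s"part files written: ${writeSingleFile(spark, outDir)}")
    } finally {
      spark.stop()
    }
  }
}
```

Note that the output is still a directory containing one part file (plus marker files), not a bare file, and the single writing task has to handle the whole dataset.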

Another option for merging files on HDFS is to write a simple MapReduce job (or a Pig job, or a Hadoop Streaming job) that takes the whole directory as input and uses a single reducer to generate a single output file. Be aware, though, that with the MapReduce approach all the data is first copied to the reducer's local filesystem, which might cause an "out of space" error.

Here are some useful links on the same topic:

merge output files after reduce phase
Merging hdfs files
Merging multiple files into one within Hadoop

Tags: Hadoop, Scala, Apache Spark
