Does Spark preserve record order when reading in ordered files?

Order is not preserved when the data is shuffled. You can, however, enumerate the rows before doing your computations. If you are using an RDD, there is a function called zipWithIndex (RDD[T] => RDD[(T, Long)]) that does exactly what you are looking for.
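A minimal sketch of that approach, assuming a local SparkSession and a hypothetical input path:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("order-demo")
  .master("local[*]") // assumption: local mode for illustration
  .getOrCreate()

// "input.txt" is a placeholder path
val lines: RDD[String] = spark.sparkContext.textFile("input.txt")

// zipWithIndex pairs each record with its position in the
// original partition/file order, producing RDD[(String, Long)]
val indexed: RDD[(String, Long)] = lines.zipWithIndex()
```

Note that zipWithIndex triggers a Spark job when the RDD has more than one partition, since it must count the records in the earlier partitions to compute each partition's starting offset.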


Yes, when reading from a file, Spark maintains the order of records. But when shuffling occurs, the order is not preserved. So to preserve the order, you either need to write your program so that no shuffling of data occurs, or you assign sequence numbers to the records and use those sequence numbers during processing.

In a distributed framework like Spark, where data is partitioned across a cluster for fast processing, shuffling of data is bound to occur. So the best solution is to assign a sequence number to each row and use that sequence number for ordering.
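That pattern can be sketched as follows; the `indexed` RDD and the uppercase transformation are assumptions for illustration:

```scala
import org.apache.spark.rdd.RDD

// assume `indexed: RDD[(String, Long)]` came from lines.zipWithIndex()
def restoreOrder(indexed: RDD[(String, Long)]): RDD[String] =
  indexed
    .map { case (line, idx) => (idx, line.toUpperCase) } // any transformation, possibly shuffling
    .sortBy(_._1)  // sort by the saved index to restore the original file order
    .map(_._2)     // drop the index once order is recovered
```

sortBy itself shuffles, but because it sorts on the index captured before any other shuffle, the final output comes back in the original record order.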

Tags:

Apache Spark