What is a glom?. How it is different from mapPartitions?

Does glom shuffle the data across partitions

No, it doesn't

If this is the second case I believe that the same can be achieved using mapPartitions

It can:

rdd.mapPartitions(iter => Iterator(_.toArray))

but the same thing applies to any non shuffling transformation like map, flatMap or filter.

if there are any use cases which benefit from glob.

Any situation where you need to access partition data in a form that is traversable more than once.


glom() transforms each partition into a tuple (immutabe list) of elements. It creates an RDD of tuples. One tuple per partition.