How to partition RDD by key in Spark?

How about just doing a groupByKey on kind? Or another PairRDDFunctions method.
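
For instance, assuming a DeviceData case class with a kind field (my assumption, based on the question), that could be as simple as:

rdd.groupBy(_.kind)  // RDD[(String, Iterable[DeviceData])]: everything of one kind together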

It sounds to me like you don't really care about the partitioning itself, just that you get all items of a specific kind in one processing flow?

The pair functions allow this:

rdd.keyBy(_.kind).partitionBy(new HashPartitioner(PARTITIONS))
   .foreachPartition(...)
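
Filled in, a minimal sketch might look like the following; the body of foreachPartition is purely illustrative (here it just prints each record):

import org.apache.spark.HashPartitioner

rdd.keyBy(_.kind)
   .partitionBy(new HashPartitioner(PARTITIONS))
   .foreachPartition { iter =>
     // all records with the same kind land in the same partition,
     // so each partition can be processed as one unit
     iter.foreach { case (kind, data) => println(s"$kind -> $data") }
   }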

However, you can probably be a little safer with something more like:

rdd.keyBy(_.kind).reduceByKey(...)

or mapValues, or a number of the other pair functions that guarantee you get each key's pieces as a whole.
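
As a concrete sketch of that route, assuming DeviceData also has a numeric value field (purely an assumption for illustration), reduceByKey could aggregate per kind like this:

rdd.keyBy(_.kind)
   .mapValues(_.value)   // assumed numeric field on DeviceData
   .reduceByKey(_ + _)   // one combined result per kind, wherever the records live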


Would it be correct to partition an RDD[DeviceData] by overriding the deviceData.hashCode() method and using only the hash code of kind?

It wouldn't be. If you take a look at the Java Object.hashCode documentation, you'll find the following information about the general contract of hashCode:

If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

So unless a notion of equality based purely on the kind of device fits your use case, and I seriously doubt it does, tinkering with hashCode to get the desired partitioning is a bad idea. In the general case you should implement your own partitioner, but here it is not required (a sketch follows).
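
For completeness, implementing your own partitioner might look something like this sketch (KindPartitioner is a made-up name, and it assumes the keys are the kind strings):

import org.apache.spark.Partitioner

class KindPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    // non-negative modulo, so negative hash codes still map to a valid partition
    val mod = key.asInstanceOf[String].hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

You would then pass it to partitionBy in place of the HashPartitioner used below.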

Since, excluding specialized scenarios in Spark SQL and GraphX, partitionBy is valid only on a PairRDD, it makes sense to create an RDD[(String, DeviceData)] and use a plain HashPartitioner:

deviceDataRdd.map(dev => (dev.kind, dev)).partitionBy(new HashPartitioner(n))
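
If the partitioned RDD is reused by several actions, it is also worth persisting it (n is whatever partition count you choose):

import org.apache.spark.HashPartitioner

val byKind = deviceDataRdd
  .map(dev => (dev.kind, dev))
  .partitionBy(new HashPartitioner(n))
  .persist()  // so downstream actions don't re-evaluate the lineage, including the shuffle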

Just keep in mind that in a situation where kind has low cardinality or a highly skewed distribution, using it for partitioning may not be an optimal solution.
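
A quick way to check that up front (a sketch; countByValue collects the counts to the driver, which is fine when kind has few distinct values):

deviceDataRdd.map(_.kind).countByValue()   // Map[String, Long]: records per kind
  .foreach { case (kind, count) => println(s"$kind: $count") }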