How do I repartition a PySpark DataFrame?

# DataFrames are immutable: repartition() returns a NEW DataFrame
print(df.rdd.getNumPartitions())
# 1

# calling repartition() without reassigning the result has no effect
df.repartition(5)
print(df.rdd.getNumPartitions())
# 1

# reassign to keep the repartitioned DataFrame
df = df.repartition(5)
print(df.rdd.getNumPartitions())
# 5

See Spark: The Definitive Guide, Chapter 5 ("Basic Structured Operations").
ISBN-13: 978-1491912218
ISBN-10: 1491912219


You can check the number of partitions:

data.rdd.getNumPartitions()

(Note: data.rdd.partitions.size is the Scala API; in PySpark use getNumPartitions().)

To change the number of partitions:

newDF = data.repartition(3000)

Then verify:

newDF.rdd.getNumPartitions()

Beware: repartition() triggers a full shuffle of the data across the cluster, which is expensive. If you only need to reduce the number of partitions, consider coalesce(), which merges existing partitions and avoids a full shuffle.