Apply sklearn trained model on a dataframe with PySpark

I had to do the same thing in a recent project. The drawback of applying a UDF for each row is that PySpark has to read the sklearn model for every single row, which is why it takes ages to finish. The best solution I have found was to use the .mapPartitions or foreachPartition method on the RDD; a really good explanation is here:

https://github.com/mahmoudparsian/pyspark-tutorial/blob/master/tutorial/map-partitions/README.md
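For comparison, the slow row-by-row pattern mentioned above looks roughly like this (just a sketch; `model` and the feature columns `f1`, `f2`, `f3` are placeholders for your own estimator and data):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(returnType=DoubleType())
def predict_one(f1, f2, f3):
    # the model is captured in the closure and predict runs once per row
    return float(model.predict([[f1, f2, f3]])[0])

df_slow = df.withColumn("prediction", predict_one("f1", "f2", "f3"))
```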

The partition-based approach works fast because it guarantees there is no shuffling, and for each partition PySpark has to read the model and call predict only once. So the flow would be (a full sketch follows the list):

  • convert the DataFrame to an RDD
  • broadcast the model to the nodes so it is accessible to the workers
  • write a function which takes an iterator (yielding all rows within a partition) as an argument
  • iterate through the rows and build a feature matrix from them (column order matters)
  • call .predict only once
  • return the predictions
  • transform the RDD back to a DataFrame if needed
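
Putting it together, a minimal sketch could look like this. Here `spark`, the DataFrame `df` and the fitted sklearn `model` are assumed to exist already, and `f1`/`f2`/`f3` stand in for your actual feature columns:

```python
from pyspark.sql import Row

# broadcast the fitted model once so every worker can use the same copy
bc_model = spark.sparkContext.broadcast(model)

# column order must match the order the model was trained on
FEATURES = ["f1", "f2", "f3"]

def predict_partition(rows):
    rows = list(rows)
    if not rows:
        return iter([])
    # build one feature matrix for the whole partition
    X = [[row[c] for c in FEATURES] for row in rows]
    # call .predict only once per partition
    preds = bc_model.value.predict(X)
    # attach a prediction to each original row (float(...) assumes numeric output)
    return iter(
        Row(**row.asDict(), prediction=float(p)) for row, p in zip(rows, preds)
    )

# convert the DataFrame to an RDD, apply the function per partition,
# and go back to a DataFrame
predictions_df = df.rdd.mapPartitions(predict_partition).toDF()
```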