Apply sklearn trained model on a dataframe with PySpark

I had to do the same thing in a recent project. The drawback of applying a UDF for each row is that PySpark has to read the sklearn model for every single row, which is why it takes ages to finish. The best solution I have found was to use the .mapPartitions or foreachPartition method on the RDD; a really good explanation is here:

https://github.com/mahmoudparsian/pyspark-tutorial/blob/master/tutorial/map-partitions/README.md
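For comparison, the slow row-by-row pattern mentioned above looks roughly like this (just a sketch; `model` and the feature columns `f1`, `f2`, `f3` are placeholders for your own estimator and data):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(returnType=DoubleType())
def predict_one(f1, f2, f3):
    # the model is captured in the closure and predict runs once per row
    return float(model.predict([[f1, f2, f3]])[0])

df_slow = df.withColumn("prediction", predict_one("f1", "f2", "f3"))
```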

The partition-based approach works fast because it guarantees there is no shuffling, and for each partition PySpark has to read the model and call predict only once. So the flow would be (a full sketch follows the list):

  • convert the DataFrame to an RDD
  • broadcast the model to the nodes so it is accessible to the workers
  • write a function which takes an iterator (yielding all rows within a partition) as an argument
  • iterate through the rows and build a feature matrix from them (column order matters)
  • call .predict only once
  • return the predictions
  • transform the RDD back to a DataFrame if needed
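
Putting it together, a minimal sketch could look like this. Here `spark`, the DataFrame `df` and the fitted sklearn `model` are assumed to exist already, and `f1`/`f2`/`f3` stand in for your actual feature columns:

```python
from pyspark.sql import Row

# broadcast the fitted model once so every worker can use the same copy
bc_model = spark.sparkContext.broadcast(model)

# column order must match the order the model was trained on
FEATURES = ["f1", "f2", "f3"]

def predict_partition(rows):
    rows = list(rows)
    if not rows:
        return iter([])
    # build one feature matrix for the whole partition
    X = [[row[c] for c in FEATURES] for row in rows]
    # call .predict only once per partition
    preds = bc_model.value.predict(X)
    # attach a prediction to each original row (float(...) assumes numeric output)
    return iter(
        Row(**row.asDict(), prediction=float(p)) for row, p in zip(rows, preds)
    )

# convert the DataFrame to an RDD, apply the function per partition,
# and go back to a DataFrame
predictions_df = df.rdd.mapPartitions(predict_partition).toDF()
```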