How to speed up nested cross validation in python?

Dask-ML has scalable implementations of GridSearchCV and RandomizedSearchCV that are, I believe, drop-in replacements for the Scikit-Learn versions. They were developed alongside the Scikit-Learn developers.

  • https://ml.dask.org/hyper-parameter-search.html

They can be faster for two reasons:

  • They avoid repeating shared work between different stages of a Pipeline
  • They can scale out to a cluster anywhere you can deploy Dask (which is easy on most cluster infrastructure)
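A minimal sketch of what the swap looks like (assuming dask-ml is installed; the pipeline, grid, and data below are illustrative, not your actual setup):

    # Illustrative sketch: swap sklearn's GridSearchCV for dask-ml's.
    from sklearn.datasets import make_classification
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from dask_ml.model_selection import GridSearchCV  # instead of sklearn.model_selection

    X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

    pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
    param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1]}

    # Shared pipeline stages (here the scaler) are fitted once and reused
    # across parameter candidates instead of being recomputed for each one.
    search = GridSearchCV(pipe, param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_)

If you have a cluster, creating a dask.distributed Client before calling fit should be enough to spread the work across it.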

Two things:

  1. Instead of GridSearch, try using HyperOpt - it's a Python library for serial and parallel optimization over search spaces (a minimal sketch is shown at the end of this answer).

  2. I would reduce the dimensionality by using UMAP or PCA. Probably UMAP is the better choice.

After you apply SMOTE:

    import umap

    # min_dist and neighbours hold whatever UMAP settings you want to try,
    # e.g. min_dist=0.1 and neighbours=15.
    dim_reduced = umap.UMAP(
        min_dist=min_dist,
        n_neighbors=neighbours,
        random_state=1234,
    ).fit_transform(smote_output)

And then you can use dim_reduced for the train test split.
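A hedged sketch of that split (the label array name y_smote is assumed; use whatever variable holds the SMOTE-resampled targets):

    from sklearn.model_selection import train_test_split

    # y_smote is assumed to be the target vector that matches smote_output row-for-row.
    X_train, X_test, y_train, y_test = train_test_split(
        dim_reduced, y_smote, test_size=0.2, random_state=1234
    )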

Reducing the dimensionality helps remove noise from the data, and instead of dealing with 25 features you'll bring them down to 2 (UMAP's default) or to however many components you choose (using PCA), which should have a significant impact on how long the nested CV takes to run.
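Coming back to the first point, here is a minimal HyperOpt sketch (the objective, search space, and RandomForestClassifier are illustrative stand-ins for whatever model and grid you are actually tuning):

    from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

    # Search space: hyperopt samples from these instead of exhaustively gridding them.
    space = {
        "n_estimators": hp.choice("n_estimators", [100, 200, 400]),
        "max_depth": hp.choice("max_depth", [3, 5, 10, None]),
    }

    def objective(params):
        clf = RandomForestClassifier(random_state=0, **params)
        score = cross_val_score(clf, X, y, cv=3, scoring="f1").mean()
        return {"loss": -score, "status": STATUS_OK}  # hyperopt minimizes the loss

    trials = Trials()
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25, trials=trials)
    print(best)

Because the optimizer concentrates on promising regions of the space, you typically need far fewer evaluations than an exhaustive grid search.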


There is an easy win in your situation and that is .... start using parallel processing :). Dask will help if you have a cluster (it will also work on a single machine, but the improvement over sklearn's default scheduling is not significant there), but if you plan to run on a single machine (with several cores/threads and "enough" memory) you can run the nested CV in parallel. The only trick is that sklearn will not allow you to run the outer CV loop in multiple processes; it will, however, allow you to run the inner loop in multiple threads.

At the moment you have n_jobs=None in the outer CV loop (that's the default in cross_val_score), which means n_jobs=1, and that's the only option you can use with sklearn for the outer loop of a nested CV.

However, you can get an easy gain by setting n_jobs=some_reasonable_number in every GridSearchCV you use. some_reasonable_number does not have to be -1 (but -1 is a good starting point). Some algorithms either plateau at n_jobs=n_cores rather than n_threads (xgboost, for example) or already have built-in multiprocessing (RandomForestClassifier, for example), and there can be clashes if you spawn too many processes. A minimal sketch of where to put n_jobs is below.
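The data, estimator, and grid here are illustrative; the point is parallelizing the inner search while the outer cross_val_score stays on the default single-process setting:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    param_grid = {"max_depth": [3, 5, 10], "n_estimators": [100, 300]}

    # Parallelize the inner hyper-parameter search ...
    inner_search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        cv=inner_cv,
        n_jobs=4,  # some_reasonable_number; -1 would use every core
    )

    # ... and leave the outer loop on the default n_jobs=None.
    scores = cross_val_score(inner_search, X, y, cv=outer_cv)
    print(scores.mean())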