GridSearchCV extremely slow on small dataset in scikit-learn

As noted already, for SVM-based classifiers ( i.e. as y == np.int* ) preprocessing is a must, otherwise the ML-estimator's prediction capability gets lost right away, as the skewed features' influence distorts the decision function.

As for the objected processing times:

  • try to get a better view of what your AI/ML-Model's Overfit/Generalisation [C,gamma] landscape looks like
  • try to add verbosity into the initial AI/ML-process tuning
  • try to add n_jobs into the number crunching
  • try to add a Grid Computing move into your computation approach, if scale so requires

The latter two are as simple as passing the respective parameters into the constructor call:

import sklearn.model_selection as aML_GS    # note: older scikit-learn releases kept this in sklearn.grid_search

aGrid = aML_GS.GridSearchCV( aClassifierOBJECT, param_grid = aGrid_of_parameters, cv = cv, n_jobs = n_JobsOnMultiCpuCores, verbose = 5 )

Sometimes GridSearchCV() can indeed take a huge amount of CPU-time / CPU-pool-of-resources, even after all of the above-mentioned tips are used.

So, keep calm and do not panic, if you are sure the Feature-Engineering, data-sanity and FeatureDOMAIN preprocessing were done correctly. An example of such a long-running, verbose [C,gamma] landscape scan:

[GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.761619 -62.7min
[GridSearchCV] C=16777216.0, gamma=0.5 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.792793 -64.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.793103 -116.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.794603 -205.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.771772 -200.9min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.713643 -446.0min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.743628 -184.6min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.761261 -281.2min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=4.0, score=0.670165 -138.7min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.760120 -97.3min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.732733 -66.3min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.755622 -13.6min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.772114 - 4.6min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.717718 -14.7min
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.763118 - 1.3min
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.746627 -  25.4s
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.738739 -  44.9s
[Parallel(n_jobs=1)]: Done 2700 out of 2700 | elapsed: 5670.8min finished

As to the remark above about "... a regular svm.SVC().fit", kindly notice that it uses the default [C,gamma] values and thus has no relevance to the behaviour of your Model / ProblemDOMAIN.
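If in doubt what those defaults actually are, a quick check ( a minimal sketch; the gamma default is 'scale' on scikit-learn >= 0.22, 'auto' on older releases ):

from sklearn.svm import SVC

aDefaultCLASSIFIER = SVC()               # no [C,gamma] tuning at all
print( aDefaultCLASSIFIER.C )            # 1.0
print( aDefaultCLASSIFIER.gamma )        # 'scale' ( 'auto' on scikit-learn < 0.22 )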

Re: Update

Oh yes, indeed, normalisation/scaling of the SVM-inputs is a mandatory task for this AI/ML tool. scikit-learn has good instrumentation to produce and re-use aScalerOBJECT for both the a-priori scaling ( before aDataSET goes into .fit() ) and the ex-post, ad-hoc scaling, once you need to re-scale a new example and send it to the predictor to answer its magic via a request to anSvmCLASSIFIER.predict( aScalerOBJECT.transform( aNewExampleX ) )

( Yes, aNewExampleX may be a matrix, so asking for a "vectorised" processing of several answers )
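A minimal sketch of that workflow, assuming a StandardScaler and synthetic data ( the names aScalerOBJECT, anSvmCLASSIFIER, aNewExampleX are just the illustrative ones used above ):

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

rng     = np.random.RandomState( 0 )
X_train = rng.randn( 100, 4 ) * [ 1., 10., 100., 1000. ]   # deliberately skewed feature scales
y_train = ( X_train[:, 0] > 0 ).astype( int )

aScalerOBJECT = StandardScaler().fit( X_train )            # fit the scaler on the training data only

anSvmCLASSIFIER = SVC( kernel = 'rbf' )                    # default hyper-parameters, just for the sketch
anSvmCLASSIFIER.fit( aScalerOBJECT.transform( X_train ),   # a-priori scaling, before .fit()
                     y_train )

aNewExampleX = rng.randn( 3, 4 ) * [ 1., 10., 100., 1000. ]                         # a matrix, i.e. several examples at once
y_answer     = anSvmCLASSIFIER.predict( aScalerOBJECT.transform( aNewExampleX ) )   # ex-post, ad-hoc scaling

Persisting aScalerOBJECT next to the trained classifier ( e.g. via joblib.dump() ) keeps the very same transform available whenever the predictor is asked again.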

Performance relief of the O( M^2 · N^1 ) computational complexity

In contrast to the guess posted below, that the Problem-"width", measured as N == the number of SVM-Features in matrix X, is to be blamed for the overall computing time, the SVM classifier with an rbf-kernel is by design an O( M^2 · N^1 ) problem.

So, there is a quadratic dependence on the overall number of observations ( examples ) moved into the Training ( .fit() ) or CrossValidation phase, and one can hardly state that the supervised-learning classifier will get any better predictive power if one "reduces" the ( only linear ) "width" of the features, which per se bear the inputs into the constructed predictive power of the SVM-classifier, do they not?
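That quadratic growth in M is easy to observe empirically. A minimal sketch on synthetic data ( the sizes are illustrative only, absolute timings will differ per machine ):

import time
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState( 0 )
N   = 10                                         # a fixed Problem-"width" ( number of features )

for M in ( 1000, 2000, 4000 ):                   # a growing number of observations
    X  = rng.randn( M, N )
    y  = ( X[:, 0] + 0.3 * rng.randn( M ) > 0 ).astype( int )
    t0 = time.perf_counter()
    SVC( kernel = 'rbf' ).fit( X, y )
    print( "M = %5d ... .fit() took %6.2f [s]" % ( M, time.perf_counter() - t0 ) )

Doubling M ought to roughly quadruple the .fit() time, while doubling N would only roughly double it.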


Support Vector Machines are sensitive to scaling. It is most likely that your SVC is taking a long time to build each individual model. GridSearch is basically a brute-force method which runs the base model with different combinations of parameters. So, if your GridSearchCV is taking a long time to finish, it is most likely due to one of the following ( a fit-count sketch follows the list ):

  1. A large number of parameter combinations ( which is not the case here )
  2. An individual model that takes a long time to fit.
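A minimal sketch of that brute-force arithmetic, using a hypothetical [C,gamma] grid sized so that the total matches the 2700 fits reported in the verbose log above ( 900 combinations x 3 CV folds ):

from sklearn.model_selection import ParameterGrid

aGrid_of_parameters = { "C":     [ 2.**e for e in range(  -5, 25 ) ],   # 30 hypothetical values
                        "gamma": [ 2.**e for e in range( -15, 15 ) ] }  # 30 hypothetical values

n_combinations = len( ParameterGrid( aGrid_of_parameters ) )            # 900
n_CV_folds     = 3                                                      # as in a cv = 3 setup
print( "GridSearchCV will run %d individual .fit() calls" % ( n_combinations * n_CV_folds ) )   # 2700

Each of those calls pays the full O( M^2 · N^1 ) training cost, so a slow individual model dominates the total wall-clock time.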