How to apply StandardScaler in Pipeline in scikit-learn (sklearn)?

Yes, this is the right way to do this but there is a small mistake in your code. Let me break this down for you.

When you use StandardScaler as a step inside a Pipeline, scikit-learn fits and applies the scaler for you on every cross-validation split, so you never have to scale the data by hand.


What happens can be described as follows:

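  • Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV. With cv=3 this produces three different splits, and steps 1 to 5 are repeated for each split and each parameter combination.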
  • Step 1: the scaler is fitted on the TRAINING data
  • Step 2: the scaler transforms TRAINING data
  • Step 3: the models are fitted/trained using the transformed TRAINING data
  • Step 4: the scaler is used to transform the TEST data
  • Step 5: the trained models predict using the transformed TEST data
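
To make the steps above concrete, here is a rough sketch that performs them by hand for each fold and compares the result with what a Pipeline gives through cross_val_score. The synthetic dataset from make_classification and the names X_demo, y_demo, and pipe_demo are my own assumptions for illustration; they are not from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration (an assumption, not the asker's data).
X_demo, y_demo = make_classification(n_samples=150, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=3)
manual_scores = []
for train_idx, test_idx in cv.split(X_demo, y_demo):                 # Step 0: split
    scaler = StandardScaler().fit(X_demo[train_idx])                 # Step 1: fit scaler on TRAINING data
    X_tr = scaler.transform(X_demo[train_idx])                       # Step 2: transform TRAINING data
    model = SVC(kernel="linear", C=1).fit(X_tr, y_demo[train_idx])   # Step 3: train the model
    X_te = scaler.transform(X_demo[test_idx])                        # Step 4: transform TEST data
    manual_scores.append(model.score(X_te, y_demo[test_idx]))        # Step 5: predict and score

# The same procedure, with the Pipeline handling steps 1 to 5 internally.
pipe_demo = Pipeline([("scale", StandardScaler()),
                      ("clf", SVC(kernel="linear", C=1))])
pipeline_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv)

print(manual_scores)
print(pipeline_scores)
```

The two lists of per-fold accuracies should match, which is the point: the scaler only ever sees the training part of each split, so no information from the test part leaks into the scaling.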

Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV automatically splits the data into training and testing data (this happens internally).


Use something like this:

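import numpy as np  # needed for np.logspace below
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

# X and y are your full feature matrix and labels (from your question).
# The values given here are defaults; the grid below overrides them.
pipe = Pipeline([
        ('scale', StandardScaler()),
        ('reduce_dims', PCA(n_components=4)),
        ('clf', SVC(kernel='linear', C=1))])

# Parameter names use the pattern <step name>__<parameter name>.
# Note: n_components=8 requires X to have at least 8 features.
param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)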

Once you run this code (when you call grid.fit(X, y)), the outcome of the grid search is stored as attributes on the grid object itself (grid.fit() returns that same object). The best_score_ attribute holds the best mean cross-validated score observed during the search, best_params_ describes the combination of parameters that achieved it, and cv_results_ holds the scores for every combination that was tried.
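
As a small illustration of reading those attributes, here is a sketch on a deliberately tiny grid. The synthetic data and the names X_demo, y_demo, pipe_demo, param_grid_demo, and grid_demo are assumptions of mine so the snippet runs on its own; swap in your own X, y, and grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration (an assumption, not the asker's data).
X_demo, y_demo = make_classification(n_samples=150, n_features=10, random_state=0)

pipe_demo = Pipeline([("scale", StandardScaler()),
                      ("reduce_dims", PCA(n_components=4)),
                      ("clf", SVC(kernel="linear", C=1))])

# A small grid (2 x 2 = 4 combinations) to keep the example fast.
param_grid_demo = {"reduce_dims__n_components": [4, 6],
                   "clf__C": [0.1, 1.0]}

grid_demo = GridSearchCV(pipe_demo, param_grid=param_grid_demo, cv=3, scoring="accuracy")
fitted = grid_demo.fit(X_demo, y_demo)   # fit() returns the GridSearchCV object itself

print(grid_demo.best_params_)            # best parameter combination
print(grid_demo.best_score_)             # its mean cross-validated accuracy

# best_score_ is the mean test score of the row at best_index_ in cv_results_.
best_row = grid_demo.best_index_
print(grid_demo.cv_results_["mean_test_score"][best_row])

# With the default refit=True, best_estimator_ is the whole pipeline
# refitted on all of X_demo using the best parameters.
best_pipe = grid_demo.best_estimator_
print(best_pipe.named_steps["scale"].mean_.shape)
```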


IMPORTANT EDIT 1: if you want to hold out a validation set from the original dataset (data that the grid search never sees), use this:

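from sklearn.model_selection import train_test_split

# Parentheses let the assignment span two lines without a syntax error.
X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = (
    train_test_split(X, y, test_size=0.15, random_state=1))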

Then use:

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
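
After the search, the held-out part can be used for a final check of the selected model. The following is only a sketch under assumptions: the synthetic data and the names X_demo, y_demo, X_gs, X_val, and grid_demo are hypothetical stand-ins for your own X, y, and grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration (an assumption, not the asker's data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out 15% of the data; the grid search never sees it.
X_gs, X_val, y_gs, y_val = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=1)

pipe_demo = Pipeline([("scale", StandardScaler()),
                      ("clf", SVC(kernel="linear", C=1))])
grid_demo = GridSearchCV(pipe_demo, param_grid={"clf__C": [0.1, 1.0, 10.0]},
                         cv=3, scoring="accuracy")
grid_demo.fit(X_gs, y_gs)

# score() and predict() delegate to the refitted best pipeline, so the
# validation data are scaled with statistics learned from X_gs only.
val_accuracy = grid_demo.score(X_val, y_val)
val_predictions = grid_demo.predict(X_val)
print(val_accuracy)
```

Because the scaler lives inside the pipeline, you pass the raw X_val here; scaling it yourself first would apply the transformation twice.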