Grid search parameters and cross-validated data set in a KNN classifier in scikit-learn

  1. Yes, you can run CV on your entire dataset and it is viable, but I still suggest you split your data into at least two sets: one for CV and one for testing (see the sketch after this list).

  2. According to the documentation, the .score function returns a single float: the score of the best estimator (the best-scoring estimator you get from fitting your GridSearchCV) on the given X, y.

  3. If you found that the best parameter is 14, then yes, you can go on using it in your model; but if you gave it more parameters, you should set all of them. (I say that because you haven't shown your parameter list.) And yes, it is legitimate to check your CV score once again, just in case, to confirm the model is as good as it should be.
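
For illustration, here is a minimal sketch of that workflow, assuming X and y are your features and labels (the n_neighbors grid here is just a placeholder):

    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # Hold out a test set; tune only on the CV portion
    X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    params = {'n_neighbors': list(range(1, 21))}  # placeholder grid
    clf = GridSearchCV(KNeighborsClassifier(), param_grid=params, cv=5)
    clf.fit(X_cv, y_cv)

    print(clf.best_params_)           # e.g. {'n_neighbors': 14}
    print(clf.score(X_test, y_test))  # single float: best estimator's score on held-out data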

Hope that makes things clearer :)


If the dataset is small, you may not have the luxury of a train/test split. People often estimate the predictive power of the model solely based on cross-validation. In your code above, GridSearchCV performs 5-fold cross-validation when you fit the model (clf.fit(X, y)) by repeatedly splitting your training set into an inner training set (80%) and a validation set (20%).

You can access the model performance metrics, including validation scores, through clf.cv_results_. The metric you want to look at is mean_test_score (in your case, you should get one score for each n_neighbors value). You may also want to turn on return_train_score to get mean_train_score and a sense of whether the model is overfitting. See the sample code below for the model setup (note that KNN is a non-parametric ML model that is sensitive to the scale of the features, so people often standardize features using StandardScaler):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import GridSearchCV

    pipe = Pipeline([
        ('sc', StandardScaler()),                         # standardize features first
        ('knn', KNeighborsClassifier(algorithm='brute'))
    ])
    params = {
        'knn__n_neighbors': [3, 5, 7, 9, 11]  # usually odd numbers
    }
    clf = GridSearchCV(estimator=pipe,
                       param_grid=params,
                       cv=5,
                       return_train_score=True)  # turn on CV train scores
    clf.fit(X, y)
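
For reference, here is a minimal sketch of how you might inspect those scores after fitting; mean_test_score and mean_train_score are standard cv_results_ keys, and the pandas DataFrame is just one convenient way to view them:

    import pandas as pd

    results = pd.DataFrame(clf.cv_results_)
    # One row per n_neighbors value; a large train/validation gap suggests overfitting
    print(results[['param_knn__n_neighbors', 'mean_test_score', 'mean_train_score']])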

A quick tip: the square root of the number of samples is usually a good choice for n_neighbors, so make sure you include that value in your GridSearchCV. Hope this is helpful.
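
If you want that rule of thumb in code, here is a small sketch (assuming X is your feature matrix; forcing k odd is just a common tie-breaking convention):

    import math

    k = int(math.sqrt(len(X)))
    if k % 2 == 0:
        k += 1  # keep k odd to avoid ties in binary classification
    params = {'knn__n_neighbors': sorted({3, 5, 7, 9, 11, k})}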