Why does CalibratedClassifierCV underperform a direct classifier?

There are a couple of issues with the isotonic regression method (and its implementation in sklearn) that make it a suboptimal choice for calibration.

Specifically:

1) It fits a piecewise constant function, rather than a smoothly varying curve, as the calibration function (a short sketch after this list illustrates the step-function behavior).

2) The cross-validation step averages the predictions of the models/calibrations it gets from each fold. However, each of those models is still fit, and each calibration still estimated, on only its respective portion of the data.
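
To make point (1) concrete, here is a minimal sketch on synthetic scores (not the data from the question) showing that sklearn's IsotonicRegression yields a piecewise constant map:

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
scores = np.sort(rng.uniform(size=200))    # raw (miscalibrated) model scores
outcomes = rng.binomial(1, scores ** 2)    # binary outcomes

iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(scores, outcomes)

# The fitted calibration map takes only a limited number of distinct
# values, i.e. it is a step function rather than a smooth curve.
grid = np.linspace(0, 1, 1001)
print(len(np.unique(iso.predict(grid))))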

Often, a better choice is the SplineCalibratedClassifierCV class in the ML-insights package (disclaimer: I am an author of that package). The GitHub repo for the package is here.

It has the following advantages:

1) It fits a cubic smoothing spline rather than a piecewise constant function.

2) It uses the full set of (cross-validated) out-of-fold predictions to fit the calibration function, and then refits the base model on the full data set. Thus, both the calibration function and the base model are effectively trained on all of the data (a sketch follows below).
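
For intuition on point (2), here is a rough sketch of that scheme for a binary problem. It is not the ML-insights implementation itself, and it substitutes an isotonic calibrator where the package would fit a smoothing spline; the point is only where the data goes:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=2000, random_state=0)
base = GradientBoostingClassifier(n_estimators=100)

# Every training row gets an out-of-fold predicted probability, so the
# calibrator is fit on the entire training set, not on a single fold.
oof_probs = cross_val_predict(base, X, y, cv=3, method='predict_proba')[:, 1]
calibrator = IsotonicRegression(out_of_bounds='clip')
calibrator.fit(oof_probs, y)

# The base model itself is then refit on the full data set.
base.fit(X, y)

def predict_calibrated(X_new):
    return calibrator.predict(base.predict_proba(X_new)[:, 1])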

You can see examples of comparisons here and here.

From the first example, here is a graph which shows the binned probabilities of a training set (red dots), an independent test set (green + signs), and the calibrations computed by the ML-insights spline method (blue line) and the isotonic-sklearn method (gray dots/line).

Spline vs Isotonic Calibration

I modified your code to compare the methods (and increased the number of examples). It demonstrates that the spline approach typically performs better (as do the examples linked above).

Here is the code and the results:

Code (you will have to pip install ml_insights first):

import numpy as np
from sklearn.datasets import make_classification
from sklearn import ensemble
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedShuffleSplit
import ml_insights as mli

X, y = make_classification(n_samples=10000,
                           n_features=100,
                           n_informative=30,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=9,
                           random_state=0,
                           shuffle=False)

skf = StratifiedShuffleSplit(n_splits=5)

for train, test in skf.split(X, y):

    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]

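    # ML-insights spline calibration, 3 folds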
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)    
    clf_cv_mli = mli.SplineCalibratedClassifierCV(clf, cv=3)
    clf_cv_mli.fit(X_train, y_train)
    probas_cv_mli = clf_cv_mli.predict_proba(X_test)
    cv_score_mli = log_loss(y_test, probas_cv_mli)

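    # sklearn isotonic calibration, 3 folds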
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)    
    clf_cv = CalibratedClassifierCV(clf, cv=3, method='isotonic')
    clf_cv.fit(X_train, y_train)
    probas_cv = clf_cv.predict_proba(X_test)
    cv_score = log_loss(y_test, probas_cv)

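    # uncalibrated base classifier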
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    probas = clf.predict_proba(X_test)
    clf_score = log_loss(y_test, probas) 

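    # ML-insights spline calibration, 10 folds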
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)    
    clf_cv_mli = mli.SplineCalibratedClassifierCV(clf, cv=10)
    clf_cv_mli.fit(X_train, y_train)
    probas_cv_mli = clf_cv_mli.predict_proba(X_test)
    cv_score_mli_10 = log_loss(y_test, probas_cv_mli)

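    # sklearn isotonic calibration, 10 folds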
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)    
    clf_cv = CalibratedClassifierCV(clf, cv=10, method='isotonic')
    clf_cv.fit(X_train, y_train)
    probas_cv = clf_cv.predict_proba(X_test)
    cv_score_10 = log_loss(y_test, probas_cv)

    print('\nuncalibrated score: {}'.format(clf_score))
    print('\ncalibrated score isotonic-sklearn (3-fold): {}'.format(cv_score))
    print('calibrated score mli (3-fold): {}'.format(cv_score_mli))
    print('\ncalibrated score isotonic-sklearn (10-fold): {}'.format(cv_score_10))
    print('calibrated score mli (10-fold): {}\n'.format(cv_score_mli_10))

Results

uncalibrated score: 1.4475396740876696

calibrated score isotonic-sklearn (3-fold): 1.465140552847886
calibrated score mli (3-fold): 1.3651638065446683

calibrated score isotonic-sklearn (10-fold): 1.4158622673607426
calibrated score mli (10-fold): 1.3620771116522705


uncalibrated score: 1.5097320476479625

calibrated score isotonic-sklearn (3-fold): 1.5189534673089442
calibrated score mli (3-fold): 1.4386253950100405

calibrated score isotonic-sklearn (10-fold): 1.4976505139437257
calibrated score mli (10-fold): 1.4408912879989917


uncalibrated score: 1.4654527691892194

calibrated score isotonic-sklearn (3-fold): 1.493355643575107
calibrated score mli (3-fold): 1.388789694535648

calibrated score isotonic-sklearn (10-fold): 1.419760490609242
calibrated score mli (10-fold): 1.3830851694161692


uncalibrated score: 1.5163851866969407

calibrated score isotonic-sklearn (3-fold): 1.5532628847926322
calibrated score mli (3-fold): 1.459797287154743

calibrated score isotonic-sklearn (10-fold): 1.4748100659449732
calibrated score mli (10-fold): 1.4620173012979816


uncalibrated score: 1.4760935523959617

calibrated score isotonic-sklearn (3-fold): 1.469434735152088
calibrated score mli (3-fold): 1.402024502986732

calibrated score isotonic-sklearn (10-fold): 1.4702032019673137
calibrated score mli (10-fold): 1.3983943648572212

The probability calibration itself requires cross-validation, so CalibratedClassifierCV trains one calibrated classifier per fold (in this case using StratifiedKFold) and takes the mean of the predicted probabilities from those classifiers when you call predict_proba(). This could explain the effect; a rough sketch of the scheme follows.
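
As a rough sketch (simplified to the binary case; the actual implementation lives in sklearn.calibration), this is approximately what CalibratedClassifierCV(cv=3, method='isotonic') does: each base model sees only its fold's training portion, and predict_proba() averages the per-fold calibrated probabilities:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import StratifiedKFold

def fit_calibrated_ensemble(X_train, y_train, n_folds=3):
    members = []
    for fit_idx, cal_idx in StratifiedKFold(n_splits=n_folds).split(X_train, y_train):
        clf = GradientBoostingClassifier(n_estimators=100)
        clf.fit(X_train[fit_idx], y_train[fit_idx])   # reduced training set
        cal = IsotonicRegression(out_of_bounds='clip')
        cal.fit(clf.predict_proba(X_train[cal_idx])[:, 1], y_train[cal_idx])
        members.append((clf, cal))
    return members

def predict_proba_ensemble(members, X_new):
    # Mean of the per-fold calibrated probabilities of the positive class.
    return np.mean([cal.predict(clf.predict_proba(X_new)[:, 1])
                    for clf, cal in members], axis=0)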

My hypothesis is that if the training set is small relative to the number of features and classes, the reduced training set for each sub-classifier hurts performance, and the ensembling does not make up for it (or makes it worse). Also, the GradientBoostingClassifier might already provide fairly good probability estimates from the start, since its loss function is optimized for probability estimation.

If that's correct, ensembling classifiers the same way CalibratedClassifierCV does, but without calibration, should be worse than the single classifier. Also, the effect should disappear when using a larger number of folds for calibration.

To test that, I extended your script to increase the number of folds and include the ensembled classifier without calibration, and I was able to confirm my predictions. A 10-fold calibrated classifier always performed better than the single classifier, and the uncalibrated ensemble was significantly worse. In my run, the 3-fold calibrated classifier also did not really perform worse than the single classifier, so this might also be an unstable effect. These are the detailed results on the same dataset:

Log-loss results from cross-validation

This is the code from my experiment:

import numpy as np
from sklearn.datasets import make_classification
from sklearn import ensemble
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold

X, y = make_classification(n_samples=1000,
                           n_features=100,
                           n_informative=30,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=9,
                           random_state=0,
                           shuffle=False)

skf = StratifiedShuffleSplit(n_splits=5)

for train, test in skf.split(X, y):

    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]

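    # 3-fold isotonic calibration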
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)
    clf_cv = CalibratedClassifierCV(clf, cv=3, method='isotonic')
    clf_cv.fit(X_train, y_train)
    probas_cv = clf_cv.predict_proba(X_test)
    cv_score = log_loss(y_test, probas_cv)
    print('calibrated score (3-fold):', cv_score)


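    # 10-fold isotonic calibration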
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)
    clf_cv = CalibratedClassifierCV(clf, cv=10, method='isotonic')
    clf_cv.fit(X_train, y_train)
    probas_cv = clf_cv.predict_proba(X_test)
    cv_score = log_loss(y_test, probas_cv)
    print('calibrated score (10-fold):', cv_score)

    # Train 3 classifiers, each on 2/3 of the training set, and take the
    # average predicted probability (ensembling without calibration).
    skf2 = StratifiedKFold(n_splits=3)
    probas_list = []
    for sub_train, _sub_test in skf2.split(X_train, y_train):
        X_sub_train, y_sub_train = X_train[sub_train], y_train[sub_train]
        clf = ensemble.GradientBoostingClassifier(n_estimators=100)
        clf.fit(X_sub_train, y_sub_train)
        probas_list.append(clf.predict_proba(X_test))
    probas = np.mean(probas_list, axis=0)
    clf_ensemble_score = log_loss(y_test, probas)
    print('uncalibrated ensemble clf (3-fold) score:', clf_ensemble_score)

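    # single classifier trained on the full training set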
    clf = ensemble.GradientBoostingClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    probas = clf.predict_proba(X_test)
    score = log_loss(y_test, probas)
    print('direct clf score:', score)
    print()