sklearn: Hyperparameter tuning by gradient descent?

The calculation of the gradient is the least of the problems, at least in the age of advanced automatic-differentiation software. (Implementing this in a general way for all sklearn classifiers is, of course, not easy.)

And while there is work that uses this kind of idea, it is restricted to specific, well-formulated problems (e.g. SVM tuning), and it typically relies on a lot of assumptions, because:

Why is this not a good idea?

  • Hyper-param optimization is in general: non-smooth
    • GD really likes smooth functions, as a gradient of zero is not helpful
    • (Each hyper-parameter defined over a discrete set (e.g. the choice of l1 vs. l2 penalization) introduces non-smooth surfaces)
  • Hyper-param optimization is in general: non-convex
    • The whole convergence theory of GD assumes that the underlying problem is convex
      • Good case: you obtain some local minimum (which can be arbitrarily bad)
      • Worst case: GD does not even converge to a local minimum

I might add that your general problem is the worst kind of optimization problem one can consider, because it is:

  • non-smooth, non-convex
  • and even stochastic / noisy, since most underlying algorithms are heuristic approximations with some variance in the final output (and often even PRNG-based random behaviour).

The last part is the reason why the methods offered in sklearn are that simple (see the sketch after this list):

  • random-search:
    • if we can't infer anything because the problem is too hard, just try many instances and pick the best
  • grid-search:
    • let's assume there is some kind of smoothness
      • instead of random sampling, we sample according to our smoothness assumption
        • (and other assumptions, like: the parameter is probably big -> np.logspace to analyze more big numbers)
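A minimal sketch of those two built-in strategies, applied here to an SVC with illustrative parameter ranges (the estimator and the ranges are just assumptions for the example):

import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# grid-search: sample on a fixed grid chosen under a smoothness assumption
# (np.logspace covers several orders of magnitude of C)
grid = GridSearchCV(SVC(), {'C': np.logspace(-3, 3, 7)}, cv=5)
grid.fit(X, y)

# random-search: just draw many candidates and keep the best one
rand = RandomizedSearchCV(SVC(), {'C': loguniform(1e-3, 1e3)},
                          n_iter=20, cv=5, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)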

While there are a lot of Bayesian approaches, including available Python software like hyperopt and spearmint, many people think that random-search is the best method in general (which might be surprising, but emphasizes the problems mentioned above).
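For completeness, a minimal hyperopt sketch (the SVC objective and the search space are just assumptions for the example):

from hyperopt import fmin, tpe, hp
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def objective(C):
    # hyperopt minimizes, so negate the cross-validated accuracy
    return -cross_val_score(SVC(C=C), X, y, cv=5).mean()

# hp.loguniform takes log-space bounds: C is sampled from [exp(-3), exp(3)]
best = fmin(fn=objective, space=hp.loguniform('C', -3, 3),
            algo=tpe.suggest, max_evals=30)
print(best)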


Here are some papers describing gradient-based hyperparameter optimization (a toy sketch of the unrolled-training idea follows the abstracts):

  • Gradient-based hyperparameter optimization through reversible learning (2015):

We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.

  • Forward and reverse gradient-based hyperparameter optimization (2017):

We study two procedures (reverse-mode and forward-mode) for computing the gradient of the validation error with respect to the hyperparameters of any iterative learning algorithm such as stochastic gradient descent. These procedures mirror two methods of computing gradients for recurrent neural networks and have different trade-offs in terms of running time and space requirements. Our formulation of the reverse-mode procedure is linked to previous work by Maclaurin et al. [2015] but does not require reversible dynamics. The forward-mode procedure is suitable for real-time hyperparameter updates, which may significantly speed up hyperparameter optimization on large datasets.

  • Gradient descent: the ultimate optimizer (2019):

Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as the learning rate. There exist many techniques for automated hyperparameter optimization, but they typically introduce even more hyperparameters to control the hyperparameter optimization process. We propose to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As these towers of gradient-based optimizers grow, they become significantly less sensitive to the choice of top-level hyperparameters, hence decreasing the burden on the user to search for optimal values.
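To make the idea in these papers concrete, here is a toy sketch (not the authors' code) of a reverse-mode hypergradient: unroll a few steps of gradient descent on ridge regression and differentiate the validation loss with respect to the regularization strength using jax:

import jax
import jax.numpy as jnp

def val_loss_after_training(log_lam, X_tr, y_tr, X_val, y_val, steps=100, lr=0.1):
    lam = jnp.exp(log_lam)
    w = jnp.zeros(X_tr.shape[1])

    def train_loss(w):
        return jnp.mean((X_tr @ w - y_tr) ** 2) + lam * jnp.sum(w ** 2)

    for _ in range(steps):  # unrolled training dynamics
        w = w - lr * jax.grad(train_loss)(w)
    return jnp.mean((X_val @ w - y_val) ** 2)

# hypergradient: d(validation loss) / d(log lambda), via reverse-mode AD
hypergrad_fn = jax.grad(val_loss_after_training)

# toy data
key_X, key_noise = jax.random.split(jax.random.PRNGKey(0))
X = jax.random.normal(key_X, (100, 5))
y = X @ jnp.arange(1.0, 6.0) + 0.1 * jax.random.normal(key_noise, (100,))
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

# gradient descent on the hyperparameter itself
log_lam = jnp.array(0.0)
for _ in range(20):
    log_lam = log_lam - 0.5 * hypergrad_fn(log_lam, X_tr, y_tr, X_val, y_val)
print(jnp.exp(log_lam))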


For generalized linear models (e.g. logistic regression, ridge regression, Poisson regression), you can efficiently tune many regularization hyperparameters using exact derivatives and approximate leave-one-out cross-validation.

But don't stop at just the gradient: compute the full Hessian as well and use a second-order optimizer -- it's both more efficient and more robust.
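As a rough sketch, with the gradient and Hessian of the approximate leave-one-out objective with respect to the regularization strength \alpha = 1/C, each iteration is a generic Newton step toward a stationary point (not necessarily the exact update used by the tool below):

\alpha_{t+1} = \alpha_t - \left[\frac{\partial^2 \mathrm{ALO}}{\partial \alpha^2}(\alpha_t)\right]^{-1} \frac{\partial \mathrm{ALO}}{\partial \alpha}(\alpha_t)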

sklearn doesn't currently have this functionality, but there are other tools available that can do it.

For example, here's how you can use the Python package bbai to fit the hyperparameter of ridge-regularized logistic regression so as to maximize the approximate leave-one-out cross-validation log likelihood on the Wisconsin Breast Cancer Data Set.

Load the data set

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']

Fit the model

import bbai.glm

model = bbai.glm.LogisticRegression()
# Note: it automatically fits the C parameter to minimize the error on
# the approximate leave-one-out cross-validation.
model.fit(X, y)

Because it uses both the gradient and the Hessian with efficient exact formulas (no automatic differentiation), it can dial into an exact hyperparameter quickly with only a few evaluations.

YMMV, but when I compare it to sklearn's LogisticRegressionCV with default parameters, it runs in a fraction of the time.

import time

t1 = time.time()
model = bbai.glm.LogisticRegression()
model.fit(X, y)
t2 = time.time()
print('***** approximate leave-one-out optimization')
print('C = ', model.C_)
print('time = ', (t2 - t1))

from sklearn.linear_model import LogisticRegressionCV
print('***** sklearn.LogisticRegressionCV')
t1 = time.time()
model = LogisticRegressionCV(scoring='neg_log_loss', random_state=0)
model.fit(X, y)
t2 = time.time()
print('C = ', model.C_[0])
print('time = ', (t2 - t1))

Prints

***** approximate leave-one-out optimization
C =  0.6655139682151275
time =  0.03996014595031738
***** sklearn.LogisticRegressionCV
C =  0.3593813663804626
time =  0.2602980136871338

How it works

Approximate leave-one-out cross-validation (ALOOCV) is a close approximation to leave-one-out cross-validation that's much more efficient to evaluate for generalized linear models.

It first fits the regularized model, then uses a single step of Newton's algorithm to approximate what the model weights would be if a single data point were left out. For ridge-regularized logistic regression (the example here) with labels y_i in {-1, +1}, the regularized cost function can be written as

L(w) = -\sum_{i=1}^{n} \log \sigma(y_i x_i^\top w) + \frac{\alpha}{2} \lVert w \rVert^2, \qquad \alpha = 1/C

Then the ALOOCV log likelihood can be computed as

\mathrm{ALO} = \sum_{i=1}^{n} \log \sigma(y_i \tilde{u}_i), \qquad \tilde{u}_i = u_i - \frac{y_i (1 - p_i)\, h_i}{1 - p_i (1 - p_i)\, h_i}

where

u_i = x_i^\top \hat{w}, \qquad p_i = \sigma(y_i u_i), \qquad h_i = x_i^\top H^{-1} x_i

(Note: H represents the Hessian of the cost function at the optimal weights, and \sigma is the logistic sigmoid; these expressions match the compute_alo code below.)

For more background on ALOOCV, you can check out this guide.

It's also possible to compute exact derivatives of ALOOCV with respect to the hyperparameters, which makes it efficient to optimize.

I won't put the derivative formulas here as they are quite involved, but see the paper Optimizing Approximate Leave-one-out Cross-validation.

If we plot ALOOCV against leave-one-out cross-validation for the example data set, you can see that it tracks LOOCV very closely and that the ALOOCV optimum is nearly the same as the LOOCV optimum.

Compute Leave-one-out Cross-validation

import numpy as np

def compute_loocv(X, y, C):
    """Exact leave-one-out CV log likelihood: refit the model n times."""
    model = bbai.glm.LogisticRegression(C=C)
    n = len(y)
    loo_likelihoods = []
    for i in range(n):
        train_indexes = [i_p for i_p in range(n) if i_p != i]
        test_indexes = [i]
        X_train, X_test = X[train_indexes], X[test_indexes]
        y_train, y_test = y[train_indexes], y[test_indexes]
        model.fit(X_train, y_train)
        # probability the refit model assigns to the held-out label
        pred = model.predict_proba(X_test)[0]
        loo_likelihoods.append(pred[y_test[0]])
    return sum(np.log(loo_likelihoods))

Compute Approximate Leave-one-out Cross-validation

import scipy.linalg
import scipy.special

def fit_logistic_regression(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    model.fit(X, y)
    return np.array(list(model.coef_[0]) + list(model.intercept_))

def compute_hessian(p_vector, X, alpha):
    # H = X^T diag(p(1-p)) X + alpha*I (the intercept column is not regularized)
    n, k = X.shape
    a_vector = np.sqrt((1 - p_vector)*p_vector)
    R = scipy.linalg.qr(a_vector.reshape((n, 1))*X, mode='r')[0]
    H = np.dot(R.T, R)
    for i in range(k-1):
        H[i, i] += alpha
    return H

def compute_alo(X, y, C):
    alpha = 1.0 / C
    w = fit_logistic_regression(X, y, C)
    # append a column of ones for the intercept
    X = np.hstack((X, np.ones((X.shape[0], 1))))
    n = X.shape[0]
    y = 2*y - 1  # relabel to {-1, +1}
    u_vector = np.dot(X, w)
    p_vector = scipy.special.expit(u_vector*y)
    H = compute_hessian(p_vector, X, alpha)
    # h_i = x_i^T H^{-1} x_i, computed via a Cholesky factorization of H
    L = np.linalg.cholesky(H)
    T = scipy.linalg.solve_triangular(L, X.T, lower=True)
    h_vector = np.array([np.dot(ti, ti) for ti in T.T])
    # single Newton-step approximation of the leave-one-out linear predictor
    loo_u_vector = u_vector - \
        y * (1 - p_vector)*h_vector / (1 - p_vector*(1 - p_vector)*h_vector)
    loo_likelihoods = scipy.special.expit(y*loo_u_vector)
    return sum(np.log(loo_likelihoods))
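As a sanity check on the claim that ALOOCV is cheap to optimize, you can also maximize compute_alo numerically over C (a hedged sketch using a simple bounded search; bbai itself uses exact first- and second-order derivatives rather than this):

from scipy.optimize import minimize_scalar

res = minimize_scalar(
    lambda log_C: -compute_alo(X, y, np.exp(log_C)),  # maximize ALOOCV
    bounds=(np.log(0.01), np.log(100.0)), method='bounded')
print('C_opt ~', np.exp(res.x))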

Plot out the results (along with the ALOOCV optimum)

import matplotlib.pyplot as plt

Cs = np.arange(0.1, 2.0, 0.1)
loocvs = [compute_loocv(X, y, C) for C in Cs]
alos = [compute_alo(X, y, C) for C in Cs]

fig, ax = plt.subplots()
ax.plot(Cs, loocvs, label='LOOCV', marker='o')
ax.plot(Cs, alos, label='ALO', marker='x')
# model.C_ is the ALOOCV optimum from the bbai.glm.LogisticRegression fit above
ax.axvline(model.C_, color='tab:green', label='C_opt')
ax.set_xlabel('C')
ax.set_ylabel('Log-Likelihood')
ax.set_title("Breast Cancer Dataset")
ax.legend()
plt.show()

Displays

[Figure: LOOCV and ALO log-likelihood curves over C for the Breast Cancer data set; the two curves nearly coincide, with a vertical line marking the ALOOCV optimum C_opt.]