How does the predict_proba() function in LightGBM work internally?

LightGBM, like all gradient boosting methods for classification, essentially combines decision trees and logistic regression. We start with the same logistic function representing the probabilities (a.k.a. the sigmoid; softmax is its multiclass generalization):

P(y = 1 | X) = 1/(1 + exp(-Xw))

The interesting twist is that the feature matrix X is built from the terminal nodes (leaves) of a decision tree ensemble. These are all then weighted by w, a parameter vector that must be learned. The mechanism used to learn the weights depends on the precise learning algorithm used. Similarly, the construction of X also depends on the algorithm. LightGBM, for example, introduced two novel features which gave it its performance improvements over XGBoost: "Gradient-based One-Side Sampling" and "Exclusive Feature Bundling". Generally though, each row collects the terminal leaves reached by one sample, and the columns represent the terminal leaves.
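To make the "leaves as features" picture concrete, here is a minimal sketch in plain Python. The two stumps, their thresholds, and their leaf values are made up for illustration; real LightGBM trees are deeper and are evaluated in C++. The sketch uses the standard logistic form 1/(1 + exp(-z)).

```python
import math

# Two hypothetical depth-1 trees (stumps); thresholds and leaf values are made up.
# Each tree routes a sample to exactly one leaf, and each leaf carries a learned value.
TREES = [
    {"feature": 0, "threshold": 0.5, "leaf_values": [-0.8, 0.6]},
    {"feature": 1, "threshold": 2.0, "leaf_values": [-0.3, 0.9]},
]

def leaf_index(tree, sample):
    """Return 0 for the left leaf, 1 for the right leaf."""
    return 0 if sample[tree["feature"]] <= tree["threshold"] else 1

def leaf_indicator_row(sample):
    """One-hot encode the leaf reached in every tree (one row of X)."""
    row = []
    for tree in TREES:
        hit = leaf_index(tree, sample)
        row.extend(1.0 if j == hit else 0.0 for j in range(len(tree["leaf_values"])))
    return row

# The weight vector w is the concatenation of all leaf values.
W = [value for tree in TREES for value in tree["leaf_values"]]

def predict_proba_one(sample):
    """Probability of class 1 for a single sample."""
    x_row = leaf_indicator_row(sample)
    raw_score = sum(x * w for x, w in zip(x_row, W))  # the Xw term for this row
    return 1.0 / (1.0 + math.exp(-raw_score))         # logistic function

sample = [0.7, 1.0]
row = leaf_indicator_row(sample)
proba = predict_proba_one(sample)
```

Because each row of X has exactly one non-zero entry per tree, the product Xw is just the sum of the values of the leaves the sample falls into.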

So here is what the docs could say...

Probability estimates.

The predicted class probabilities of an input sample are computed as the softmax of the weighted terminal leaves from the decision tree ensemble corresponding to the provided sample.
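As a small illustration of how the sigmoid and the softmax relate: in the binary case the model produces one raw score and the sigmoid turns it into P(y = 1), while in the multiclass case there is one raw score per class and softmax normalizes them. The raw scores below are made-up numbers, not LightGBM output.

```python
import math

def sigmoid(z):
    """Logistic function: maps one raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Map one raw score per class to probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Binary case: the sigmoid of z equals a two-class softmax over [0, z].
binary_raw = 0.3
p_sigmoid = sigmoid(binary_raw)
p_softmax = softmax([0.0, binary_raw])[1]

# Multiclass case: one (made-up) raw score per class.
multiclass_raw = [1.2, -0.4, 0.1]
multiclass_proba = softmax(multiclass_raw)
```

The two binary values agree, which is why "softmax of the leaf values" and "logistic function of the leaf values" describe the same thing for two classes.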

For further details, you'd have to delve into boosting, XGBoost, and finally the LightGBM paper, but that seems a bit heavy-handed given the other documentation examples you've given.


Short Explanation

Below we can see an illustration of what each method is calling under the hood. First, the predict_proba() method of the class LGBMClassifier is calling the predict() method from LGBMModel (it inherits from it).

LGBMClassifier.predict_proba() (inherits from LGBMModel)
  |---->LGBMModel().predict() (calls LightGBM Booster)
          |---->Booster.predict()

Then, it calls the predict() method from the LightGBM Booster (the Booster class). In order to call this method, the Booster must be trained first.

Basically, the Booster is the one that generates the predicted value for each sample by calling its predict() method. See below for a detailed follow-up on how this booster works.
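The delegation chain described above can be sketched with toy classes. Everything here (ToyBooster, ToyModel, ToyClassifier, and the linear raw-score function) is a hypothetical stand-in, not LightGBM's actual code; it only mirrors the shape of the calls for a binary problem, where the booster returns P(class 1) and the classifier expands that into two columns.

```python
import math

class ToyBooster:
    """Stand-in for the Booster: owns the trained model and does the real work."""

    def __init__(self, raw_score_fn):
        self._raw_score_fn = raw_score_fn  # plays the role of the tree ensemble

    def predict(self, rows, raw_score=False):
        raw = [self._raw_score_fn(r) for r in rows]
        if raw_score:
            return raw
        return [1.0 / (1.0 + math.exp(-z)) for z in raw]  # P(class 1)

class ToyModel:
    """Stand-in for LGBMModel: forwards predict() to its booster."""

    def __init__(self, booster=None):
        self._Booster = booster

    def predict(self, rows, raw_score=False):
        if self._Booster is None:
            raise RuntimeError("Model is not fitted yet")
        return self._Booster.predict(rows, raw_score=raw_score)

class ToyClassifier(ToyModel):
    """Stand-in for LGBMClassifier: reshapes P(class 1) into [P(0), P(1)] rows."""

    def predict_proba(self, rows):
        p1 = super().predict(rows)
        return [[1.0 - p, p] for p in p1]

clf = ToyClassifier(ToyBooster(lambda row: 2.0 * row[0] - 1.0))
proba = clf.predict_proba([[0.0], [0.5], [1.0]])
```

Calling predict() on a ToyModel built without a booster raises an error, which mirrors the requirement that the Booster be trained first.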

Detailed Explanation or How does the LightGBM Booster work?

Here we try to answer the question: how does the LightGBM Booster work? Going through the Python code gives a general idea of how it is trained and updated. There are further references to LightGBM's C++ libraries that I'm not in a position to explain, so what follows is only a general overview of the Booster's workflow.

A. Initializing and Training the Booster

The _Booster of LGBMModel is initialized by calling the train() function. On line 595 of sklearn.py we see the following code:

self._Booster = train(params, train_set,
                      self.n_estimators, valid_sets=valid_sets, valid_names=eval_names,
                      early_stopping_rounds=early_stopping_rounds,
                      evals_result=evals_result, fobj=self._fobj, feval=feval,
                      verbose_eval=verbose, feature_name=feature_name,
                      callbacks=callbacks, init_model=init_model)

Note. train() comes from engine.py.

Inside train() we see that the Booster is initialized (line 231)

# construct booster
try:
    booster = Booster(params=params, train_set=train_set)
...

and updated at every training iteration (line 242).

for i in range_(init_iteration, init_iteration + num_boost_round):
     ...
     ... 
     booster.update(fobj=fobj)
     ...
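To give a feel for what "construct a booster, then call update() once per round" amounts to, here is a toy sketch in plain Python. It is an assumption-laden illustration, not LightGBM's algorithm: ToyBooster is hypothetical, it only fits depth-1 stumps on a single feature with Newton-style leaf values for the log loss, and the data and hyperparameters are made up. The real update happens in C++ and grows histogram-based, leaf-wise trees.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class ToyBooster:
    """Hypothetical stand-in for a booster: a growing list of stumps on one feature."""

    def __init__(self, xs, ys, learning_rate=0.5, reg_lambda=1.0):
        self.xs, self.ys = xs, ys
        self.learning_rate = learning_rate
        self.reg_lambda = reg_lambda
        self.stumps = []  # each stump is (threshold, left_value, right_value)

    def raw_score(self, x):
        """Sum of the leaf values that x falls into, over all stumps so far."""
        return sum(lv if x <= t else rv for t, lv, rv in self.stumps)

    def predict(self, x):
        return sigmoid(self.raw_score(x))

    def update(self):
        """One boosting iteration: add the stump with the best gain."""
        preds = [self.predict(x) for x in self.xs]
        grad = [p - y for p, y in zip(preds, self.ys)]   # d(log loss)/d(raw score)
        hess = [p * (1.0 - p) for p in preds]            # second derivative
        best = None
        for t in sorted(set(self.xs))[:-1]:              # candidate thresholds
            left = [i for i, x in enumerate(self.xs) if x <= t]
            right = [i for i, x in enumerate(self.xs) if x > t]
            values, gain = [], 0.0
            for idx in (left, right):
                g = sum(grad[i] for i in idx)
                h = sum(hess[i] for i in idx) + self.reg_lambda
                values.append(-self.learning_rate * g / h)  # damped Newton step
                gain += g * g / h
            if best is None or gain > best[0]:
                best = (gain, t, values[0], values[1])
        self.stumps.append(best[1:])

def log_loss(booster):
    eps = 1e-12
    total = 0.0
    for x, y in zip(booster.xs, booster.ys):
        p = booster.predict(x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(booster.xs)

xs = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7]
ys = [0, 0, 0, 1, 1, 1]
booster = ToyBooster(xs, ys)     # "construct booster"
loss_before = log_loss(booster)
for _ in range(10):              # plays the role of num_boost_round
    booster.update()
loss_after = log_loss(booster)
```

Each call to update() adds one stump, so the raw score, and with it the predicted probability, is refined a little on every round.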

B. How does booster.update() work?

To understand how the update() method works we should go to line 2315 of basic.py. Here, we see that this function updates the Booster for one iteration.

There are two alternatives for updating the booster, depending on whether or not you provide an objective function.

  • Objective Function is None

On line 2367 we get to the following code

if fobj is None:
    ...
    ...
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
               self.handle,
               ctypes.byref(is_finished)))
    self.__is_predicted_cur_iter = [False for _ in range_(self.__num_dataset)]
    return is_finished.value == 1

Notice that, since the objective function (fobj) is not provided, the booster is updated by calling LGBM_BoosterUpdateOneIter from _LIB. In short, _LIB is the loaded C++ LightGBM library.

What is _LIB?

_LIB is a variable that stores the loaded LightGBM library by calling _load_lib() (line 29 of basic.py).

Then _load_lib() loads the LightGBM library by finding the path to lib_lightgbm.dll (Windows) or lib_lightgbm.so (Linux) on your system.
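The calling convention used with _LIB (a C function that returns a status code and writes its result through a pointer passed with ctypes.byref) can be tried without the real library. In the sketch below, fake_update_one_iter and safe_call are hypothetical stand-ins written in Python; they only imitate the shape of the call and are not LightGBM's implementation.

```python
import ctypes

# Hypothetical stand-in for a C entry point: it returns 0 on success and
# writes its result through an int pointer, like the is_finished out-parameter.
PROTOTYPE = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(ctypes.c_int))

def _fake_update_one_iter(is_finished_ptr):
    is_finished_ptr[0] = 1   # pretend the library reported "finished"
    return 0                 # 0 means "no error"

fake_update_one_iter = PROTOTYPE(_fake_update_one_iter)

def safe_call(return_code):
    """Raise if the C-style call reported a failure (non-zero return code)."""
    if return_code != 0:
        raise RuntimeError(f"native call failed with code {return_code}")

is_finished = ctypes.c_int(0)                        # storage owned by Python
safe_call(fake_update_one_iter(ctypes.byref(is_finished)))
finished = is_finished.value == 1                    # read the out-parameter
```

The last line has the same form as the return statement in update(): the Python side reads back the value that the native call wrote.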

  • Objective Function provided

When a custom objective function is provided, we get to the following case

else:
    ...
    ...
    grad, hess = fobj(self.__inner_predict(0), self.train_set)

where __inner_predict() is a method from LightGBM's Booster (see line 1930 from basic.py for more details of the Booster class), which predicts for training and validation data. Inside __inner_predict() (line 3142 of basic.py) we see that it calls LGBM_BoosterGetPredict from _LIB to get the predictions, that is,

_safe_call(_LIB.LGBM_BoosterGetPredict(
                self.handle,
                ctypes.c_int(data_idx),
                ctypes.byref(tmp_out_len),
                data_ptr))
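For the custom-objective branch, fobj receives the current predictions and the training set and must return a gradient and a hessian per sample. A minimal sketch of such a function for the binary log loss is shown below. ToyDataset is a hypothetical stand-in that only mimics get_label(), and the sketch assumes the predictions passed in are raw scores (before the sigmoid).

```python
import math

class ToyDataset:
    """Hypothetical stand-in for the training set; only get_label() is mimicked."""

    def __init__(self, labels):
        self._labels = labels

    def get_label(self):
        return self._labels

def logloss_objective(raw_scores, train_set):
    """Return per-sample gradient and hessian of the log loss w.r.t. the raw score."""
    labels = train_set.get_label()
    probs = [1.0 / (1.0 + math.exp(-z)) for z in raw_scores]
    grad = [p - y for p, y in zip(probs, labels)]   # first derivative
    hess = [p * (1.0 - p) for p in probs]           # second derivative
    return grad, hess

# Same call shape as: grad, hess = fobj(self.__inner_predict(0), self.train_set)
grad, hess = logloss_objective([0.0, 2.0, -2.0], ToyDataset([1, 1, 0]))
```

A positive gradient means the raw score for that sample should go down, a negative one means it should go up; the hessian scales how large a step the next tree takes.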

Finally, after the booster has been updated once for each iteration in range_(init_iteration, init_iteration + num_boost_round), it is trained. Thus, Booster.predict() can be called by LGBMClassifier.predict_proba().

Note. The booster is trained as part of the model fitting step, specifically by LGBMModel.fit(); see line 595 of sklearn.py for code details.