Training on imbalanced data using TensorFlow

For imbalanced datasets, the first two methods that come to mind are upweighting positive samples and sampling to achieve balanced batch distributions.

Upweighting positive samples: This refers to increasing the loss on misclassified positive samples when training on datasets that have far fewer positive samples. It incentivizes the ML algorithm to learn parameters that perform better on positive samples. For binary classification, TensorFlow has a simple API that achieves this; see weighted_cross_entropy_with_logits, referenced below:

  • https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits
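As a minimal sketch of that API (assuming the TF 2.x signature, where the first argument is labels; the pos_weight value and the toy tensors are illustrative assumptions):

import tensorflow as tf

# Assume ~90% negatives and ~10% positives, so pos_weight = n_neg / n_pos ≈ 9.
pos_weight = 9.0

labels = tf.constant([[1.0], [0.0], [0.0], [1.0]])
logits = tf.constant([[0.3], [-1.2], [0.8], [2.1]])

# Like sigmoid cross-entropy, but the positive term is scaled by pos_weight,
# so errors on positive samples cost pos_weight times more.
loss = tf.nn.weighted_cross_entropy_with_logits(
    labels=labels, logits=logits, pos_weight=pos_weight)
cost = tf.reduce_mean(loss)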

Batch sampling: This involves sampling the dataset so that each batch of training data has an even distribution of positive and negative samples. This can be done using the rejection sampling API provided by TensorFlow:

  • https://www.tensorflow.org/api_docs/python/tf/contrib/training/rejection_sample
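Note that tf.contrib was removed in TensorFlow 2.x; the closest modern equivalent I know of is tf.data.experimental.rejection_resample (that substitution, the class proportions, and the toy data below are my assumptions):

import tensorflow as tf

# Toy dataset with roughly 10% positives.
features = tf.random.normal([1000, 4])
labels = tf.cast(tf.random.uniform([1000]) < 0.1, tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Accept/reject examples so the output stream approaches a 50/50 class mix.
resampler = tf.data.experimental.rejection_resample(
    class_func=lambda x, y: y,   # returns each element's class
    target_dist=[0.5, 0.5],      # desired class distribution
    initial_dist=[0.9, 0.1])     # approximate distribution in the data

balanced = dataset.apply(resampler)
# rejection_resample yields (class, element) pairs; drop the extra class key.
balanced = balanced.map(lambda extra_class, element: element)
balanced = balanced.shuffle(256).batch(32)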

I'm someone who has struggled with imbalanced data too. My strategies for countering it are as below.

1) Use a cost function that accounts for the 0 and 1 labels at the same time, like the one below.

cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(_pred) + (1 - y) * tf.log(1 - _pred), axis=1))

2) Use SMOTE, an oversampling method that makes the number of 0 and 1 labels similar (see the sketch after this item). Refer to http://comments.gmane.org/gmane.comp.python.scikit-learn/5278
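A minimal SMOTE sketch using the imbalanced-learn package (the package choice and the toy data are my assumptions; the linked thread predates this API):

import numpy as np
from imblearn.over_sampling import SMOTE

# Toy data with roughly 10% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.1).astype(int)

# SMOTE synthesizes new minority-class samples by interpolating between
# a minority sample and its nearest minority-class neighbors.
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_resampled))  # the classes are now roughly balanced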

Both strategies worked when I built a credit rating model.

Logistic regression is a typical method for handling imbalanced data and binary classification problems such as predicting default rates. AUROC is one of the best metrics for evaluating models on imbalanced data.
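As a hedged sketch of both points, assuming TF 2.x with Keras: a single sigmoid unit is equivalent to logistic regression, and the built-in AUC metric can track AUROC during training (the input width of 4 is an illustrative assumption):

import tensorflow as tf

# A single sigmoid unit on top of the features is logistic regression.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation='sigmoid', input_shape=(4,))
])
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    # AUC with curve='ROC' reports AUROC each epoch.
    metrics=[tf.keras.metrics.AUC(curve='ROC', name='auroc')])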


(1) It's OK to use your strategy. I work with imbalanced data as well; I try down-sampling and up-sampling methods first to make the training set evenly distributed, or use an ensemble method to train each classifier on an evenly distributed subset. For example:
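A down-sampling sketch using scikit-learn's resample utility (the toy data and counts are illustrative assumptions):

import numpy as np
from sklearn.utils import resample

# Toy data with roughly 10% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.1).astype(int)

# Down-sample the majority class to the size of the minority class.
X_maj, X_min = X[y == 0], X[y == 1]
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

X_bal = np.vstack([X_maj_down, X_min])
y_bal = np.concatenate([np.zeros(len(X_min), int), np.ones(len(X_min), int)])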

(2) I haven't seen any method that directly maximises AUROC. My thought is that AUROC is based on the true positive and false positive rates, which don't tell you how well the model works on each instance. Thus, maximising it may not necessarily maximise the capability to separate the classes.

(3) Regarding weighting the cost by the ratio of class instances: this is similar to the question "Loss function for class imbalanced binary classifier in Tensor flow" and its answer.
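A minimal sketch of that weighting idea, assuming binary labels and sigmoid outputs (the class counts are illustrative assumptions, and this inverse-frequency scheme is one common choice rather than the linked answer's exact form):

import tensorflow as tf

# Illustrative class counts: 900 negatives, 100 positives.
n_neg, n_pos = 900.0, 100.0
w0 = (n_neg + n_pos) / (2.0 * n_neg)  # weight for class 0
w1 = (n_neg + n_pos) / (2.0 * n_pos)  # weight for class 1

y = tf.constant([1.0, 0.0, 1.0])
pred = tf.constant([0.7, 0.2, 0.4])   # sigmoid outputs

# Scale each example's cross-entropy by its class weight, so the rarer
# class contributes proportionally more to the total cost.
weights = y * w1 + (1.0 - y) * w0
bce = -(y * tf.math.log(pred) + (1.0 - y) * tf.math.log(1.0 - pred))
cost = tf.reduce_mean(weights * bce)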