How to get balanced sample of classes from an imbalanced dataset in sklearn?

As you didn't provide us with the dataset, I'm using mock data generated with make_blobs. Your question doesn't say how many test samples there should be; I've defined test_samples = 50000, but you can change this value to fit your needs.

from sklearn import datasets

train_samples = 5000
test_samples = 50000
total_samples = train_samples + test_samples
X, y = datasets.make_blobs(n_samples=total_samples, centers=2, random_state=0)

The following snippet splits data into train and test with balanced classes:

from sklearn.model_selection import StratifiedShuffleSplit    

sss = StratifiedShuffleSplit(train_size=train_samples, n_splits=1, 
                             test_size=test_samples, random_state=0)  

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Demo:

In [54]: from scipy import stats

In [55]: stats.itemfreq(y_train)
Out[55]: 
array([[   0, 2500],
       [   1, 2500]], dtype=int64)

In [56]: stats.itemfreq(y_test)
Out[56]: 
array([[    0, 25000],
       [    1, 25000]], dtype=int64)
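
Note that scipy.stats.itemfreq, used in the demo above, was deprecated in SciPy 1.0 and removed in 1.2; np.unique with return_counts=True gives the same per-class counts:

import numpy as np

# per-class counts, equivalent to the deprecated stats.itemfreq
labels, counts = np.unique(y_train, return_counts=True)
print(labels, counts)  # [0 1] [2500 2500]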

EDIT

As @geompalik correctly pointed out, if your dataset is unbalanced, StratifiedShuffleSplit won't yield balanced splits. In that case you might find this function useful:

import numpy as np

def stratified_split(y, train_ratio):

    def split_class(y, label, train_ratio):
        # positions of this class, in the array's original order
        indices = np.flatnonzero(y == label)
        n_train = int(indices.size * train_ratio)
        # first train_ratio fraction goes to train, the rest to test
        train_index = indices[:n_train]
        test_index = indices[n_train:]
        return (train_index, test_index)

    # split each class separately, then pool the indices
    idx = [split_class(y, label, train_ratio) for label in np.unique(y)]
    train_index = np.concatenate([train for train, _ in idx])
    test_index = np.concatenate([test for _, test in idx])
    return train_index, test_index

Demo:

I have previously generated mock data with the number of samples per class you indicated (code not shown here).

In [153]: y
Out[153]: array([1, 0, 1, ..., 0, 0, 1])

In [154]: y.size
Out[154]: 55000

In [155]: train_ratio = float(train_samples)/(train_samples + test_samples)  

In [156]: train_ratio
Out[156]: 0.09090909090909091

In [157]: train_index, test_index = stratified_split(y, train_ratio)

In [158]: y_train = y[train_index]

In [159]: y_test = y[test_index]

In [160]: y_train.size
Out[160]: 5000

In [161]: y_test.size
Out[161]: 50000

In [162]: stats.itemfreq(y_train)
Out[162]: 
array([[   0, 2438],
       [   1, 2562]], dtype=int64)

In [163]: stats.itemfreq(y_test)
Out[163]: 
array([[    0, 24380],
       [    1, 25620]], dtype=int64)
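
Note that stratified_split assigns the first train_ratio fraction of each class's indices to the training set in the array's original order, so the split is deterministic. If your samples are ordered (say, by time), you may want a random split instead. A minimal sketch that shuffles before calling the function above:

import numpy as np

rng = np.random.default_rng(0)              # seeded for reproducibility
perm = rng.permutation(y.size)              # random order of all samples
train_index, test_index = stratified_split(y[perm], train_ratio)
# map positions back to indices into the original arrays
train_index, test_index = perm[train_index], perm[test_index]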

The problem is that the StratifiedShuffleSplit method you use, by definition, splits while preserving the class percentages (that is what stratification means).

A straightforward way to achieve what you want while still using StratifiedShuffleSplit is to subsample the dominant class first, so that the initial dataset is balanced, and then continue. This is easy to accomplish with numpy, although the splits you describe are already almost balanced.
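
For instance, here is a minimal sketch of that idea, assuming binary labels 0/1 in y: downsample both classes to the minority-class size, then run StratifiedShuffleSplit on the balanced arrays as before.

import numpy as np

rng = np.random.default_rng(0)

# indices of each class (assumes binary labels 0/1)
idx0 = np.flatnonzero(y == 0)
idx1 = np.flatnonzero(y == 1)

# draw the minority-class size from each class, without replacement
n = min(idx0.size, idx1.size)
keep = np.concatenate([rng.choice(idx0, n, replace=False),
                       rng.choice(idx1, n, replace=False)])

# X_bal/y_bal are balanced; StratifiedShuffleSplit now yields balanced splits
X_bal, y_bal = X[keep], y[keep]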
