balance numpy array with over-sampling

This gives a random distribution with equal probability for each class:

distrib = np.bincount(a[:,-1])
prob = 1/distrib[a[:, -1]].astype(float)
prob /= prob.sum()

In [38]: a[np.random.choice(np.arange(len(a)), size=np.count_nonzero(distrib)*distrib.max(), p=prob)]
Out[38]: 
array([[ 5, 50, 46,  0],
       [ 4, 11,  8,  1],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9],
       [ 2, 29, 30,  1],
       [ 0, 10, 92,  9],
       [ 3, 92,  1,  0],
       [ 1,  7, 89,  1],
       [ 1,  7, 89,  1]])

Each class has equal probability, not guaranteed equal incidence.


The following code does what you are after:

a = np.array([[  2,  29,  30,   1],
              [  5,  50,  46,   0],
              [  1,   7,  89,   1],
              [  0,  10,  92,   9],
              [  4,  11,   8,   1],
              [  3,  92,   1,   0]])

unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
out = np.empty((cnt*len(unq),) + a.shape[1:], a.dtype)
for j in xrange(len(unq)):
    indices = np.random.choice(np.where(unq_idx==j)[0], cnt)
    out[j*cnt:(j+1)*cnt] = a[indices]

>>> out
array([[ 5, 50, 46,  0],
       [ 5, 50, 46,  0],
       [ 5, 50, 46,  0],
       [ 1,  7, 89,  1],
       [ 4, 11,  8,  1],
       [ 2, 29, 30,  1],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9]])

When numpy 1.9 is released, or if you compile from the development branch, then the first two lines can be condensed into:

unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True,
                                  return_counts=True)

Note that, the way np.random.choice works, there is no guarantee that all rows of the original array will be present in the output one, as the example above shows. If that is needed, you could do something like:

unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
out = np.empty((cnt*len(unq) - len(a),) + a.shape[1:], a.dtype)
slices = np.concatenate(([0], np.cumsum(cnt - unq_cnt)))
for j in xrange(len(unq)):
    indices = np.random.choice(np.where(unq_idx==j)[0], cnt - unq_cnt[j])
    out[slices[j]:slices[j+1]] = a[indices]
out = np.vstack((a, out))

>>> out
array([[ 2, 29, 30,  1],
       [ 5, 50, 46,  0],
       [ 1,  7, 89,  1],
       [ 0, 10, 92,  9],
       [ 4, 11,  8,  1],
       [ 3, 92,  1,  0],
       [ 5, 50, 46,  0],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9]])