pandas qcut not putting equal number of observations into each bin

qcut is trying to compensate for repeating values. This is earlier to visualize if you return the bin limits along with your qcut results:

In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')

In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
 array([ 11. ,  25.2,  30. ,  33. ,  41. ,  53. ]))

You can see that there was no choice but to set the bin limit at 30, so qcut had to "steal" one from the expected values in the third bin and place them in the second. I'm thinking that this is just happening at a larger scale with your percentiles since you're basically condensing their ranks into a 1 to 100 scale. Any reason not to just run qcut directly on the data instead of the percentiles or return percentiles that have greater precision?

If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.

test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]

np.random.seed(42) #set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list)) #add noise to data

test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)

Output:

(0    0
 1    0
 2    1
 3    2
 4    1
 5    2
 6    3
 7    3
 8    4
 9    4
 Name: value_rank, dtype: int64,
 array([ 11.37454012,  25.97573801,  30.42160255,  33.11683016,
         41.81316392,  53.70807258]))

So, now we have two 0's, two 1's, two 2's and two 4's!

Disclaimer

Obviously, use this at your discretion because results can vary based on your data; like how large your data set is and/or the spacing, for instance. The above "trick" works well for integers because even though we are "salting" the test_list, it will still rank order in the sense that there will won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be tricky and you may have to reduce the size of your noise accordingly. For instance if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.

Just try with the below code :

pd.qcut(df.rank(method='first'),nbins)

pandas qcut not putting equal number of observations into each bin

Tags:

Python

Pandas

Binning

Related

Recent Posts