python bin data and return bin midpoint (maybe using pandas.cut and qcut)

I see that this is an old post but I will take the liberty to answer it anyway.

It is now possible (ref @chrisb's answer) to access the endpoints of each categorical interval through its left and right attributes.

import numpy as np
import pandas as pd

s = pd.cut(pd.Series(np.arange(11)), bins=5)

mid = [(a.left + a.right) / 2 for a in s]
mid
Out[34]: [0.995, 0.995, 0.995, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]

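Because the intervals are open on the left and closed on the right, pd.cut lowers the left edge of the first interval by 0.1% of the data range so that the minimum value is included. The 'first' interval (the one that should start at 0) therefore actually starts at -0.01. To get a midpoint that uses 0 as the left value, you can do this: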

mid_alt = [(a.left + a.right)/2 if a.left != -0.01 else a.right/2 for a in s]
Out[35]: [1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]
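
If you would rather not hard-code -0.01, one option is to clamp the left edge to the smallest value in the data. This is only a sketch on the same toy data; the names x and lo are mine, not part of pandas:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(11))
s = pd.cut(x, bins=5)

# smallest value in the data; the first bin's left edge is clamped to it
lo = x.min()

# max(a.left, lo) only changes the first interval, whose left edge lies below lo
mid_alt = [(max(a.left, lo) + a.right) / 2 for a in s]
```

This avoids comparing floats for equality and keeps working if the data range (and therefore the 0.1% adjustment) changes.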

Alternatively, you can pass right=False so that the intervals are closed on the left and open on the right:

t = pd.cut(pd.Series(np.arange(11)), bins = 5, right=False)
Out[38]: 
0       [0.0, 2.0)
1       [0.0, 2.0)
2       [2.0, 4.0)
3       [2.0, 4.0)
4       [4.0, 6.0)
5       [4.0, 6.0)
6       [6.0, 8.0)
7       [6.0, 8.0)
8     [8.0, 10.01)
9     [8.0, 10.01)
10    [8.0, 10.01)

But, as you can see, the same problem then appears at the last interval, whose right edge becomes 10.01.
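
One way to sidestep the adjusted edge on either side might be to pass explicit bin edges and compute the midpoints from those edges rather than from the interval labels. A minimal sketch, assuming evenly spaced edges built with np.linspace; the names edges, codes, mids and centres are illustrative:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(11))

# explicit, evenly spaced edges: 0, 2, 4, 6, 8, 10
edges = np.linspace(x.min(), x.max(), 6)

# labels=False returns integer bin codes instead of intervals;
# include_lowest=True keeps x.min() inside the first bin
codes = pd.cut(x, bins=edges, labels=False, include_lowest=True)

# midpoints computed from the unadjusted edges
mids = (edges[:-1] + edges[1:]) / 2

# look up the midpoint of the bin each value fell into
centres = pd.Series(mids[codes.to_numpy()], index=x.index)
```

Since the midpoints come from the edges you supplied, the first and last bins get the centres 1.0 and 9.0 here. Note that this lookup assumes every value falls inside the edges (no NaN codes).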

I noticed that each category is an Interval object with a mid property, so you can calculate the midpoints via an apply:

In [1]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame({"val":np.arange(11)})
   ...: df["bins"] = pd.cut(df["val"], bins = 5)
   ...: df["bin_centres"] = df["bins"].apply(lambda x: x.mid)
   ...: df
Out[1]:
    val          bins bin_centres
0     0  (-0.01, 2.0]       0.995
1     1  (-0.01, 2.0]       0.995
2     2  (-0.01, 2.0]       0.995
3     3    (2.0, 4.0]       3.000
4     4    (2.0, 4.0]       3.000
5     5    (4.0, 6.0]       5.000
6     6    (4.0, 6.0]       5.000
7     7    (6.0, 8.0]       7.000
8     8    (6.0, 8.0]       7.000
9     9   (8.0, 10.0]       9.000
10   10   (8.0, 10.0]       9.000
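
For larger frames, a vectorised variant may be preferable to a row-wise apply. The sketch below assumes the categories of the binned column form an IntervalIndex (which is what pd.cut produces by default) and uses its mid attribute; the helper names bin_mids and codes are mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"val": np.arange(11)})
df["bins"] = pd.cut(df["val"], bins=5)

# the categories are an IntervalIndex; .mid gives one midpoint per bin
bin_mids = df["bins"].cat.categories.mid.to_numpy()

# integer code of each row's bin; -1 marks a missing value
codes = df["bins"].cat.codes.to_numpy()

# map codes to midpoints, leaving NaN where the value was not binned
df["bin_centres"] = np.where(codes >= 0, bin_mids[codes], np.nan)
```

This computes each bin's midpoint once and then looks it up per row, instead of calling a Python function for every row.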

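When this answer was written, IntervalIndex was still a work-in-progress proposal that would make this type of operation very straightforward; it has since been added to pandas (version 0.20) and is what provides the left, right and mid attributes on intervals.

Without relying on it, you can get the bin edges by passing retbins=True and calculate the midpoints yourself.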

In [8]: s, bins = pd.cut(pd.Series(np.arange(11)), bins = 5, retbins=True)

In [11]: mid = [(a + b) /2 for a,b in zip(bins[:-1], bins[1:])]

In [13]: s.cat.rename_categories(mid)
Out[13]: 
0     0.995
1     0.995
2     0.995
3     3.000
4     3.000
5     5.000
6     5.000
7     7.000
8     7.000
9     9.000
10    9.000
dtype: category
Categories (5, float64): [0.995 < 3.000 < 5.000 < 7.000 < 9.000]
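
The same retbins approach should carry over to pd.qcut, which the question also mentions. A minimal sketch, assuming quartile bins (q=4) on the same toy data; the names codes, edges, mids and centres are illustrative:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(11))

# labels=False returns integer bin codes; retbins=True also returns the quantile edges
codes, edges = pd.qcut(x, q=4, labels=False, retbins=True)

# midpoint of each pair of adjacent quantile edges
mids = (edges[:-1] + edges[1:]) / 2

# look up the midpoint of the bin each value fell into
centres = pd.Series(mids[codes.to_numpy()], index=x.index)
```

Keep in mind that with quantile bins the midpoint is the centre of the interval, not of the data inside it, so it can be a poor summary when the bins have very different widths.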
