Python: weighted median algorithm with pandas

This function generalizes proofreader's solution:

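def weighted_median(df, val, weight):
    # Sort by value so the running weight accumulates from smallest to largest.
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    # The median is the first value at which half the total weight is reached.
    cutoff = df_sorted[weight].sum() / 2.
    return df_sorted[cumsum >= cutoff][val].iloc[0]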

In this example it'd be weighted_median(df, 'impwealth', 'indweight').
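As a quick sanity check, here is a minimal sketch on made-up data. The `toy` frame and the `equal` column are hypothetical; only the column names `impwealth` and `indweight` come from the question.

```python
import pandas as pd

def weighted_median(df, val, weight):
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.
    return df_sorted[cumsum >= cutoff][val].iloc[0]

# Hypothetical toy data (deliberately unsorted), not the asker's survey.
toy = pd.DataFrame({'impwealth': [30, 10, 20, 40],
                    'indweight': [1, 1, 1, 5]})

# The row with value 40 carries 5 of the 8 units of weight,
# so the running weight first reaches the cutoff (4) there.
heavy = weighted_median(toy, 'impwealth', 'indweight')
print(heavy)  # 40

# With equal weights the function returns the lower of the two middle
# values (20), not their average, because it never interpolates.
toy['equal'] = 1
plain = weighted_median(toy, 'impwealth', 'equal')
print(plain)  # 20
```

Note that `sort_values` returns a sorted copy here, so the caller's DataFrame keeps its original row order.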


Have you tried the wquantiles package? I had never used it before, but it has a weighted median function that seems to give at least a reasonable answer (you'll probably want to double check that it's using the approach you expect).

In [12]: import weighted

In [13]: weighted.median(df['impwealth'], df['indweight'])
Out[13]: 914662.0859091772

If you want to do this in pure pandas, here's a way. It does not interpolate either; it simply returns one of the observed values. (@svenkatesh, you were missing the cumulative sum in your pseudocode.)

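# Note: inplace=True reorders df itself; work on a sorted copy if the original order matters.
df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
# First value at which the running weight reaches half the total
median = df.impwealth[cumsum >= cutoff].iloc[0]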

This gives a median of 925000.
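The gap between 925000 and the 914662.09 reported by `weighted.median` is presumably down to interpolation. The sketch below shows one common interpolation scheme next to the step version on made-up numbers. I have not checked it against the wquantiles source, so treat the midpoint rule as an assumption rather than as what the package does; the function names and data are hypothetical, and the weights are assumed to be positive.

```python
import numpy as np

def step_weighted_median(values, weights):
    # Non-interpolating version: first sorted value whose running
    # weight reaches half of the total weight.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cumsum = np.cumsum(w)
    idx = np.searchsorted(cumsum, cumsum[-1] / 2.0, side='left')
    return v[idx]

def interpolated_weighted_median(values, weights):
    # Assumed scheme: place each observation at the midpoint of its own
    # weight span (as a fraction of the total weight), then interpolate
    # linearly to find the value at fraction 0.5.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    positions = (np.cumsum(w) - 0.5 * w) / w.sum()
    return np.interp(0.5, positions, v)

# Hypothetical toy data
vals = [30, 10, 20, 40]
wts = [1, 1, 1, 5]

step_result = step_weighted_median(vals, wts)
interp_result = interpolated_weighted_median(vals, wts)
print(step_result, interp_result)
```

On this toy data the step version returns an observed value (40), while the interpolated version lands between 30 and 40, which is the same kind of discrepancy as between the two results above.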