Why is pandas nlargest slower than mine?

I guess you can use this:

df.sort_values(by=['SCORE'],ascending=False).groupby('ID').head(2)

This is the same as your manual solution: a single sort_values followed by head on the pandas groupby.
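A minimal sketch of the idea on toy data (the column names ID, SCORE, CAT are assumed from the question; the values here are made up):

```python
import pandas as pd

# Toy frame with the assumed column names.
df = pd.DataFrame({
    'ID':    [1, 1, 1, 2, 2, 2],
    'SCORE': [10, 30, 20, 5, 15, 25],
    'CAT':   ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Sort the whole frame once, then keep the first two rows of each group.
# Because the frame is already sorted by SCORE descending, head(2) picks
# each group's two highest scores.
top2 = df.sort_values(by=['SCORE'], ascending=False).groupby('ID').head(2)
print(top2)
```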

import time

t0 = time.time()
df4 = df.sort_values(by=['SCORE'], ascending=False).groupby('ID').head(2)
t1 = time.time()

# Normalise to a sorted list of tuples so the result can be compared
# with df3_list from the previous solution.
df4_list = [tuple(x) for x in df4[['ID', 'SCORE', 'CAT']].values]
df4_list = sorted(df4_list, reverse=True)
is_same = df3_list == df4_list

print('SORT/HEAD solution: {:0.2f}s'.format(t1 - t0))
print(is_same)

gives

SORT/HEAD solution: 0.08s
True

timeit

77.9 ms ± 7.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each).

As to why nlargest is slower than the other solutions: I guess calling it once per group creates the overhead (%prun shows 15764409 function calls (15464352 primitive calls) in 30.293 seconds).

For this solution, %prun shows only 1533 function calls (1513 primitive calls) in 0.078 seconds.
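To illustrate the overhead claim, here is a sketch comparing the two approaches on synthetic data (the frame, sizes, and seed are made up for illustration): per-group nlargest makes one Python-level call per group, while the global sort pays the sorting cost once. Both select the same rows when scores are unique:

```python
import numpy as np
import pandas as pd

# Synthetic frame: many small groups, which is where per-group calls hurt.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'ID': rng.integers(0, 1000, 10_000),
    'SCORE': rng.random(10_000),
})

# nlargest invoked once per group (one call per group = lots of overhead).
per_group = df.groupby('ID')['SCORE'].nlargest(2).reset_index(level=0)

# One global sort, then head(2) per group (the solution above).
global_sort = df.sort_values('SCORE', ascending=False).groupby('ID').head(2)

# Both pick the same (ID, SCORE) pairs.
a = sorted(map(tuple, per_group[['ID', 'SCORE']].values))
b = sorted(map(tuple, global_sort[['ID', 'SCORE']].values))
print(a == b)
```

Wrapping each version in %timeit should show the gap growing with the number of groups, since the per-call overhead scales with group count rather than row count.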