Filter Pandas dataframe based on combination of two columns

The tuple comparison approach as outlined by @Vivek Kalyanarangan is the way to go but the speed can be significantly increased in case of large dataframes by utilizing the MultiIndex instead of using an apply function for tuple creation:

For example, in your case:

keep_tuples = [(1,2), (4,3)]
tuples_in_df = pd.MultiIndex.from_frame(df[["a","b"]])
df[tuples_in_df.isin(keep_tuples)]

This leads to ~5X speed improvement on a 1,000,000 X 2 sized df when compared to using apply function.

Use -

df[df[['a', 'b']].apply(tuple, axis=1).isin([(1,2), (4,3)])]

Output

    a   b
0   1   2
3   4   3

Explanation

df[['a', 'b']].apply(tuple, axis=1) gives a series of tuples -

0    (1, 2)
1    (2, 3)
2    (4, 2)
3    (4, 3)

.isin([(1,2), (4,3)]) searches for the desired tuples and gives a boolean series

Another idea is to join the two columns (a and b) as strings, and check for 12 and 43, i.e.

df[df.astype(str).sum(axis = 1).isin([12, 43])]
#   a  b
#0  1  2
#3  4  3

Filter Pandas dataframe based on combination of two columns

Tags:

Python

Pandas

Related

Recent Posts