How to implement a Boolean search with multiple columns in pandas

All the considerations made by @EdChum in 2014 are still valid, but the pandas.Dataframe.ix method is deprecated from the version 0.0.20 of pandas. Directly from the docs:

Warning: Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.

In subsequent versions of pandas, this method has been replaced by new indexing methods pandas.Dataframe.loc and pandas.Dataframe.iloc.

If you want to learn more, in this post you can find comparisons between the methods mentioned above.

Ultimately, to date (and there does not seem to be any change in the upcoming versions of pandas from this point of view), the answer to this question is as follows:

foo = df.loc[(df['column1']==value) | (df['columns2'] == 'b') | (df['column3'] == 'c')]

You need to enclose multiple conditions in braces due to operator precedence and use the bitwise and (&) and or (|) operators:

foo = df[(df['column1']==value) | (df['columns2'] == 'b') | (df['column3'] == 'c')]

If you use and or or, then pandas is likely to moan that the comparison is ambiguous. In that case, it is unclear whether we are comparing every value in a series in the condition, and what does it mean if only 1 or all but 1 match the condition. That is why you should use the bitwise operators or the numpy np.all or np.any to specify the matching criteria.

There is also the query method: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.query.html

but there are some limitations mainly to do with issues where there could be ambiguity between column names and index values.


A more concise--but not necessarily faster--method is to use DataFrame.isin() and DataFrame.any()

In [27]: n = 10

In [28]: df = DataFrame(randint(4, size=(n, 2)), columns=list('ab'))

In [29]: df
Out[29]:
   a  b
0  0  0
1  1  1
2  1  1
3  2  3
4  2  3
5  0  2
6  1  2
7  3  0
8  1  1
9  2  2

[10 rows x 2 columns]

In [30]: df.isin([1, 2])
Out[30]:
       a      b
0  False  False
1   True   True
2   True   True
3   True  False
4   True  False
5  False   True
6   True   True
7  False  False
8   True   True
9   True   True

[10 rows x 2 columns]

In [31]: df.isin([1, 2]).any(1)
Out[31]:
0    False
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8     True
9     True
dtype: bool

In [32]: df.loc[df.isin([1, 2]).any(1)]
Out[32]:
   a  b
1  1  1
2  1  1
3  2  3
4  2  3
5  0  2
6  1  2
8  1  1
9  2  2

[8 rows x 2 columns]

Tags:

Python

Pandas