4 ways to select rows from a DataFrame based on column values

There are several ways to select rows from a Pandas dataframe:

  1. Boolean indexing (DataFrame[DataFrame['col'] == value])
  2. Positional indexing (DataFrame.iloc[...])
  3. Label indexing (DataFrame.xs(...))
  4. DataFrame.query(...) API

Below I show you examples of each, with advice when to use certain techniques. Assume our criterion is column 'A' == 'bar'

Setup

The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value, and include some other common use cases.

An example would be:


import pandas as pd, numpy as np

df = pd.DataFrame({
    'A':
    'Contrary bar popular bar Lorem bar Ipsum is not simply'.split(),
    'B':
    'Lorem Ipsum comes from sections one two three four five'.split(),
    'C':
    np.arange(10),
    'D':
    np.arange(10) * 2
})

print(df)

Output:

          A         B  C   D
0  Contrary     Lorem  0   0
1       bar     Ipsum  1   2
2   popular     comes  2   4
3       bar      from  3   6
4     Lorem  sections  4   8
5       bar       one  5  10
6     Ipsum       two  6  12
7        is     three  7  14
8       not      four  8  16
9    simply      five  9  18

Method 1: Boolean indexing (DataFrame[DataFrame['col'] == value] ) #

This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.

mask = df['A'] == 'bar'
print(mask)

Output:

     A      B  C   D
1  bar  Ipsum  1   2
3  bar   from  3   6
5  bar    one  5  10

Method 2: Positional indexing (DataFrame.iloc[...]) #

Positional indexing (DataFrame.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

mask = df['A'] == 'bar'
pos = np.flatnonzero(mask)
print(df.iloc[pos])

Output as same as the previous method:

     A      B  C   D
1  bar  Ipsum  1   2
3  bar   from  3   6
5  bar    one  5  10

Method 3: Label indexing (DataFrame.xs(...)) #

Label indexing can be very handy, but in this case, we are again doing more work for no benefit

res = df.set_index('A', append=True, drop=False).xs('bar', level=1)

print(res)

Output as same as the previous methods:

     A      B  C   D
1  bar  Ipsum  1   2
3  bar   from  3   6
5  bar    one  5  10

Method 4: DataFrame.query(...) API #

pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.

res = df.query('A == "bar"')

print(res)

Output as same as the previous methods:

     A      B  C   D
1  bar  Ipsum  1   2
3  bar   from  3   6
5  bar    one  5  10

Tags:

Python

Pandas