Python Pandas - difference between 'loc' and 'where'?

Think of loc as a filter - give me only the parts of the df that conform to a condition.

where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.

ID  Run Distance Goals Gender
0   1   234      12     m
1   2   35       23     m
2   3   77       56     m
3   0   0        0      0
4   0   0        0      0
5   0   0        0      0
6   0   0        0      0
7   0   0        0      0
8   0   0        0      0
9   10  123      34     m

Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column names, while iloc uses their index number. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:

    Gender  Goals
0   m   12
1   m   23

If check docs DataFrame.where it replace rows by condition - default by NAN, but is possible specify value:

df2 = df.where(df['Goals']>10)
print (df2)
     ID  Run Distance  Goals Gender
0   1.0         234.0   12.0      m
1   2.0          35.0   23.0      m
2   3.0          77.0   56.0      m
3   NaN           NaN    NaN    NaN
4   NaN           NaN    NaN    NaN
5   NaN           NaN    NaN    NaN
6   NaN           NaN    NaN    NaN
7   NaN           NaN    NaN    NaN
8   NaN           NaN    NaN    NaN
9  10.0         123.0   34.0      m

df2 = df.where(df['Goals']>10, 100)
print (df2)
    ID  Run Distance  Goals Gender
0    1           234     12      m
1    2            35     23      m
2    3            77     56      m
3  100           100    100    100
4  100           100    100    100
5  100           100    100    100
6  100           100    100    100
7  100           100    100    100
8  100           100    100    100
9   10           123     34      m

Another syntax is called boolean indexing and is for filter rows - remove rows matched condition.

df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]

print (df2)
   ID  Run Distance  Goals Gender
0   1           234     12      m
1   2            35     23      m
2   3            77     56      m
9  10           123     34      m

If use loc is possible also filter by rows by condition and columns by name(s):

s = df.loc[df['Goals']>10, 'ID']
print (s)
0     1
1     2
2     3
9    10
Name: ID, dtype: int64

df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
   ID Gender
0   1      m
1   2      m
2   3      m
9  10      m

  • loc retrieves only the rows that matches the condition.
  • where returns the whole dataframe, replacing the rows that don't match the condition (NaN by default).

Tags:

Python

Pandas