Pandas drop rows vs filter

The recommended solution is the more efficient one, which in this case is the first:

df = df[df['A'] >= 0]

In the second solution

selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)

you are repeating the slicing process. Let's break it into pieces to understand why.

When you write

df['A'] >= 0

you are creating a mask: a Boolean Series with one entry for each index label of df, whose value is True or False according to a condition (in this case, whether the value of column 'A' at that index is greater than or equal to 0).
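To make the mask concrete, here is a minimal sketch. The sample values are made up for illustration; only the column name 'A' comes from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column name 'A' comes from the question.
df = pd.DataFrame({'A': [3, -1, 0, -7, 5]})

# Comparing a column to a scalar yields a Boolean Series
# that shares the index of df.
mask = df['A'] >= 0

print(mask.tolist())                # [True, False, True, False, True]
print(mask.index.equals(df.index))  # True
```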

When you write

df[df['A'] >= 0]

you are accessing the rows for which your mask (df['A'] >= 0) is True. This is a selection method supported by pandas (boolean indexing) that lets you pick rows by passing a Boolean Series; it returns a new DataFrame with only the entries for which the Series was True.
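A small sketch of that selection, again on made-up data (column 'B' is a hypothetical extra column, added only to show that whole rows are kept). Note that the surviving rows keep their original index labels.

```python
import pandas as pd

# Hypothetical sample data; 'B' is an illustrative extra column.
df = pd.DataFrame({'A': [3, -1, 0, -7, 5], 'B': list('abcde')})

# Passing the Boolean Series inside [] keeps only the rows where it is True.
kept = df[df['A'] >= 0]

print(kept.index.tolist())  # [0, 2, 4]
print(kept['B'].tolist())   # ['a', 'c', 'e']
print(len(df), len(kept))   # 5 3
```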

Finally, when you write this

selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)

you are repeating the process because

df[df['A'] < 0]

is already slicing your DataFrame (in this case, for the rows you want to drop). You are then taking the index labels of those rows, going back to the original DataFrame, and explicitly dropping them. There is no need for this: the first solution already gets the rows you want in a single selection.
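The sketch below checks that both approaches give the same result on a small made-up frame. It assumes column 'A' has no missing values and the index labels are unique; with NaN in 'A' (which fails both comparisons) or with duplicate labels, the two can differ.

```python
import pandas as pd

# Hypothetical sample data: no NaN in 'A', unique default index.
df = pd.DataFrame({'A': [3, -1, 0, -7, 5], 'B': list('abcde')})

# First solution: one boolean selection.
filtered = df[df['A'] >= 0]

# Second solution: select the unwanted rows, then drop them by label.
sel_rows = df[df['A'] < 0].index
dropped = df.drop(sel_rows, axis=0)

print(filtered.equals(dropped))  # True
```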

df = df[df['A'] >= 0]

is indeed the faster solution. Just be aware that pandas may still treat the result as derived from the original DataFrame rather than as a fully independent one. This can lead to trouble, for example when you later assign to its values, as pandas may give you a SettingWithCopyWarning.

The simple fix, of course, is what Wen-Ben recommended:

df = df[df['A'] >= 0].copy()
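As a rough illustration with made-up data, the explicit copy can be modified freely and the original frame is left untouched. The name subset is only used here so that both frames can be compared.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'A': [3, -1, 0, -7, 5]})

# .copy() makes the filtered result an independent DataFrame.
subset = df[df['A'] >= 0].copy()
subset['A'] = subset['A'] * 10  # modifies only the copy

print(subset['A'].tolist())  # [30, 0, 50]
print(df['A'].tolist())      # [3, -1, 0, -7, 5]
```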
