Pandas: Filter dataframe for values that are too frequent or too rare

I am new to Python and Pandas. I came up with the solution below; maybe others have a better or more efficient approach.

Assuming your DataFrame is DF, the following code filters out all infrequent values. Just be sure to update the col and bin_freq variables. DF_Filtered is your new filtered DataFrame.

import pandas as pd

# Column you want to filter
col = 'time of day'

# Frequency threshold to filter on. Currently set to 5%.
bin_freq = 5 / 100

DF_Filtered = pd.DataFrame()
total_counts = DF[col].count()

for i in DF[col].unique():
    counts = DF[DF[col] == i].count()[col]
    freq = counts / total_counts

    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col] == i], DF_Filtered])

print(DF_Filtered)
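
For reference, here is a minimal, made-up dataset you can exercise the snippet above on (the column values are invented for illustration):

import pandas as pd

# Made-up sample data: 'night' appears once in 25 rows (4%),
# which is below the 5% threshold, so its rows get dropped.
DF = pd.DataFrame({'time of day': ['morning'] * 12 + ['evening'] * 12 + ['night']})

Running the loop above on this DF leaves only the 'morning' and 'evening' rows.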

I would go with one of the following:

Option A

import numpy as np

m = 0.03 * len(df)
df[np.all(
    df.apply(
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index)
    ).to_numpy(),
    axis=1)]

Explanation:

  • m = 0.03 * len(df) is the absolute count threshold corresponding to 3% of the rows (it's nice to take the constant out of the complicated expression)

  • df[np.all(..., axis=1)] keeps only the rows where the condition holds across all columns.

  • df.apply(...) applies a function to each column; .to_numpy() turns the resulting boolean DataFrame into a matrix.

  • c.isin(...) checks, for each column item, whether it is in some set.

  • c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m (see the sketch after this list).
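
To make the last two bullets concrete, here is a minimal sketch on a toy Series (the data and the absolute threshold m are made up):

import pandas as pd

c = pd.Series(['a', 'a', 'a', 'b'])
m = 2  # hypothetical absolute threshold for this toy Series

frequent = c.value_counts()[c.value_counts() > m].index
print(list(frequent))             # ['a'] -- only 'a' occurs more than twice
print(c.isin(frequent).tolist())  # [True, True, True, False]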

Option B

m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

The explanation is similar to the one above.
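
For instance, on some made-up data (column names and values are invented), Option B behaves like this:

import pandas as pd

# 'z' appears once in 100 rows (1%), below the 3% cutoff, so it is filtered out.
df = pd.DataFrame({'A': ['x'] * 40 + ['y'] * 59 + ['z'],
                   'B': ['p'] * 50 + ['q'] * 50})

m = 0.03 * len(df)  # threshold of 3 rows
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

print(df['A'].unique())  # ['x' 'y']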


Tradeoffs:

  • Personally, I find B more readable.

  • B creates a new DataFrame for each column it filters; for large DataFrames, that's probably more expensive.


This procedure goes through each column of the DataFrame and eliminates rows whose value in that column occurs less often than a given threshold percentage, shrinking the DataFrame on each iteration.

This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:

  • It normalizes the value counts so you can use a percentage threshold directly.
  • It computes the value counts just once per column instead of twice, which makes execution faster.

Code:

threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
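
As a quick illustration of what normalize=True buys you, on a made-up toy column:

import pandas as pd

s = pd.Series(['a'] * 97 + ['b'] * 2 + ['c'])
counts = s.value_counts(normalize=True)
print(counts.to_dict())  # {'a': 0.97, 'b': 0.02, 'c': 0.01}

# With threshold = 0.03, only 'a' survives the filter.
print(list(counts[counts > 0.03].index))  # ['a']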

Code timing:

import string
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), (10**6, 4), replace=True),
                   columns=list('ABCD'))

%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]

1 loops, best of 3: 485 ms per loop

%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

1 loops, best of 3: 688 ms per loop