How can I filter lines on load in Pandas read_csv function?

There isn't an option to filter the rows before the CSV file is loaded into a pandas object.

You can either load the file and then filter using df[df['field'] > constant], or if you have a very large file and you are worried about memory running out, then use an iterator and apply the filter as you concatenate chunks of your file e.g.:

import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

You can vary the chunksize to suit your available memory. See here for more details.

I didn't find a straight-forward way to do it within context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows by boolean vector df[bool_vec]:

filtered = df[(df['timestamp'] > targettime)]

This is selecting all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that at least contains a datetime column timestamp) for which the values in the timestamp column are greater than the value of targettime. Similar question.

How can I filter lines on load in Pandas read_csv function?

Tags:

Python

Pandas

Related

Recent Posts