Efficiently Read last 'n' rows of CSV into DataFrame

I don't think pandas offers a way to do this in read_csv.

Perhaps the neatest (in one pass) is to use collections.deque:

from collections import deque
from StringIO import StringIO

with open(fname, 'r') as f:
    q = deque(f, 2)  # replace 2 with n (lines read at the end)

In [12]: q
Out[12]: deque(['7,8,9\n', '10,11,12'], maxlen=2)
         # these are the last two lines of my csv

In [13]: pd.read_csv(StringIO(''.join(q)), header=None)

Another option worth trying is to get the number of lines in a first pass and then read the file again, skip that number of rows (minus n) using read_csv...

Files are simply streams of bytes. Lines do not exist as separate entities; they are an artifact of treating certain bytes as newline characters. As such, you must read from the beginning of the file to identify lines in order.

If the file doesn't change (often) and this is an operation you need to perform often (say, with different values of n), you can store the byte offsets of the newline characters in a second file. You can use this much-smaller file and the seek command to quickly jump to a given line in the first file and read from there.

(Some operating systems provide record-oriented files that have more complex internal structure than the common flat file. The above does not apply to them.)

Here's a handy way to do. Works well for what I like to do -

import tailer
import pandas as pd
import io

with open(filename) as file:
    last_lines = tailer.tail(file, 15)

df = pd.read_csv(io.StringIO('\n'.join(last_lines)), header=None)

You need to install tailer, to have this working:

pip install --user tailer

Efficiently Read last 'n' rows of CSV into DataFrame

Tags:

Python

Pandas

Csv

Numpy

Related

Recent Posts