Pull large amounts of data from a remote server into a DataFrame

I suspect there are a couple of (related) things at play here causing slowness:

  1. read_sql is written in python so it's a little slow (especially compared to read_csv, which is written in cython - and carefully implemented for speed!) and it relies on sqlalchemy rather than some (potentially much faster) C-DBAPI. The impetus behind moving to sqlalchemy was to make that switch easier in the future (as well as to provide cross-sql-platform support).
  2. You may be running out of memory because too many python objects are in memory (this is related to not using a C-DBAPI), but this could potentially be addressed...

I think the immediate solution is a chunk-based approach (and there is a feature request to have this work natively in pandas read_sql and read_sql_table).

EDIT: As of Pandas v0.16.2 this chunk-based approach is natively implemented in read_sql (pass the chunksize argument and it returns an iterator of DataFrames).
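
For example (a minimal sketch, assuming a postgres connection string and a users table - substitute your own):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@host:5432/dbname')  # placeholder connection string

# chunksize makes read_sql return an iterator of DataFrames instead of one big frame
chunks = pd.read_sql('SELECT * FROM users', engine, chunksize=10000)
df = pd.concat(list(chunks), ignore_index=True)  # only sensible if the full result fits in memory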


Since you're using postgres you have access to the LIMIT and OFFSET clauses, which makes chunking quite easy. (Am I right in thinking these aren't available in all SQL dialects?)

First, get the number of rows (or an estimate) in your table:

nrows = con.execute('SELECT count(*) FROM users').fetchone()[0]  # also works with an sqlalchemy engine

Use this to iterate through the table (for debugging you could add some print statements to confirm it's working and hasn't crashed!) and then combine the results:

import pandas as pd

def read_sql_chunked(query, con, nrows, chunksize=1000):
    start = 0
    dfs = []  # Note: could probably make this neater with a generator/for loop
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), con)
        dfs.append(df)
        start += chunksize  # advance the offset, otherwise this loops forever
    return pd.concat(dfs, ignore_index=True)
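
Usage would look something like this (the query and connection are the same placeholders as above):

query = 'SELECT * FROM users'  # no trailing semicolon, since LIMIT/OFFSET get appended to it
df = read_sql_chunked(query, con, nrows, chunksize=5000)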

Note: this assumes that the full query result fits in memory! If it doesn't you'll need to work on each chunk (mapreduce style)... or invest in more memory!
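
For the work-on-each-chunk route, a generator version keeps only one chunk in memory at a time. Here's a minimal sketch (iter_sql_chunks is just a hypothetical helper, and the per-country count is a stand-in for whatever "map" step you actually need, assuming the table has a country column):

def iter_sql_chunks(query, con, nrows, chunksize=1000):
    # yield one DataFrame per chunk instead of accumulating them all
    start = 0
    while start < nrows:
        yield pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), con)
        start += chunksize

# "map": reduce each chunk to a small partial result; "reduce": combine the partials
partial_counts = [chunk.groupby('country').size() for chunk in iter_sql_chunks(query, con, nrows)]
total_counts = pd.concat(partial_counts).groupby(level=0).sum()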