What is the fastest way to save a large pandas DataFrame to S3?

You can try using s3fs with pandas compression to upload directly to S3; staging the whole file in a StringIO or BytesIO buffer first is a memory hog.

import s3fs
import pandas as pd

s3 = s3fs.S3FileSystem(anon=False)
df = pd.read_csv("some_large_file")
# open in binary mode so the gzip-compressed bytes are written unchanged
# (recent pandas versions apply the compression when given a binary file handle)
with s3.open('s3://bucket/file.csv.gz', 'wb') as f:
    df.to_csv(f, compression='gzip')

It really depends on the content, but it is not a boto3 question. Try first to dump your DataFrame locally and see which format is fastest and what file size you get.

Here are some suggestions that we have found to be fast, for cases between a few MB to over 2GB (although, for more than 2GB, you really want parquet and possibly split it into a parquet dataset):

  1. Lots of mixed text/numerical data (SQL-oriented content): use df.to_parquet(file).

  2. Mostly numerical data (e.g. if df.dtypes indicates a happy numpy array of a single type, not Object): you can try df.to_hdf(file, 'key') (see the benchmark sketch after this list).
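
A minimal benchmark sketch, assuming a DataFrame df already in memory and the optional dependencies installed (pyarrow or fastparquet for parquet, PyTables for HDF5); the file names are just placeholders:

import os
import time

import pandas as pd

df = pd.read_csv("some_large_file")  # or however you build your DataFrame

# Time a local dump in each format and compare file sizes before touching S3.
for path, writer in [
    ("dump.parquet", lambda p: df.to_parquet(p)),          # needs pyarrow or fastparquet
    ("dump.h5",      lambda p: df.to_hdf(p, key="data")),   # needs PyTables
    ("dump.csv.gz",  lambda p: df.to_csv(p, compression="gzip")),
]:
    start = time.perf_counter()
    writer(path)
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(path) / 1e6
    print(f"{path}: {elapsed:.1f}s, {size_mb:.1f} MB")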

One bit of advice: try to split your df into shards that are meaningful to you (e.g. by time for a timeseries). Especially if you have a lot of updates to a single shard (e.g. the last one in a time series), it will make your download/upload much faster.
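
For example, a sketch of writing a timeseries DataFrame as a partitioned parquet dataset (the timestamp column, the year/month partitioning and the output path are assumptions for illustration; requires pyarrow):

import pandas as pd

# Illustrative timeseries DataFrame; in practice this is your large df.
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=1000, freq="h"),
    "value": range(1000),
})

# Derive partition columns so each (year, month) shard becomes its own file.
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month

# Writes a directory of parquet files; updating the latest shard only means
# rewriting and re-uploading that one file.
df.to_parquet("my_dataset", partition_cols=["year", "month"])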

What we have found is that HDF5 files are bulkier (uncompressed), but they save/load fantastically fast from/into memory. Parquet files are snappy-compressed by default, so they tend to be smaller (depending on the entropy of your data, of course; penalty for you if you save totally random numbers).

For the boto3 client, both multipart_chunksize and multipart_threshold are 8MB by default, which is often a fine choice. You can check via:

from boto3.s3.transfer import TransferConfig

tc = TransferConfig()
print(f'chunksize: {tc.multipart_chunksize}, threshold: {tc.multipart_threshold}')

Also, the default is to use 10 threads for each upload (which does nothing unless the size of your object is larger than the threshold above).
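
If the defaults don't suit your object, here is a sketch of tuning them and passing the config to an upload (the values, file name, bucket and key are placeholders, not recommendations):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Illustrative values: bigger parts and more threads for one large object.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64MB
    multipart_chunksize=64 * 1024 * 1024,  # 64MB parts
    max_concurrency=20,                    # threads used for the parts
)

s3.upload_file('dump.parquet', 'my-bucket', 'path/dump.parquet', Config=config)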

Another question is how to upload many files efficiently; that is not something TransferConfig handles for you. But I digress, the original question is about a single object.
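
For completeness, since it came up: a sketch of uploading several files concurrently with a plain thread pool (the file list, bucket and key prefix are placeholders; this is standard Python threading, not a boto3 feature):

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')
files = ['shard_2023_01.parquet', 'shard_2023_02.parquet']  # placeholders

def upload(path):
    # Each call is its own (possibly multipart) upload; boto3 clients are thread-safe.
    s3.upload_file(path, 'my-bucket', f'dataset/{path}')

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload, files))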


Use multi-part uploads to make the transfer to S3 faster. Compression makes the file smaller, so that will help too.

import boto3
from io import BytesIO

s3 = boto3.client('s3')

csv_buffer = BytesIO()
# recent pandas versions can gzip-compress when writing to a binary buffer
df.to_csv(csv_buffer, compression='gzip')
csv_buffer.seek(0)  # rewind the buffer, or an empty object gets uploaded

# upload_fileobj uses a multipart upload automatically for large objects;
# use boto3.s3.transfer.TransferConfig if you need to tune part size or other settings
s3.upload_fileobj(csv_buffer, bucket, key)  # bucket and key are your target bucket name and object key

The docs for s3.upload_fileobj are here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj