What is the fastest way to save a large pandas DataFrame to S3?
Use multipart uploads to speed up the transfer to S3, and compress the data so there are fewer bytes to send in the first place.
```python
import boto3
from io import BytesIO

s3 = boto3.client('s3')

# Gzip the CSV in memory (pandas >= 1.2 can compress to a binary buffer)
csv_buffer = BytesIO()
df.to_csv(csv_buffer, compression='gzip')
csv_buffer.seek(0)  # rewind the buffer before uploading

# upload_fileobj automatically switches to a multipart upload for large objects;
# use boto3.s3.transfer.TransferConfig if you need to tune part size or concurrency
s3.upload_fileobj(csv_buffer, bucket, key)
```
The docs for `s3.upload_fileobj` are here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj
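To see why compression helps before tuning anything else, here is a minimal stdlib-only sketch (no pandas or boto3 needed; the sample CSV text is made up for illustration) that gzips a CSV payload in memory the same way the snippet above does:

```python
import gzip
import io

# Fabricated CSV payload, just to measure the compression effect
csv_text = "col_a,col_b\n" + "\n".join(f"{i},{i * 2}" for i in range(10_000))
raw = csv_text.encode("utf-8")

# Gzip into an in-memory buffer, mirroring the BytesIO approach above
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(raw)
compressed = buf.getvalue()

print(len(raw), len(compressed))  # the gzipped payload is much smaller
```

Repetitive, column-oriented CSV data typically compresses very well, so the upload moves far fewer bytes.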
You can try using pandas' built-in compression and streaming the output straight to S3 with `s3fs`. Buffering everything in a `BytesIO` first holds the whole file in memory, which is exactly what you want to avoid with a large DataFrame.
```python
import s3fs
import pandas as pd

s3 = s3fs.S3FileSystem(anon=False)
df = pd.read_csv("some_large_file")

# Open in binary mode so the gzip bytes are written as-is,
# streaming to S3 instead of buffering the whole file in memory
with s3.open('s3://bucket/file.csv.gz', 'wb') as f:
    df.to_csv(f, compression='gzip')
```
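The memory difference can be seen without touching S3 at all. In this stdlib-only sketch, `CountingSink` is a hypothetical stand-in for a streaming file handle like the one `s3fs` returns: chunks are handed off as they arrive rather than accumulated, while the `BytesIO` keeps every byte resident until the upload finishes (chunk sizes here are arbitrary):

```python
import io

chunk = b"x" * 1024   # 1 KiB per write
n_chunks = 1024       # 1 MiB total payload

# Buffered approach: BytesIO keeps the entire payload in memory
buf = io.BytesIO()
for _ in range(n_chunks):
    buf.write(chunk)
assert len(buf.getvalue()) == len(chunk) * n_chunks  # whole payload resident

class CountingSink(io.RawIOBase):
    """Hypothetical stand-in for a streaming S3 handle: counts bytes, stores none."""

    def __init__(self):
        super().__init__()
        self.total = 0

    def writable(self):
        return True

    def write(self, b):
        self.total += len(b)  # with s3fs this chunk would go over the network
        return len(b)

# Streaming approach: each chunk is handed off and can be freed immediately
sink = CountingSink()
for _ in range(n_chunks):
    sink.write(chunk)
print(sink.total)  # same number of bytes delivered, nothing accumulated
```

The peak memory of the streaming path is roughly one chunk, independent of the total file size, which is what makes it suitable for large DataFrames.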