Add a random prefix to the key names to improve S3 performance?

Because of how S3 lookups and writes work, using key names that are similar or sequentially ordered can harm performance.

Prefixing the S3 key with a hash or random ID is still advisable to alleviate high load on heavily accessed objects.

Amazon S3 Performance Tips & Tricks

Request Rate and Performance Considerations

As of an AWS announcement on 7/17/2018, hashing or randomly prefixing the S3 key is no longer required to get improved performance: https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
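To put numbers on that announcement: it quotes a baseline of at least 3,500 requests per second for adding data and 5,500 requests per second for retrieving data, per prefix. A rough back-of-the-envelope sketch of how many prefixes a given request rate would need to be spread over is below. The function name and the example target rates are made up for illustration, and it assumes load is spread evenly across prefixes, which real workloads rarely manage.

```python
import math

# Per-prefix baselines quoted in the July 2018 announcement (requests per second).
WRITE_RPS_PER_PREFIX = 3500  # adding data (PUT/POST/DELETE)
READ_RPS_PER_PREFIX = 5500   # retrieving data (GET)


def prefixes_needed(target_rps: int, per_prefix_rps: int) -> int:
    """Smallest number of prefixes whose combined baseline covers target_rps.

    Hypothetical helper: assumes requests are spread evenly across prefixes.
    """
    if per_prefix_rps <= 0:
        raise ValueError("per_prefix_rps must be positive")
    return max(1, math.ceil(target_rps / per_prefix_rps))


if __name__ == "__main__":
    # Example target rates are illustrative, not measurements.
    print(prefixes_needed(55_000, READ_RPS_PER_PREFIX))
    print(prefixes_needed(10_000, WRITE_RPS_PER_PREFIX))
```

This is only arithmetic on the published baselines; it says nothing about how quickly S3 actually scales a new bucket up to those rates.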

S3 prefixes used to be determined by the first 6-8 characters of the key. This changed in mid-2018; see the announcement: https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/

But that is a half-truth. Prefixes (in the old definition) actually still matter.

S3 is not traditional “storage”: each directory/filename is a separate object in a key/value object store. The data also has to be partitioned/sharded to scale to a quadzillion objects. So yes, this new sharding is kind of “automatic”, but not really if you create a new process that writes to it with crazy parallelism across different subdirectories. Until S3 learns the new access pattern and reshards/repartitions the data accordingly, you may run into throttling.

Learning new access patterns takes time. Repartitioning of the data takes time.

Things did improve in mid-2018 (~10x throughput-wise for a new bucket with no statistics), but it's still not what it could be if the data were partitioned properly. To be fair, this may not apply to you if you don't have a ton of data, or if your access pattern is not hugely parallel. A hugely parallel pattern would be, for example, a Hadoop/Spark cluster running over many TBs of data in S3 with hundreds of tasks or more accessing the same bucket in parallel.

TLDR:

"Old prefixes" still matter. Write data to the root of your bucket, and the first-level directory there will determine the "prefix" (make it random, for example).

"New prefixes" do work, but not initially. S3 takes time to adapt to the load.

PS. Another approach: reach out to your AWS TAM (if you have one) and ask them to pre-partition a new S3 bucket if you expect a ton of data to flood it soon.

How to introduce randomness to S3?

Prefix folder names with random hex hashes. For example: s3://BUCKET/23a6-FOLDERNAME/FILENAME.zip
Prefix file names with timestamps. For example: s3://BUCKET/FOLDERNAME/2013-26-05-15-00-00-FILENAME.zip
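
The hex-prefixed folder layout above can be sketched roughly as follows. This is a minimal illustration, not an official AWS recipe: the helper name hashed_key and the 4-character prefix length are assumptions, and it swaps a truly random prefix for one derived from a SHA-256 hash of the folder name, so the same folder always maps to the same key and can be found again without a lookup table.

```python
import hashlib


def hashed_key(folder: str, filename: str, prefix_len: int = 4) -> str:
    """Build an S3 key of the form '<hex>-FOLDERNAME/FILENAME'.

    Hypothetical helper: the hex prefix is the start of the SHA-256 digest
    of the folder name, so it looks random but is reproducible.
    """
    if prefix_len < 1:
        raise ValueError("prefix_len must be at least 1")
    digest = hashlib.sha256(folder.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{folder}/{filename}"


if __name__ == "__main__":
    # Prints a key shaped like the example above; the hex part depends on the hash.
    print(hashed_key("FOLDERNAME", "FILENAME.zip"))
```

The resulting string is the object key you would pass to whatever S3 client you use. A purely random prefix spreads load just as well, but then you have to store the mapping somewhere to find the object again.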

