Amazon S3 partitioning of files best practices

Update 2018-07

It is no longer required to account for performance when devising a partitioning scheme for your use case; see my InfoQ summary Amazon S3 Increases Request Rate Performance and Drops Randomized Prefix Requirement for details:

Amazon Web Services (AWS) recently announced significantly increased S3 request rate performance and the ability to parallelize requests to scale to the desired throughput. Notably this performance increase also "removes any previous guidance to randomize object prefixes" and enables the use of "logical or sequential naming patterns in S3 object naming without any performance implications".

Update 2013-09

The information in the referenced link, while still largely accurate, has been supplanted by a newer document, S3 Request Rate and Performance Considerations.


Initial answer

This is a problem with Amazon S3 as well, albeit only for significant storage requirements; see Amazon S3 Performance Tips & Tricks for a detailed answer, including strategies for partitioning your object space.


The previous answers are now obsolete; see Amazon S3 Announces Increased Request Rate Performance (https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/): "This S3 request rate performance increase removes any previous guidance to randomize object prefixes to achieve faster performance. That means you can now use logical or sequential naming patterns in S3 object naming without any performance implications."


It's still worth thinking of a scheme to chunk your objects up into prefixes, if for no other reason than having a way to filter your files when you want to look around manually.

But don't spend too much time on it unless you are certain of all the ways you will commonly need to access your files... You can always migrate to a new scheme later.

YEARS LATER

I organize all buckets like this by default:

bucket:/type/YYYY/MM/DD/human_useful_filename_UNIQ_STUFF.ext

Where:

  • bucket = the bucket name
  • type = the type of artifact, as defined by my app
  • YYYY/MM/DD = the date the object was created
  • human_useful_filename_UNIQ_STUFF.ext = I put something at least slightly debuggable as the first part of the filename, then something to ensure it's unique as a suffix, followed by the regular extension. That way, if you do find yourself lurking in the S3 UI or console, you can at least try to ascertain what's going on (more useful in dev & test contexts, at least).
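As a minimal sketch of this scheme (the helper name and the choice of a UUID fragment as the unique suffix are my own assumptions, not part of the original answer), key generation might look like:

```python
from datetime import datetime, timezone
from uuid import uuid4

def make_key(artifact_type: str, human_name: str, ext: str) -> str:
    """Build an S3 object key of the form type/YYYY/MM/DD/name_UNIQ.ext."""
    now = datetime.now(timezone.utc)
    uniq = uuid4().hex[:12]  # short random fragment to ensure uniqueness
    return f"{artifact_type}/{now:%Y/%m/%d}/{human_name}_{uniq}.{ext}"

print(make_key("invoice", "acme_march_report", "pdf"))
```

The human-readable part comes first so keys sort and filter sensibly in listings, while the random suffix prevents collisions between objects created the same day.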

If you have lots of objects (on average more than ~1,000 per day), then splitting on HH (the hour) is worth it too.
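The hour-level variant is a one-segment extension of the same idea (again a sketch; the function name is my own assumption):

```python
from datetime import datetime, timezone
from uuid import uuid4

def make_hourly_key(artifact_type: str, human_name: str, ext: str) -> str:
    """Like the daily scheme, but with an extra HH segment:
    type/YYYY/MM/DD/HH/name_UNIQ.ext"""
    now = datetime.now(timezone.utc)
    uniq = uuid4().hex[:12]  # short random fragment to ensure uniqueness
    return f"{artifact_type}/{now:%Y/%m/%d/%H}/{human_name}_{uniq}.{ext}"

print(make_hourly_key("log", "web_access", "gz"))
```

Keeping each prefix to a manageable object count mainly helps humans and listing tools; per the 2018 update above, it is no longer needed for request-rate performance.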