Partition Kinesis firehose S3 records by event time

Kinesis Firehose doesn't (yet) allow clients to control how the date suffix of the final S3 objects is generated.

The only option with you is to add a post-processing layer after Kinesis Firehose. For e.g., you could schedule an hourly EMR job, using Data Pipeline, that reads all files written in last hour and publishes them to correct S3 destinations.


It's not an answer for the question, however I would like to explain a little bit the idea behind storing records in accordance with event arrival time.

First a few words about streams. Kinesis is just a stream of data. And it has a concept of consuming. One can reliable consume a stream only by reading it sequentially. And there is also an idea of checkpoints as a mechanism for pausing and resuming the consuming process. A checkpoint is just a sequence number which identifies a position in the stream. Via specifying this number, one can start reading the stream from the certain event.

And now go back to default s3 firehose setup... Since the capacity of kinesis stream is quite limited, most probably one needs to store somewhere the data from kinesis to analyze it later. And the firehose to s3 setup does this right out of the box. It just stores raw data from the stream to s3 buckets. But logically this data is the still the same stream of records. And to be able to reliable consume (read) this stream one needs these sequential numbers for checkpoints. And these numbers are records arrival times.

What if I want to read records by creation time? Looks like the proper way to accomplish this task is to read the s3 stream sequentially, dump it to some [time series] database or data warehouse and do creation-time-based readings against this storage. Otherwise there will be always a non-zero chance to miss some bunches of events while reading the s3 (stream). So I would not suggest the reordering of s3 buckets at all.