Apache Kafka order windowed messages based on their value

Here's an outline:

Create a Processor implementation that:

  • in the process() method, for each message:

    • reads the timestamp from the message value
    • inserts it into a KeyValueStore, using the (timestamp, message-key) pair as the key and the message value as the value. NB: this also provides de-duplication. You'll need a custom Serde that serializes the key with the timestamp first, byte-wise, so that ranged queries return entries ordered by timestamp.
  • in the punctuate() method:

    • reads the store using a ranged fetch from 0 to timestamp - 60000, where timestamp is the value passed to punctuate() and 60000 ms = 1 minute
    • sends the fetched messages in order using context.forward() and deletes them from the store
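
The buffering and flushing steps above can be sketched without any Kafka dependency. This is only an illustration under stated assumptions, not a drop-in Processor: a TreeMap ordered by unsigned byte comparison stands in for the KeyValueStore, and ReorderBuffer, compositeKey and the method signatures are hypothetical names chosen for the example. It shows why a big-endian timestamp prefix makes byte-wise order match timestamp order (for non-negative timestamps), and how a repeated (timestamp, key) pair overwrites the earlier entry.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the processor's buffering logic; no Kafka classes are used.
class ReorderBuffer {
    // Unsigned lexicographic comparison, the byte-wise order a ranged query relies on.
    private static final Comparator<byte[]> BYTEWISE = Arrays::compareUnsigned;

    // Sorted map standing in for the KeyValueStore.
    private final TreeMap<byte[], String> store = new TreeMap<>(BYTEWISE);

    // Composite key: 8-byte big-endian timestamp followed by the UTF-8 message key.
    // ByteBuffer writes big-endian by default, so the most significant byte comes
    // first and byte order matches numeric order for non-negative timestamps.
    static byte[] compositeKey(long timestamp, String messageKey) {
        byte[] keyBytes = messageKey.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Long.BYTES + keyBytes.length)
                .putLong(timestamp)
                .put(keyBytes)
                .array();
    }

    // process(): buffer the message. A repeated (timestamp, key) pair overwrites
    // the earlier entry, which gives the de-duplication effect.
    void process(long timestamp, String messageKey, String value) {
        store.put(compositeKey(timestamp, messageKey), value);
    }

    // punctuate(): return everything strictly older than (timestamp - delayMs),
    // in timestamp order, deleting each entry as it is emitted.
    List<String> punctuate(long timestamp, long delayMs) {
        List<String> forwarded = new ArrayList<>();
        long cutoff = timestamp - delayMs;
        if (cutoff <= 0) {
            return forwarded; // nothing can be old enough yet
        }
        byte[] upperBound = compositeKey(cutoff, "");
        Iterator<Map.Entry<byte[], String>> it =
                store.headMap(upperBound, false).entrySet().iterator();
        while (it.hasNext()) {
            forwarded.add(it.next().getValue()); // context.forward(...) in a real processor
            it.remove();                         // removes from the backing map too
        }
        return forwarded;
    }

    int size() {
        return store.size();
    }
}
```

In a real topology the same byte layout would live inside the custom Serde for the store key, and the ranged fetch would go through the store's own range query rather than headMap.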

The problem with this approach is that punctuate() is not triggered if no new messages arrive to advance the "stream time". If this is a risk in your case, you can create an external scheduler that sends periodic "tick" messages to each(!) partition of your topic. Your processor should simply ignore them, but they will cause punctuate() to trigger in the absence of "real" messages. KIP-138 will address this limitation by adding explicit support for system-time punctuation: https://cwiki.apache.org/confluence/display/KAFKA/KIP-138%3A+Change+punctuate+semantics
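
To make the "tick" workaround concrete, here is a toy model of stream-time punctuation. It is an assumption for illustration only and does not reproduce Kafka's scheduler: StreamTimeDemo, TICK_KEY and onRecord are hypothetical names. It shows that a tick record advances stream time, and so triggers the pending punctuations, while contributing no data itself.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of stream-time punctuation; not Kafka's actual scheduling code.
class StreamTimeDemo {
    // Hypothetical marker key that identifies tick messages.
    static final String TICK_KEY = "__tick__";

    private final long intervalMs;
    private long nextPunctuation;

    final List<String> processed = new ArrayList<>();  // values of real messages
    final List<Long> punctuatedAt = new ArrayList<>(); // times punctuate "fired"

    StreamTimeDemo(long intervalMs) {
        this.intervalMs = intervalMs;
        this.nextPunctuation = intervalMs;
    }

    // Every record, tick or not, advances stream time through its timestamp.
    void onRecord(long timestamp, String key, String value) {
        while (timestamp >= nextPunctuation) {
            punctuatedAt.add(nextPunctuation); // punctuate(nextPunctuation) would run here
            nextPunctuation += intervalMs;
        }
        if (TICK_KEY.equals(key)) {
            return; // ticks carry no data; the processor drops them
        }
        processed.add(value);
    }
}
```

Without the tick record, no punctuation would fire in this model until the next real message arrived, which is the stall described above.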