kafka streams - how to set a new key for KTable

I don't think the way @Matthias described it is accurate/detailed enough. It is correct, but the root cause of this limitation (which exists for the ksqlDB CREATE TABLE syntax as well) goes beyond the sheer fact that keys must be unique for a KTable.

The uniqueness in itself doesn't limit KTables. After all, any underlying topic can, and often does, contain messages with the same key. A KTable has no problem with that: it simply enforces the latest state for each key. There are multiple consequences of this, including the fact that a KTable built from an aggregation can produce several messages into its output topic based on a single input message... But let's get back to your question.
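To make those upsert semantics concrete, here is a plain-Java sketch (not the Kafka Streams API; a HashMap stands in for the table's state store, and the keys/values are made up) showing how only the latest value per key is kept:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UpsertDemo {
    public static void main(String[] args) {
        // Messages in offset order within one partition: key -> value
        List<Map.Entry<String, String>> messages = List.of(
                Map.entry("A", "v1"),
                Map.entry("B", "v1"),
                Map.entry("A", "v2")  // same key again: newer state overwrites the old one
        );

        // A KTable's state store keeps only the latest value per key
        Map<String, String> tableState = new LinkedHashMap<>();
        for (Map.Entry<String, String> msg : messages) {
            tableState.put(msg.getKey(), msg.getValue());
        }

        System.out.println(tableState); // {A=v2, B=v1}
    }
}
```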

So, a KTable needs to know which message for a specific key is the last one, i.e. which message represents the latest state for that key.

What ordering guarantees does Kafka have? Correct: only on a per-partition basis.

What happens when messages are re-keyed? Correct: they will be spread across partitions in a way that differs from the input messages.

So, the initial messages with the same key are stored by the broker itself into the same partition (assuming you didn't do anything fancy/stupid with a custom Partitioner). That way the KTable can always infer the latest state.

But what happens if the messages are re-keyed inside Kafka Streams application in-flight?

They will be spread across partitions again, but with a different key now. And if your application is scaled out and several tasks are working in parallel, you simply can't guarantee that the last message for a new key is actually the last message as it was stored in the original topic. Separate tasks don't have any coordination like that, and they can't: it wouldn't be efficient.
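A toy illustration of that partition spreading (a simplified hash-mod partitioner is used here; Kafka's default partitioner actually applies murmur2 to the serialized key, and all key names are made up):

```java
import java.util.List;

public class RekeyPartitionDemo {
    // Simplified stand-in for a partitioner (Kafka actually uses murmur2)
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;

        // Original records have distinct keys, so they may live in different
        // partitions and be processed by different, uncoordinated tasks.
        List<String> originalKeys = List.of("user-1", "user-7");
        for (String key : originalKeys) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }

        // After selectKey() both records may carry the SAME new key (e.g. the
        // same application id), so both target one partition of the repartition
        // topic -- but which write arrives first depends on task scheduling,
        // not on the original offset order.
        String newKey = "app-42";
        System.out.println(newKey + " -> partition " + partitionFor(newKey, numPartitions));
    }
}
```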

As a result, a KTable would lose its main semantic if such re-keying were allowed.


If you want to set a new key, you need to re-group the KTable:

KTable newTable = table.groupBy(/* put select-key function here */)
                       .aggregate(...);

Because a key must be unique for a KTable (in contrast to a KStream), it's required to specify an aggregation function that aggregates all records with the same (new) key into a single value.

Since Kafka 2.5, Kafka Streams also supports the KStream#toTable() operator. Thus, it is also possible to do table.toStream().selectKey(...).toTable(). There are advantages and disadvantages for both approaches.

The main disadvantage of using toTable() is that it will repartition the input data based on the new key, which leads to interleaved writes into the repartition topic and thus to out-of-order data. While the first approach via groupBy() uses the same implementation, using the aggregation function helps you to resolve "conflicts" explicitly. If you use the toTable() operator, a "blind" upsert based on the offset order of the repartition topic is done (this is actually similar to the code example in the other answers).

Example:

Key | Value
 A  | (a,1)
 B  | (a,2)

If you re-key on a, your output table would be either one of both (but it's not defined which one):

Key | Value          Key | Value
 a  | 1               a  |  2

The operation to "rekey" a table is semantically always ill-defined.
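The difference between the two approaches can be simulated in plain Java (a deliberate simplification: real KGroupedTable aggregation also involves a subtractor, and the "keep the maximum" resolution strategy is purely hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConflictResolutionDemo {
    public static void main(String[] args) {
        // Two records that collide on the same new key "a" after re-keying.
        // Their arrival order in the repartition topic is non-deterministic;
        // here value 2 happens to arrive first.
        List<Map.Entry<String, Integer>> arrival = List.of(
                Map.entry("a", 2),
                Map.entry("a", 1)
        );

        // toTable(): blind upsert -- whichever record arrives last wins
        Map<String, Integer> blindUpsert = new HashMap<>();
        for (var rec : arrival) {
            blindUpsert.put(rec.getKey(), rec.getValue());
        }
        System.out.println("toTable(): " + blindUpsert);

        // groupBy().aggregate(): an explicit adder resolves the conflict
        // deterministically (hypothetical strategy: keep the maximum)
        Map<String, Integer> aggregated = new HashMap<>();
        for (var rec : arrival) {
            aggregated.merge(rec.getKey(), rec.getValue(), Math::max);
        }
        System.out.println("aggregate(): " + aggregated);
    }
}
```

With the blind upsert the result depends on arrival order, while the aggregation function produces the same answer regardless of which record shows up first.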


@Matthias's answer led me down the right path, but I thought having a sample piece of code might help out here:

final KTable<String, User> usersKeyedByApplicationIDKTable = usersKTable.groupBy(
        // First, going to set the new key to the user's application id
        (userId, user) -> KeyValue.pair(user.getApplicationID().toString(), user)
).aggregate(
        // Initialize the aggregate value
        () -> null,
        // adder (doing nothing, just passing the user through as the value)
        (applicationId, user, aggValue) -> user,
        // subtractor (doing nothing, just passing the user through as the value)
        (applicationId, user, aggValue) -> user
);

KGroupedTable aggregate() documentation: https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/kstream/KGroupedTable.html#aggregate-org.apache.kafka.streams.kstream.Initializer-org.apache.kafka.streams.kstream.Aggregator-org.apache.kafka.streams.kstream.Aggregator-org.apache.kafka.streams.kstream.Materialized-