In-memory vs persistent state stores in Kafka Streams?

I don't see any real reason to swap current RocksDB store. In fact RocksDB one of the fastest k,v store: Percona benchmarks (based on RocksDB)

with in-memory ones - RocksDB already acts as in-memory with some LRU algorithms involved:

RocksDB architecture

The three basic constructs of RocksDB are memtable, sstfile and logfile. The memtable is an in-memory data structure - new writes are inserted into the memtable and are optionally written to the logfile.

But there is one more noticeable reason for choosing this implementation:

RocksDB source code

If you will look at source code ratio - there are a lot of Java api exposed from C++ code. So, it's much simpler to integrate this product in existing Java - based Kafka ecosystem with comprehensive control over store, using exposed api.


I've got a very limited understanding of the internals of Kafka Streams and the different use cases of state stores, esp. in-memory vs persistent, but what I managed to learn so far is that a persistent state store is one that is stored on disk (and hence the name persistent) for a StreamTask.

That does not give much as the names themselves in-memory vs persistent may have given the same understanding, but something that I found quite refreshing was when I learnt that Kafka Streams tries to assign partitions to the same Kafka Streams instances that had the partitions assigned before (a restart or a crash).

That said, an in-memory state store is simply recreated (replayed) every restart which takes time before a Kafka Streams application is up and running while a persistent state store is something already materialized on a disk and the only time the Kafka Streams instance has to do to re-create the state store is to load the files from disk (not from the changelog topic that takes longer).

I hope that helps and I'd be very glad to be corrected if I'm wrong (or partially correct).