Using Cassandra as an event store

Your design seems to be well modeled in "Cassandra terms". The queries you need are indeed supported by "composite key" tables; you would have something like:

  • query 1: select * from events where id = 'id_event';
  • query 2: select * from events where id = 'id_event' and seq_num > NUMBER;

I do not think the second query is going to be inefficient; however, it may return a lot of elements. If that is a concern, you can cap the number of events returned with the LIMIT keyword.
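For example, a query capped at 500 events (both numbers here are just placeholders) could look like:

select * from events where id = 'id_event' and seq_num > 100 limit 500;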

Using composite keys seems like a good match for your specific requirements. Using "secondary indexes" does not seem to bring much to the table... unless I am missing something in your design/requirements.

HTH.


Your partition key is too granular; you should create a composite partition key or change it to get better performance for time-series modelling. For instance:

CREATE TABLE events (
    event_date int,
    id timeuuid,
    seq_num int,
    data text,
    PRIMARY KEY (event_date, id)
);

This way your id becomes a clustering column that just guarantees event uniqueness, while your partition key (e.g. 20160922) groups all events per day. You could change it to group per month as well. Avoid using uuid; use timeuuid instead, as it already stores timestamp information.
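A sketch of what an insert could look like with this schema; now() generates a fresh timeuuid, and the bucket value and payload are just placeholders:

INSERT INTO events (event_date, id, seq_num, data)
VALUES (20160922, now(), 1, 'event payload');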


What you've got is good, except in the case of many events for a particular aggregate. One thing you could do is create static columns to hold "next" and "max_sequence". The idea is that the static columns would hold the current max sequence for this partition and the "artificial id" for the next partition. You could then, say, store 100 or 1000 events per partition. What you've essentially done then is bucket the events for an aggregate into multiple partitions. This means additional overhead for querying and storing, but at the same time it protects against unbounded growth. You might even create a lookup table of partitions per aggregate. It really depends on your use case and how "clever" you want it to be.
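A minimal sketch of that bucketing idea; the table and column names (events_by_bucket, bucket, max_sequence, next_bucket) are illustrative, not from your design:

CREATE TABLE events_by_bucket (
    id uuid,
    bucket int,               -- "artificial id": 0, 1, 2, ...
    seq_num int,
    data text,
    max_sequence int static,  -- current max sequence in this partition
    next_bucket int static,   -- points to the next partition, if any
    PRIMARY KEY ((id, bucket), seq_num)
);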


I've been using Cassandra for a very similar scenario (with 100k+ columns per row) and ended up with a model close to yours. I also agree with emgsilva that a secondary index probably won't bring much.

There are three things that turned out to be significant for good performance in our event store: using composite columns, making sure the columns are in a nicely sortable order (Cassandra sorts the data within a row by column), and using compact storage if possible.

Note that compact storage means you can only have one value column. Hence, you need to make all other columns part of the key.

For you, the schema would be:

CREATE TABLE events (
    id uuid,
    seq_num int,
    timestamp timestamp,
    data text,
    PRIMARY KEY (id, seq_num, timestamp)
) WITH COMPACT STORAGE;
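With that schema, a read for a single aggregate comes back already sorted by the clustering columns; for example (the uuid is a placeholder):

SELECT seq_num, timestamp, data FROM events
WHERE id = 9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d AND seq_num > 100;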
