Cassandra denormalization datamodel

"Yes" for the most part, taking an approach of query-based data modeling really is the best way to do it.

  1. That is still a good idea to do, because the speed of your query times make it worth it. Yes, there's a little more housecleaning to do. I haven't had to execute 100s of deletes from other column families, but occasionally there is some complicated clean-up to do. But, you shouldn't be doing a whole lot of deleting in Cassandra anyway (anti-pattern).

  2. No. Client-side JOINs are just as bad as distributed JOINs. The whole idea is to create a table to return data for each specific query...denormalized and/or replicated...and thus negating the need to do a JOIN at all. The exception to this, is if you are running OLAP queries for analysis, you can use a tool like Apache Spark to execute an ad-hoc, distributed JOIN. But it's definitely not something you'd want to do on a production system.

  3. A few articles I can recommend:

    • Getting Started with Cassandra Time Series Data Modeling - Written by DataStax's Chief Evangelist Patrick McFadin, it covers one of the more common Cassandra use cases in a few different ways.
    • Escaping From Disco-Era Data Modeling - This one talks about some of the obstacles that beginners with Cassandra can face, as well as the general approach to take in overcoming them. Disclaimer: I am the author.
    • Cassandra Data Modeling Best Practices, Part 1 - You can't go wrong with Jay Patel's (eBay) classic article on Cassandra modeling practices. It's a little dated in that the examples are grounded in the pre-CQL world, but the techniques still resonate.