How do you track record relations in NoSQL?

All the answers for how to store many-to-many associations in the "NoSQL way" reduce to the same thing: storing data redundantly.

In NoSQL, you don't design your database based on the relationships between data entities. You design your database based on the queries you will run against it. Use the same criteria you would use to denormalize a relational database: if it's more important for data to have cohesion (think of values in a comma-separated list instead of a normalized table), then do it that way.

But this inevitably optimizes for one type of query (e.g. comments by any user for a given article) at the expense of other types of queries (comments for any article by a given user). If your application has the need for both types of queries to be equally optimized, you should not denormalize. And likewise, you should not use a NoSQL solution if you need to use the data in a relational way.

There is a risk with denormalization and redundancy that redundant sets of data will get out of sync with one another. This is called an anomaly. When you use a normalized relational database, the RDBMS can prevent anomalies. In a denormalized database or in NoSQL, it becomes your responsibility to write application code to prevent anomalies.
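As a rough illustration of that responsibility, here is a minimal sketch in plain Python (no particular NoSQL product; the names `comments_by_article`, `comments_by_user`, `add_comment` and `edit_comment` are made up for this example). The same comment is stored twice, once per query, and the application has to touch both copies on every write.

```python
from collections import defaultdict

# Two redundant copies of the same comments, each keyed for one query.
# These in-memory dicts stand in for two denormalized collections.
comments_by_article = defaultdict(list)
comments_by_user = defaultdict(list)

def add_comment(article_id, user_id, text):
    """Write the comment to both copies; skipping either write creates an anomaly."""
    comment = {"article": article_id, "user": user_id, "text": text}
    comments_by_article[article_id].append(comment)
    # Store a separate copy, as a real denormalized store would.
    comments_by_user[user_id].append(dict(comment))

def edit_comment(article_id, user_id, old_text, new_text):
    """The application, not the database, must update every redundant copy."""
    for copy in (comments_by_article[article_id], comments_by_user[user_id]):
        for c in copy:
            if (c["article"], c["user"], c["text"]) == (article_id, user_id, old_text):
                c["text"] = new_text

add_comment("a1", "alice", "First!")
add_comment("a1", "bob", "Nice post")
edit_comment("a1", "alice", "First!", "First (edited)")
```

If `edit_comment` updated only one of the two structures, the two queries would start returning different text for the same comment, which is exactly the kind of anomaly described above.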

One might think that it'd be great for a NoSQL database to do the hard work of preventing anomalies for you. There is a paradigm that can do this -- the relational paradigm.


The CouchDB approach is to emit the appropriate classes of data in the map phase and summarize them in the reduce phase. So you could map all comments, emit 1 for the given user, and later print out only those entries. However, building persistent views of all trackable data in CouchDB would require a lot of disk storage. By the way, they also have a wiki page about relationships: http://wiki.apache.org/couchdb/EntityRelationship.
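To make the map/reduce idea concrete, here is a small simulation in plain Python. It is not CouchDB's view API (real views are defined in design documents); the sample documents and the names `map_comment`, `reduce_counts` and `run_view` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical comment documents; the field names are assumptions.
comments = [
    {"user": "alice", "article": "a1"},
    {"user": "bob", "article": "a1"},
    {"user": "alice", "article": "a2"},
]

def map_comment(doc):
    """Map phase: emit (key, value) pairs, here a 1 per comment keyed by user."""
    yield doc["user"], 1

def reduce_counts(values):
    """Reduce phase: summarize all values emitted for one key."""
    return sum(values)

def run_view(docs):
    """Group the emitted pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_comment(doc):
            grouped[key].append(value)
    return {key: reduce_counts(values) for key, values in grouped.items()}

comments_per_user = run_view(comments)
```

Keying the emit by article instead of user would give the opposite query, at the cost of a second persistent view.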

Riak, on the other hand, has a tool for building relations: the link. You can store the address of a linked document (here, a comment) in the 'root' document (here, the user document). There is one catch: in a distributed setup, the document may be modified in many locations at the same time. That causes conflicts and, as a result, a huge vector clock tree :/ ..not so bad, not so good.

Riak also has another 'mechanism': a two-level key namespace, made up of a bucket and a key. So, for the student example, if we have clubs A, B and C and students StudentX and StudentY, you could maintain the following convention:

{ Key = {ClubA, StudentX}, Value = true }, 
{ Key = {ClubB, StudentX}, Value = true }, 
{ Key = {ClubA, StudentY}, Value = true }

and to read a relation, just list the keys in the given buckets. What's wrong with that? It is damn slow. Listing buckets was never a priority for Riak, though it is getting better and better. By the way, you do not waste memory, because the {true} value in this example can be linked to the single full profile of StudentX or StudentY (here conflicts are not possible).
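The bucket/key convention can be sketched with nested dictionaries in plain Python. This only models the idea; it does not use a Riak client, and the names `buckets`, `students_in` and `clubs_of` are hypothetical.

```python
# Outer key = bucket (the club), inner key = key (the student).
# The value True is just a marker that the membership exists.
buckets = {
    "ClubA": {"StudentX": True, "StudentY": True},
    "ClubB": {"StudentX": True},
    "ClubC": {},
}

def students_in(club):
    """List the keys of one bucket: every student in the given club."""
    return sorted(buckets.get(club, {}))

def clubs_of(student):
    """Reverse lookup: scan every bucket, which is the slow part."""
    return sorted(club for club, keys in buckets.items() if student in keys)
```

Reading one direction is a single key listing, while the other direction has to walk every bucket, which is why this convention favors one query over the other.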

As you can see, NoSQL != NoSQL. You need to look at the specific implementation and test it for yourself.

The column stores mentioned before look like a good fit for relations, but it all depends on your A, C and P needs (availability, consistency and partition tolerance) ;) If you do not need A and you have less than petabytes of data, just leave it and go ahead with MySQL or Postgres.

good luck