How to join tables in AWS DynamoDB?

Update: This answer is well within the defined community guidelines and not a non-answer speaking only about a commercial solution.


One solution I have seen come up multiple times in this space is to sync from DynamoDB into a separate database that is more well suited for the types of operations you're looking for.

I wrote a blog about this topic comparing various approaches I've seen people take to this very problem, but I'll summarize some of the key takeaways here so you don't have to read all of it.

DynamoDB secondary indexes

What's good?

  1. Fast and no other systems needed!
  2. Good for a very specific analytic feature you're building (like a leaderboard)

Considerations

  1. Limited # of secondary indexes, limited fidelity of queries
  2. Expensive if you're depending on scans
  3. Security and performance concerns using production database directly for analytics

DynamoDB + Glue + S3 + Athena

Architecture

What's good?

  1. All components are “serverless” and require no provisioning of infrastructure
  2. Easy to automate ETL pipeline

Considerations

  1. High end-to-end data latency of several hours, which means stale data
  2. Query latency varies between tens of seconds to minutes
  3. Schema enforcement can lose information with mixed types
  4. ETL process can require maintenance from time to time if structure of data in source changes

DynamoDB + Hive/Spark

Architecture

What's good?

  1. Queries over latest data in DynamoDB
  2. Requires no ETL/pre-processing other than specifying a schema

Considerations

  1. Schema enforcement can lose information when fields have mixed types
  2. EMR cluster requires some administration and infrastructure management
  3. Queries over the latest data involves scans and are expensive
  4. Query latency varies between tens of seconds to minutes directly on Hive/Spark
  5. Security and performance implications of running analytical queries on an operational database

DynamoDB + AWS Lambda + Elasticsearch

What's good?

  1. Full-text search support
  2. Support for several types of analytical queries
  3. Can work over the latest data in DynamoDB

Considerations

  1. Requires management and monitoring of infrastructure for ingesting, indexing, replication, and sharding
  2. Requires separate system to ensure data integrity and consistency between DynamoDB and Elasticsearch
  3. Scaling is manual and requires provisioning additional infrastructure and operations
  4. No support for joins between different indexes

DynamoDB + Rockset

Architecture

What's good?

  1. Completely serverless. No operations or provisioning of infrastructure or database required
  2. Live sync between DynamoDB and the Rockset collection, so that they are never more than a few seconds apart
  3. Monitoring to ensure consistency between DynamoDB and Rockset
  4. Automatic indexes built over the data enabling low-latency queries
  5. SQL query serving that can scale to high QPS
  6. Joins with data from other sources such as Amazon Kinesis, Apache Kafka, Amazon S3, etc.
  7. Integrations with tools like Tableau, Redash, Superset, and SQL API over REST and using client libraries.
  8. Features including full-text search, ingest transformations, retention, encryption, and fine-grained access control

Considerations

  1. Not a great fit for storing rarely queried data (like machine logs)
  2. Not a transactional datastore

(Full Disclosure: I work on the product team @ Rockset) Check out the blog for more details on the individual approaches.


You are correct, DynamoDB is not designed as a relational database and does not support join operations. You can think about DynamoDB as just being a set of key-value pairs.

You can have the same keys across multiple tables (e.g. document_IDs), but DynamoDB doesn't automatically sync them or have any foreign-key features. The document_IDs in one table, while named the same, are technically a different set than the ones in a different table. It's up to your application software to make sure that those keys are synced.

DynamoDB is a different way of thinking about databases and you might want to consider using a managed relational database such as Amazon Aurora: https://aws.amazon.com/rds/aurora/

One thing to note, Amazon EMR does allow DynamoDB tables to be joined, but I'm not sure that's what you're looking for: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMRforDynamoDB.html


With DynamoDB, rather than join I think the best solution is to store the data in the shape you later intend to read it.

If you find yourself requiring complex read queries you might have fallen into the trap of expecting DynamoDB to behave like an RDBMS, which it is not. Transform and shape the data you write, keep the read simple.

Disk is far cheaper than compute these days - don't be afraid to denormalise.