What's the point of using Amazon SimpleDB?

I use SDB on a couple of large-ish applications. The 10 GB limit per domain does worry me, but we are gambling on Amazon allowing this to be extended if we need it. They have a request form on their site if you want more space.

As far as cross domain joins, don't think of SDB as a traditional database. During the migration of my data to SDB, I had to denormalize some of it so I could manually do the cross domain joins.

The 1000 byte per attribute limitation was tough to work around also. One of the applications I have is a blog service which stores posts and comments in the database. While porting it over to SDB, I ran into this limitation. I ended up storing the posts and comments as files in S3, and read that in my code. Since this server is on EC2, the traffic to S3 isn't costing anything extra.

Perhaps one of the other problems to watch out for is the eventual consistency model on SDB. You can't write data and then read it back with any guarantee that the newly written data will be returned to you. Eventually the data will be updated.

All of this said, I still love SDB. I don't regret switching to it. I moved from a SQL 2005 server. I think I had a lot more control with SQL, but once giving up that control, I have more flexibility. Not needing to pre-define the schema is awesome. With a strong and robust caching layer in your code, it's easy to make SDB more flexible.


I have about 50GB in SimpleDB, sharded across 30 domains. I use this to allow multiple keys on objects stored in S3, and also to reduce my S3 costs. I haven't played with using SimpleDB for full-text search, but I would not attempt it.

SimpleDB works, it's easy, and so on, but it isn't the right set of features for every situation. In your case, if you need aggregation, SimpleDB is not the right solution. It is built around the school of thought that the DB is just a key value store, and aggregation should be handled by an aggregation process that writes the results back to the key value store. This is exactly what is needed for some applications.

Here is a description of how I pinch pennies using SimpleDB


It's worth adding that while having to write your own sharding logic across domains is not ideal, it is in terms of performance. If for example you need to search across 100gb of data, it's better to ask 20 machines holding 5gb each to perform the same search on the portion they're responsible for, rather than one machine having to perform the entire task. If your goal is to end up with a sorted list, you can take the best results returned from the 20 simultaneous queries and collate them on the machine initiating the request.

That said, I would rather like to see this abstracted from normal use and have something like "hints" in the API if you want to get lower-level. So if you happen to store 100gb of data, let Amazon decide if it's partitioned across 20 machines or 10 or 40, and distribute the work. For example, in Google's BigTable design, as a table grows it's continually partitioned into 400mb tablets. Asking for a row from a table is as simple as that, and BigTable does the job of figuring out where in the one tablet or millions of tablets it lives.

Then again, BigTable requires you to write MapReduce calls to perform a query, while SimpleDB indexes itself dynamically for you, so you win some, you lose some.