Why are relational databases having scalability issues?

Relational databases provide solid, mature services built around the ACID properties: transaction handling, efficient logging to enable recovery, and so on. These are the core services of relational databases, and the ones they are good at. They are also hard to customize and can become a bottleneck, especially if a given application doesn't need them (e.g. serving low-importance website content; for this kind of workload the widely used MySQL was long run with the MyISAM storage engine, its default before version 5.5, which does not provide transaction handling and therefore does not satisfy ACID). Lots of "big data" problems don't require these strict constraints, for example web analytics, web search or processing moving object trajectories, as they already include uncertainty by nature.

When a single computer reaches its limits (memory, CPU, disk: the data is too big, or the processing is too complex and costly), distributing the service is a good idea. Lots of relational and NoSQL databases offer distributed storage. In this case, however, ACID turns out to be difficult to satisfy. The CAP theorem states something similar: consistency, availability and partition tolerance cannot all be guaranteed at the same time. If we give up ACID (settling for BASE, for example), scalability might be increased. See this post, e.g., for a categorization of storage methods according to CAP.

Another bottleneck might be the flexible, cleverly typed relational model itself, together with SQL operations: in lots of cases a simpler model with simpler operations (like untyped key-value stores) would be sufficient and more efficient. The common row-wise physical storage model might also be limiting: for example, it isn't optimal for data compression.
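To make the compression point concrete, here is a minimal sketch. It assumes a made-up two-column table and uses simple run-length encoding as a stand-in for whatever compression scheme a real engine applies; the names `rows`, `row_layout` and `column_layout` are purely illustrative.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical table with columns (country, status), sorted by country.
rows = [("DE", "active")] * 4 + [("FR", "active")] * 3 + [("FR", "closed")]

# Row-wise layout: the fields of one row are stored next to each other.
row_layout = [field for row in rows for field in row]

# Column-wise layout: all values of one column are stored next to each other.
column_layout = [row[0] for row in rows] + [row[1] for row in rows]

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(column_layout)

print(len(row_runs), "runs in the row-wise layout")
print(len(column_runs), "runs in the column-wise layout:", column_runs)
```

In the row-wise layout neighbouring values come from different columns, so no two adjacent values are equal and nothing collapses. In the column-wise layout equal values sit together and the same data shrinks to a handful of runs.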

There are, however, fast and scalable ACID-compliant relational databases, including newer ones like VoltDB, as relational database technology is mature, well-researched and widespread. We just have to select an appropriate solution for the given problem.


Imagine two different kinds of crossroads.

One has traffic lights or police officers regulating traffic, motion on the crossroad is at limited speed, and there's a watchdog registering precisely which car drove onto the crossroad, at what time, and in which direction it went.

The other has none of that, and everyone who arrives at the crossroad, at whatever speed he's driving, just dives in and wants to get through as quickly as possible.

The former is any traditional database engine. The crossroad is the data itself. The cars are the transactions that want to access the data. The traffic lights or police officer is the DBMS. The watchdog keeps the logs and journals.

The latter is a NOACID type of engine.

Both have a saturation point, at which point arriving cars are forced to start queueing up at the entry points. Both have a maximal throughput. That threshold lies at a lower value for the former type of crossroad, and the reason should be obvious.

The advantage of the former type of crossroad should, however, also be obvious: far less opportunity for accidents to happen. On the second type of crossroad, you can expect accidents not to happen only if traffic density stays far below the theoretical maximal throughput of the crossroad. For data management engines, this translates to a guarantee of consistent and coherent results, which only the former type of crossroad (the classical database engine, whether relational, network or hierarchical) can deliver.

The analogy can be stretched further. Imagine what happens if an accident DOES happen. On the second type of crossroad, the primary concern will probably be to clear the road as quickly as possible so traffic can resume, and when that is done, what info is still available to investigate who caused the accident and how? Nothing at all. It won't be known. The crossroad is open, just waiting for the next accident to happen. On the regulated crossroad, there's the police officer regulating the traffic, who saw what happened and can testify. There are the logs saying precisely which car entered at what time, at which entry point and at what speed, so a lot of material is available for inspection to determine the root cause of the accident. But of course none of that comes for free.

Colourful enough as an explanation?


Take the simplest example: insert a row with a generated ID. Since IDs must be unique within a table, the database must somehow lock some sort of persistent counter so that no other INSERT uses the same value. So you have two choices: either allow only one instance to write data, or use a distributed lock. Both solutions are a major bottleneck, and this is the simplest example!
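A rough single-process sketch of the first choice, where every insert goes through one lock. The `Table` class and its `insert` method are hypothetical and only stand in for a real engine; the counter here is in memory, not persistent, and threads play the role of concurrent clients.

```python
import threading


class Table:
    """Hypothetical in-memory table whose IDs come from one locked counter."""

    def __init__(self):
        self._next_id = 1
        self._rows = {}
        self._lock = threading.Lock()

    def insert(self, payload):
        # Only one writer at a time may take an ID and store its row.
        # This is what keeps IDs unique, and also what serializes all inserts.
        with self._lock:
            row_id = self._next_id
            self._next_id += 1
            self._rows[row_id] = payload
            return row_id

    def ids(self):
        with self._lock:
            return sorted(self._rows)


def client(table, name, count):
    for i in range(count):
        table.insert(f"{name}-{i}")


table = Table()
clients = [
    threading.Thread(target=client, args=(table, f"client{n}", 1000))
    for n in range(8)
]
for thread in clients:
    thread.start()
for thread in clients:
    thread.join()

all_ids = table.ids()
print(len(all_ids), "rows inserted, IDs unique:", len(set(all_ids)) == len(all_ids))
```

However many clients there are, they all queue on the same lock, so adding clients does not add insert throughput. Spreading the table over several machines means that lock has to become a distributed one, which costs a network round trip on top.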