How to keep DB in sync when using microservices architecture?

Each microservice needs its own database.

A separate DB per microservice is not a prerequisite (nor a requirement, really).

You can have as many microservices as you want working on top of the same database, using different schemas for example.
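As a rough illustration (the host, database, schema and user names below are made up), two services can point at the same Postgres instance while each stays within its own schema:

    // Minimal sketch: two services sharing one Postgres database, isolated by schema.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class SchemaPerServiceExample {
        public static void main(String[] args) throws SQLException {
            // The "orders" service resolves unqualified table names against its own schema...
            try (Connection orders = DriverManager.getConnection(
                    "jdbc:postgresql://db-host:5432/appdb?currentSchema=orders",
                    "orders_svc", "secret")) {
                // ... queries for the orders bounded context go here
            }

            // ...while the "billing" service is pointed at a different schema of the same DB.
            try (Connection billing = DriverManager.getConnection(
                    "jdbc:postgresql://db-host:5432/appdb?currentSchema=billing",
                    "billing_svc", "secret")) {
                // separate credentials per service keep each bounded context out of the other's tables
            }
        }
    }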

The bounded context of a microservice should be the boundary.

Let's say we have a very high load on this service, so we choose to scale out 20x.

Scaling to X instances of the same microservice does not necessarily mean having a separate database per instance of that service.

Most databases are designed with concurrent connections, users, and transactions in mind. A single database instance (with some optimistic concurrency) can handle hundreds (if not thousands) of concurrent connections gracefully.
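To make the optimistic-concurrency point concrete, here is a minimal sketch of version-based updates over JDBC (table and column names are illustrative); many instances can write to the same database, and a losing writer simply re-reads and retries instead of holding locks:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class OptimisticUpdate {
        static boolean updateBalance(Connection conn, long accountId,
                                     long expectedVersion, long newBalance) throws SQLException {
            String sql = "UPDATE account SET balance = ?, version = version + 1 "
                       + "WHERE id = ? AND version = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, newBalance);
                ps.setLong(2, accountId);
                ps.setLong(3, expectedVersion);
                // 0 rows updated means another instance changed the row first:
                // the caller should re-read the row and retry.
                return ps.executeUpdate() == 1;
            }
        }
    }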

If you explicitly choose to have a separate DB per instance of the same service, then you will have to sync those databases up, and, most likely, data consistency will suffer for it.

Here are some suggestions:

  • Use a single database per microservice (not per instance), no matter how many instances are using it. Only consider a DB per instance when you're sure a single DB cannot handle the load.

  • Use a shared cache layer on top of the DB (Redis, for example); see the sketch after this list.

  • Use a database cluster to deal with high load/availability of databases.
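
A minimal read-through cache sketch with the Jedis client (host name, key format, TTL and the loadFromDatabase() helper are placeholders) — all instances of a service share the cache in front of the single database:

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPool;

    public class SharedCache {
        private final JedisPool pool = new JedisPool("redis-host", 6379);

        public String getProduct(String productId) {
            String key = "product:" + productId;
            try (Jedis jedis = pool.getResource()) {
                String cached = jedis.get(key);
                if (cached != null) {
                    return cached;                 // served without touching the DB
                }
                String fresh = loadFromDatabase(productId);
                jedis.setex(key, 300, fresh);      // cache for 5 minutes
                return fresh;
            }
        }

        private String loadFromDatabase(String productId) {
            return "...";                           // placeholder for the actual DB read
        }
    }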


While using the same database for multiple services may be possible, it should be avoided, as it creates tighter coupling between services than is desirable. E.g. a database downtime will affect all services sharing it, but only a single one if each service has its own database.

To avoid a "distributed monolith" of services that make synchronous calls to each other (e.g. using REST), you could work with a streaming-based approach. Each service would publish a change event whenever its data changes, and other services can subscribe to these streams. So they can react to data changes relevant to them, e.g. by storing a local version of the data (in a representation suited to their needs, e.g. just the columns they are interested in) in their own database. That way they can provide their functionality even if other services aren't available for some time. Naturally, such an architecture employs eventual-consistency semantics, but usually that's inevitable in distributed systems anyway.
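As a rough sketch of the subscriber side (broker address, topic name, consumer group and the upsert helper are all placeholders), a service can keep its own local copy of another service's data with the plain Kafka consumer API:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class CustomerChangeListener {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");     // re-read the stream on first start

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("customer-changes"));   // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // store only the columns this service cares about in its own database
                        upsertLocalCustomerCopy(record.key(), record.value());
                    }
                }
            }
        }

        private static void upsertLocalCustomerCopy(String customerId, String eventJson) {
            // placeholder: write/overwrite the local projection row for this customer
        }
    }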

One way to set up such data streams is change data capture (CDC), which tails the database's log files (e.g. the binlog in MySQL) and publishes corresponding events for each INSERT, UPDATE and DELETE. One open-source CDC tool is Debezium, which comes with connectors for MySQL, Postgres, MongoDB as well as (work-in-progress as of now) Oracle and SQL Server. It can be used with Apache Kafka as the streaming backbone, or as a library within your Java applications, allowing you to stream data changes into other streaming layers such as Pulsar or Kinesis with just a bit of code. One nice advantage of using persistent topics for the change events, e.g. with Kafka, is that new services can come up and re-read the entire change stream (depending on the topic's retention policy), or just the current state of each record, to do an initial seed of their local database.
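For the library mode, here is a minimal outline of Debezium's embedded engine; the connection details are placeholders, and a real setup needs additional connector properties (e.g. schema-history storage, whose property names depend on the Debezium version), so treat it as a sketch rather than a drop-in config:

    import io.debezium.engine.ChangeEvent;
    import io.debezium.engine.DebeziumEngine;
    import io.debezium.engine.format.Json;

    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class CdcExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("name", "customer-cdc");
            props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
            props.setProperty("database.hostname", "mysql-host");      // placeholder
            props.setProperty("database.port", "3306");
            props.setProperty("database.user", "debezium");
            props.setProperty("database.password", "secret");
            props.setProperty("database.server.id", "5400");           // unique replication client id
            props.setProperty("topic.prefix", "app");                  // logical name prefix for change topics
            props.setProperty("offset.storage",
                    "org.apache.kafka.connect.storage.FileOffsetBackingStore");
            props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
            // schema-history settings omitted; their names vary between Debezium versions

            // Each INSERT/UPDATE/DELETE read from the binlog arrives here as a change event;
            // forward it to Kafka, Pulsar, Kinesis, or apply it directly to a local store.
            DebeziumEngine<ChangeEvent<String, String>> engine =
                    DebeziumEngine.create(Json.class)
                            .using(props)
                            .notifying(event -> System.out.println(event.value()))
                            .build();

            ExecutorService executor = Executors.newSingleThreadExecutor();
            executor.execute(engine);
        }
    }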

(Disclaimer: I'm the lead of Debezium)


Going to multiple databases only trades one problem of software architecture for one of distributed coordination, which, imho, is a much more difficult problem.

People suggest using event systems, which means each service now has to have its own little solution for distributed coordination of data, and ACID goes out the window. Look at the database landscape and you'll see that this is not an easy, or completely solved, issue. And then there are distributed coordinated transactions...

There are many times where you would prefer downtime to having N databases in completely unknown, inconsistent states. The perception of uptime is also misleading: yes, your services are up, but if they have inconsistent views of the same data, or missing data (missed events), are they really functioning, or will they produce inconsistent and erroneous results?

Either you have two services that do not rely on having the same data at all, or you need a shared, consistent data layer. Copying between N DBs using event systems and hoping for the best is, well, your choice.

The questions of distribution, persistence, consistency and availability should be treated at the storage layer, not ad hoc by each service in the application layer. It takes care and specialized, dedicated knowledge from many minds to build such a system, and even then there are flavors and trade-offs (CAP theorem).

Lastly: most people look to microservices to be able to develop and evolve their applications more quickly than with monoliths. Dealing with distributed coordination and consistency of storage in each microservice will do exactly the opposite.