Fault tolerance in MySQL servers

  • What is likely to break?
  • How risk-averse are you?
  • Do you need to repair the failure, or just discover it?

A "parity" check can only discover failure, hence it is not fault-tolerant.

Many people decide that the main failure point is the disk subsystem, so they use RAID. In that context, you can assume that parity can be used to repair -- but only because there is some other mechanism to say "this drive failed".

But what if the motherboard dies?

So you use replication, with a Master and a Slave sitting next to each other.

But what if the power to the building goes out? Or a tornado hits the two servers? Or ...

So you put the Slave in another data center.

(I could continue this silly story, but I won't.)

You can put two "instances" of MySQL on the same server, then have one replicate to the other. Then put a lot of spin on how great your solution is. (Never mind that practically any failure will take out both copies.)

Or you could spend a few bucks and rent Amazon space for the other server. Then you can honestly brag about the "fault-tolerance".

After Bounty Edit

Use ENGINE=InnoDB; this gives you simpler recovery from a server crash.
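
A minimal sketch of that, with a made-up table name (the only point here is the explicit ENGINE clause):

    -- Hypothetical table; the engine is what matters.
    CREATE TABLE strings_archive (
        id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        body TEXT NOT NULL
    ) ENGINE=InnoDB;

    -- Confirm which engine an existing table actually uses.
    SHOW TABLE STATUS LIKE 'strings_archive';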

After loading the data, take a dump (mysqldump or other) of the static table, and store it elsewhere. This is for "disaster recovery" from floods, meteorites, software glitches, disk failure, etc. Reloading would be manual and take some time (but you have not put limits on that).

Those are simple measures, and they effectively cover virtually all disasters. If I were to list other things that can go wrong with a MySQL setup, "parity" would not show up as part of any solution.

To finish the task you have, set up your disk with RAID-5. Three drives is the minimum. You could probably fake it with software RAID and partitions of a single drive. However, that would be useless for recovering from any kind of failure; it would merely demonstrate the use of "parity".

"Checksums" are more often used for catching (but not correcting) errors. This is typically 4-8 bytes overhead for 512-16KB bytes of data. That is not, technically, "parity", but it is more efficient.

One parity bit per byte gives you error detection, but not error correction. See SECDED for correction. That needs, for example, 8 bits on a 64-bit 'word'. Seymour Cray said "parity is for farmers", but eventually he implemented SECDED in 'core' memories. (This was in the '70s. Does your prof date back that far?)


The DBMS can be left to deal with fault detection and recovery (when configured correctly). It would be most unusual to manually implement such behaviour in the application. Indeed, that is what a software stack is for - to remove from applications those features which are common and frequently required, much as an OS looks after memory management and thread scheduling, say.

That said, you could add another column to each table that contains the character strings. It would hold a hash of the string. Retrieve the hash along with the string, recalculate it, and throw an error if the two hashes differ.
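
A minimal sketch of that idea, assuming a hypothetical messages table with a body column and using MySQL's built-in SHA2() function:

    -- Add a column for the hash (table/column names are illustrative).
    ALTER TABLE messages ADD COLUMN body_sha256 CHAR(64) NOT NULL DEFAULT '';

    -- Maintain the hash whenever the string is written.
    UPDATE messages SET body_sha256 = SHA2(body, 256);

    -- On read, recompute and compare; any row returned here failed the check.
    SELECT id
    FROM messages
    WHERE SHA2(body, 256) <> body_sha256;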

Given this is a university thesis, the professor may be trying to convey a learning point, aside from any practicality about implementation. Your long-term benefit may be to investigate possible implementations of his suggestion, rather than listing reasons that he's an idiot. Just sayin'.


There are several things here. If you want to assure data integrity, you can use an engine with that built in. MySQL's default engine, InnoDB, calculates a checksum of every page (usually 16KB of data) and will automatically shut itself down if a checksum fails (in most cases due to a hardware issue, but it could potentially be a bug in MySQL itself, or someone manually tampering with the files).
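
If you want to confirm how a particular server is set up, the relevant settings are ordinary server variables (exact defaults vary by version; crc32 is the usual default on recent releases):

    -- Which page checksum algorithm InnoDB is using.
    SHOW VARIABLES LIKE 'innodb_checksum_algorithm';

    -- Confirm that InnoDB is the engine new tables will get by default.
    SHOW VARIABLES LIKE 'default_storage_engine';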

Also, in the event of a crash, InnoDB uses its transaction log to recover committed transactions that may not have been fully synced to disk because of memory buffering.
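
How much protection that gives you depends on how aggressively the log is flushed; these two settings are worth checking (1 is the fully durable default for innodb_flush_log_at_trx_commit):

    -- How often the redo log is flushed to disk (1 = at every commit).
    SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

    -- Whether the binary log is synced to disk at commit as well.
    SHOW VARIABLES LIKE 'sync_binlog';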

The checksums guarantee the physical consistency of the data, but they do not let you recover from corruption; nor do they prevent other issues such as logical incorrectness or accidental deletion.
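
At the SQL level you can also compute a table-wide checksum on demand, for example to compare a table before and after copying it (the table name is illustrative, and the statement reads the whole table, so it is not free):

    -- Row-data checksum of a table; compare the result on source and copy.
    CHECKSUM TABLE messages;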

A replication setup allows for service redundancy (even geographically redundant service) and a certain degree of data redundancy against data loss caused by hardware. However, as replication is near-real time, it does not protect against things like accidental deletion by a user. Also, while replication in recent versions checksums the data on the wire, corruption in transit is very rarely an issue.

A relatively common issue with a MySQL slave, however, is replication errors: they can arise from unsafe/non-deterministic queries (those that could return different results on two different servers), from differing configuration, or from risky operations such as improper filtering. There are third-party tools like pt-table-checksum that allow you to compare two replicas while data is being written (and also provide user-level checksums).
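
A couple of the server-side pieces mentioned above can be inspected directly (SHOW SLAVE STATUS is the pre-8.0 statement; newer versions call it SHOW REPLICA STATUS):

    -- Checksumming of replication events on the wire (CRC32 by default on modern versions).
    SHOW VARIABLES LIKE 'binlog_checksum';

    -- Replication health on the slave (\G gives vertical output in the mysql client).
    SHOW SLAVE STATUS\G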

While something like a delayed slave may be useful to avoid those issues, the most common way to assure data survivability is performing regular backups. In particular, full backups plus binary logs allow point-in-time recovery, and you can add checksums to verify the data has not been corrupted while stored. Logical backups are slower than raw (physical) backups for large datasets, but safer if carrying over physical corruption is something one wants to avoid.
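
For reference, a delayed slave is configured on the replica itself; a minimal sketch using MySQL 5.6+ syntax (the delay is in seconds and the one-hour value is arbitrary):

    -- Keep this replica one hour behind the master, as a window to catch mistakes.
    STOP SLAVE SQL_THREAD;
    CHANGE MASTER TO MASTER_DELAY = 3600;
    START SLAVE SQL_THREAD;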

Answering your questions directly:

  • With InnoDB checksums and its transaction log, it is very difficult to end up with data corruption. Data redundancy can also be achieved at the filesystem level by using several disks in a RAID configuration (allowing for extra performance and/or data safety through mirroring or parity)
  • Use backups to provide data redundancy, replication to provide service redundancy
  • There is no difference at all in replication features between the Community and Enterprise editions, except the paid support and some utilities that may be helpful