Is it safe to disable sync_binlog, innodb_flush_log_at_trx_commit and innodb_doublewrite on a site-redundant galera cluster?

I've done some research on this now, so I'll answer my own question.

Summary first

If the performance problems are observed on only one node, it's perfectly safe to turn off innodb_doublewrite and relax innodb_flush_log_at_trx_commit on the affected node - just remember to use one of the other nodes for bootstrapping should the whole cluster fail.

# echo "[server]" >> /etc/mysql/conf.d/temp_perfomance_hax.cnf
# echo "innodb_doublewrite = 0" >> /etc/mysql/conf.d/temp_performance_hax.cnf
# echo "innodb_flush_log_at_trx_commit = 2" >> /etc/mysql/conf.d/temp_performance_hax.cnf
# systemctl restart mysql # innodb_doublewrite cannot be set run-time
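
To verify that the restart actually picked up the new values (innodb_doublewrite in particular, since it cannot be changed at run time), a quick check along these lines should do:

# mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_doublewrite'" # should now report OFF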

Also, in a true site-redundant galera cluster, it should be reasonably safe to set innodb_flush_log_at_trx_commit to 2 and sync_binlog to 0 on any number of nodes. If one is experiencing performance problems, if uptime and performance are important, and if one lost transaction probably isn't a matter of life and death - then there is no need to hesitate, just go ahead and do it:

set global sync_binlog = 0;
set global innodb_flush_log_at_trx_commit = 2;

and one can probably go ahead and turn off innodb_doublewrite as well if performance is more important than ACID compliance. The probability of disaster is pretty low.
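
Since set global only affects the running instance, the same values should also go into a config file so they survive a restart - for instance by extending the temp_performance_hax.cnf file sketched above (innodb_flush_log_at_trx_commit = 2 is already in it):

# echo "sync_binlog = 0" >> /etc/mysql/conf.d/temp_performance_hax.cnf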

I can see two reasons for sticking to the "safe" settings:

  • Preparedness for the worst-case scenario: what if the whole cluster melts down simultaneously?
  • More robustness on the nodes

Node robustness: keep innodb_doublewrite turned on

innodb_doublewrite can safely be turned off on one node without losing ACID compliance, but it may still be better to keep it turned on. According to When is it safe to disable InnoDB doublewrite buffering?, turning off innodb_doublewrite may cause corrupted data; in that case it may be necessary to wipe the data partition on the downed node and start it up again (a rough recovery sketch follows the list below). This comes with three costs:

  • manual sysadmin intervention
  • elevated levels of IO on the cluster while the corrupted node is performing an SST
  • the cluster will have to run with reduced redundancy until the node is fully operational again.
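
The recovery itself, roughly sketched - this assumes a systemd-managed mysql service and the default /var/lib/mysql datadir, so adjust paths and service names to the local setup:

# systemctl stop mysql
# mv /var/lib/mysql /var/lib/mysql.corrupt # keep the corrupted data around, just in case
# mkdir /var/lib/mysql && chown mysql:mysql /var/lib/mysql
# systemctl start mysql # the node rejoins the cluster and pulls a full SST from a donor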

(As far as I can understand, a missing flush of the log files shouldn't cause problems when starting a db node - i.e. for innodb_flush_log_at_trx_commit the manual explicitly states that "InnoDB's crash recovery works regardless of the value".)

Reasons why the whole cluster may burn down at once

Hard explosions

Consider innodb_flush_log_at_trx_commit set to 2, sync_binlog set to 0 and innodb_doublewrite set to 1. This may give performance gains, and as far as I can understand the only failure mode it doesn't cover is a quorum of nodes suffering a power loss or kernel panic at the same time. For a true multi-site setup, that seems rare enough that one may want to neglect it.

  • If the nodes are located near each other, a local problem - power outage, flooding, fire, etc. - can knock all of the nodes out at the same time. I've witnessed, multiple times, a whole server site going down for the weirdest reasons even though everything was supposed to be redundant.

  • Could solar storms affect the whole cluster simultaneously? Especially if the cluster is located in the same city and in polar regions? I've seen two laptops crash simultaneously while in the polar region during a solar storm, with no other plausible explanation (different OSes, it wasn't during a leap second, etc).

  • If all nodes run the same OS, some OS kernel problem could potentially cause all nodes to crash hard with a kernel panic at exactly the same time.

Software crashes

I believe that if innodb_flush_log_at_trx_commit is set to 0 (rather than 2), then a mysqld crash affecting all nodes simultaneously could cause data loss - with 2, every commit is at least written to the OS and thus survives a mysqld crash, while with 0 the log buffer is only written and flushed about once a second.

Has anyone ever observed galera crashing in such a way?

A sysadmin or a buggy script doing "kill -9" on mysqld on all the nodes simultaneously could have the same effect ... but if the sysadmin wants to cause data loss, he will probably find a way regardless of those settings.

Cluster going down for other reasons

In the following cases, those settings shouldn't matter:

  • Sudden and total network isolation between all three nodes, or the cluster left so fragmented that no quorum can be formed. One would need to bootstrap the cluster after such an incident, and if one really wants to be 100% sure that no transactions are lost, one needs to manually investigate the binlogs (see more on that below) - but since neither mysql nor the OS has crashed, all state will eventually be synced, no matter the sync settings.

  • Bad timing - i.e. one node going down and, while it's down, another node going down or some network isolation occurring. Afterwards, the cluster needs to be bootstrapped (see the sketch after this list). However, everything that the nodes have received will be written to disk, so the above-mentioned settings will not affect anything. (We've had this happen to us once.)
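
A rough sketch of such a bootstrap, assuming a MariaDB-style Galera setup with the galera_new_cluster helper and the default datadir - details vary between distributions:

# cat /var/lib/mysql/grastate.dat # compare seqno on every node and pick the node with the highest value
# vi /var/lib/mysql/grastate.dat # on the chosen node, set safe_to_bootstrap: 1
# galera_new_cluster # bootstrap the chosen node
# systemctl start mysql # on the remaining nodes - they will rejoin through an IST or SST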

Manual intervention is needed on cluster failures

No matter the settings, to be 100% sure that no transaction is lost after a catastrophic cluster failure, it's actually necessary to manually inspect/merge the binlogs. We need to bootstrap the cluster using only one node, but we don't know whether this node got all the changesets; this information is only available by investigating the binlogs from all nodes (with log_slave_updates turned on, one can probably make do with binlogs from a quorum of the nodes).
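
A rough sketch of such an inspection, assuming log_bin was enabled everywhere and the binlog files follow the default mysql-bin naming - the node-specific file names below are just placeholders:

# mysqlbinlog --verbose /var/lib/mysql/mysql-bin.* | tail -n 200 > /tmp/binlog-tail-$(hostname).txt # run on every node
# diff /tmp/binlog-tail-nodeA.txt /tmp/binlog-tail-nodeB.txt # compare the tails to see whether any node is missing changesets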

This means that you need to make a decision in advance - what should be the priority if we encounter a catastrophic cluster meltdown? Is the priority to get the cluster up and running again as fast as possible, or is it to make 100% sure no transactions have been lost before bootstrapping the cluster?

Your mileage may vary

If some downtime is acceptable under extreme circumstances (a full catastrophic db cluster meltdown is extreme), and if it's paramount to ensure all transactions are in place before returning to production mode, then one needs to make 100% sure that the binlogs are synced and kept, and that the backup regime is good. Syncing the innodb logs and the doublewrite buffer is then non-critical and can be turned off if that gives performance benefits.
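
A config fragment along those lines might look like this - the file name and retention period are just examples, and expire_logs_days assumes MySQL 5.x/MariaDB (newer MySQL versions use binlog_expire_logs_seconds instead):

# cat /etc/mysql/conf.d/binlog_durability.cnf
[server]
log_bin = /var/log/mysql/mysql-bin
log_slave_updates = 1
sync_binlog = 1
expire_logs_days = 14
innodb_flush_log_at_trx_commit = 2
innodb_doublewrite = 0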

If "getting the cluster up and running again as fast as possible" is important, then it's important to set innodb_flush_log_at_trx_commit = 1 and set innodb_doublewrite = 1 at a quorum of the nodes; if the cluster should go down then it can be quickly bootstrapped from any of the servers with those settings turned on. (Some consideration may be needed. On our previous meltdown node C fell down and got up again. Network glitch between node A and node B before node C got properly joined into the cluster, resulting in a shutdown. Obviously, we can bootstrap from A or B without any data loss, but not from C).