Disabling ext4 write barriers when using an external journal

I'm writing a new answer because, after further analysis, I don't think my previous answer (kept below for reference) is correct.

If we look at the write_dirty_buffer function, it issues a write request with the REQ_SYNC flag, but that flag does not cause a cache flush, or barrier, to be issued. The flush is accomplished by the blkdev_issue_flush call, which is appropriately gated by a check of the JBD2_BARRIER flag, and that flag is only set when the filesystem is mounted with barriers enabled.

So, looking back at checkpoint.c, barriers only matter when a transaction is dropped from the journal. The comments in the code are informative here: they tell us that this write barrier is unlikely to be necessary, but is there anyway as a safeguard. I think the assumption is that by the time a transaction is dropped from the journal, the data itself is unlikely to still be lingering in the drive's cache, not yet committed to permanent storage. But since it's only an assumption, the write barrier is issued anyway.

So why aren't barriers used when writing data to the main filesystem? I think the key here is that as long as the journal is coherent, metadata that's missing from the filesystem (e.g. because it was lost in a power-loss event) is normally recovered during journal replay, thus avoiding filesystem corruption. Furthermore, the use of data=journal should also guarantee consistency of actual filesystem data because, as I understand it, the recovery process will also write out data blocks that were committed to the journal as part of its replay mechanism.
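For completeness, here is a sketch of how that replay happens in practice. Device names are placeholders, and this assumes the external journal device is recorded in the filesystem's superblock so the tools can find it:

```shell
# After a power loss, the journal (including data blocks, when
# mounted with data=journal) is replayed implicitly at mount time:
mount -o data=journal /dev/sdb1 /mnt

# ...or explicitly beforehand with e2fsck, which replays the
# journal before checking the filesystem:
e2fsck -p /dev/sdb1
```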

So while ext4 does not actually flush disk caches at the end of a checkpoint, some steps should be taken to maximize recoverability in case of a power-loss:

  1. The filesystem should be mounted with data=journal, and not data=writeback (data=ordered is unavailable when using an external journal). This one should be obvious: we want a copy of all incoming data blocks inside the journal, since those are the ones likely to be lost in a power-loss event. This isn't expensive performance-wise, since NVMe devices are very fast.

  2. The maximum journal size of 102400 blocks (400MB when using 4K filesystem blocks) should be used, so as to maximize the amount of data that's recoverable in a journal replay. This shouldn't be an issue, since NVMe devices are at least several gigabytes in size.

  3. Problems may still arise if an unexpected shutdown happens during a write-intensive operation. If transactions get dropped from the journal device faster than the data drives are able to flush their caches on their own, unrecoverable data loss or filesystem corruption could occur.
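Precautions #1 and #2 can be put in place roughly as follows. This is an illustrative sketch only: device names are placeholders, and partitioning the NVMe device to the desired journal size is assumed to have been done already (with an external journal, the whole journal device is used, so the partition size determines the journal size):

```shell
# Precaution #2: create the external journal on a ~400MB NVMe
# partition (102400 x 4K blocks, the maximum noted above):
mke2fs -O journal_dev -b 4096 /dev/nvme0n1p1

# Point the data filesystem (on the rotating disks) at it:
mke2fs -t ext4 -b 4096 -J device=/dev/nvme0n1p1 /dev/sdb1

# Precaution #1: mount with data=journal. Disabling barriers
# (barrier=0) is the trade-off this answer is about:
mount -o data=journal,barrier=0 /dev/sdb1 /mnt
```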

So the bottom line, in my view, is that it's not 100% safe to disable write barriers, although some precautions (#1 and #2) can be taken to make this setup a little safer.


Original answer (kept for reference):

Another way to put your question is this: when doing a checkpoint, i.e. when writing the data from the journal to the actual filesystem, does ext4 flush the cache (of the rotating disks, in your case) before marking the transaction as completed and updating the journal accordingly?

If we look at the source code of jbd2 (which is responsible for handling the journalling) in checkpoint.c, we see that jbd2_log_do_checkpoint() calls, at the end:

    __flush_batch(journal, &batch_count);

which calls:

    write_dirty_buffer(journal->j_chkpt_bhs[i], REQ_SYNC);

So it seems like it should be safe.

Related: in the past, a patch to use WRITE_SYNC in journal checkpoint was also proposed. The reason was that writing the buffers had too low a priority, which caused the journal to fill up while waiting for the writes to complete.