bit rot detection and correction with mdadm

Frankly, I find it rather surprising that you'd reject RAIDZ2 ZFS. It seems to suit your needs almost perfectly, except for the fact that it isn't Linux MD. I'm not on a crusade to bring ZFS to the masses, but the simple fact is that yours is one of the kinds of problems that ZFS was designed from the ground up to solve. Relying on RAID (any "regular" RAID) to provide error detection and correction, possibly in a reduced- or no-redundancy situation, seems risky. Even in situations where ZFS cannot correct a data error properly, it can at least detect the error and let you know that there is a problem, allowing you to take corrective action.

You don't have to do regular full scrubs with ZFS, although it is recommended practice. ZFS verifies, as data is read, that what comes off the disk matches what was written, and in the case of a mismatch will either (a) use redundancy to reconstruct the original data, or (b) report an I/O error to the application. Also, scrubbing is a low-priority, online operation, quite different from a file system check in most file systems, which can be both high-priority and offline. If you're running a scrub and something other than the scrub wants to do I/O, the scrub will take the back seat for the duration. A ZFS scrub takes the place of both a RAID scrub and a file system metadata and data integrity check, so it is a lot more thorough than just scrubbing the RAID array to detect bit rot (which doesn't tell you whether the data makes any sense, only that it was written consistently by the RAID controller).
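For example, a scrub is a single command (the pool name "tank" below is only a placeholder):

    # start a low-priority, online scrub of the pool
    zpool scrub tank

    # check progress and any errors found so far
    zpool status tank

    # a running scrub can be stopped if it ever does get in the way
    zpool scrub -s tank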

ZFS redundancy (RAIDZ, mirroring, ...) has the advantage that unused disk locations don't need to be checked for consistency during scrubs; only actual data is checked, as the tools walk the allocation block chain. This is the same as with a non-redundant pool. With "regular" RAID, by contrast, the entire disk (including any unused locations) must be checked, because the RAID controller (whether hardware or software) has no idea which data is actually relevant.

With RAIDZ2 vdevs, up to two constituent drives per vdev can fail before a further drive failure puts you at risk of actual data loss, since you have two drives' worth of redundancy. This is essentially the same as RAID6.
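As a sketch, a six-drive RAIDZ2 pool would be created roughly like this (the pool name and device paths are placeholders; stable /dev/disk/by-id paths are generally preferable to /dev/sdX names):

    # one six-disk RAIDZ2 vdev: two drives' worth of parity, like RAID6
    zpool create tank raidz2 \
        /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
        /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4 \
        /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6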

In ZFS all data, both user data and metadata, is checksummed (unless you explicitly disable it, which is recommended against), and these checksums are used to confirm that the data hasn't changed for any reason. Again, if a checksum does not match the expected value, the data will either be transparently reconstructed from redundancy or an I/O error will be reported. If an I/O error is reported, or a scrub identifies a file with corruption, you will know for a fact that the data in that file is potentially corrupted and can restore that specific file from backup; there is no need for a full array restore.
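In practice (again with a placeholder pool name), the verbose status output lists the individual files affected by uncorrectable errors, so only those files need to be restored:

    # list permanent (uncorrectable) errors, including the affected file paths
    zpool status -v tank

    # after restoring the affected files from backup, reset the error counters
    zpool clear tank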

Plain RAID, even double-parity RAID, doesn't protect you against situations where, for example, one drive has failed and another reads data incorrectly off the disk. Suppose one drive has failed and there's a single bit flip in a read from any one of the remaining drives: suddenly you have undetected corruption, and unless you're happy with that you'll need a way to at least detect it. The way to mitigate that risk is to checksum each block on disk and make sure the checksum cannot be corrupted along with the data (protecting against errors like high-fly writes, orphan writes, writes to incorrect locations on disk, and so on), which is exactly what ZFS does as long as checksumming is enabled.
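Checksumming is controlled per dataset and is on by default; something along these lines will confirm it hasn't been disabled (the pool and dataset names are examples):

    # verify that checksumming is still enabled
    zfs get checksum tank

    # optionally select a stronger checksum algorithm for a dataset
    zfs set checksum=sha256 tank/important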

The only real downside is that you cannot easily grow a RAIDZ vdev by adding devices to it. There are workarounds, usually involving things like sparse files as devices in a vdev, and very often described as "I wouldn't do this if it were my data". Hence, if you go the RAIDZ route (regardless of whether you pick RAIDZ, RAIDZ2 or RAIDZ3), you need to decide up front how many drives you want in each vdev. Although the number of drives in a vdev is fixed, you can grow a vdev by gradually replacing the drives with larger-capacity ones (making sure to stay within the redundancy threshold of the vdev) and allowing a complete resilver after each replacement.
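A rough outline of that grow-by-replacement procedure, with placeholder pool and device names:

    # let the pool expand automatically once every member has been replaced
    zpool set autoexpand=on tank

    # swap in one larger drive at a time; wait for each resilver to finish
    # before touching the next drive, so redundancy is never exceeded
    zpool replace tank /dev/disk/by-id/ata-OLD1 /dev/disk/by-id/ata-NEW1
    zpool status tank    # repeat for the remaining drives once resilvering completes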


This answer is the product of reasoning based on the various bits of evidence I've found. I don't know how the Linux kernel implementation works, as I am not a kernel developer and there seems to be a fair amount of nonsensical misinformation out there. I presume that the Linux kernel makes sane choices. My answer should apply unless I am mistaken.

Many drives use ECCs (error-correcting codes) to detect read errors. If data is corrupt, the kernel should receive a URE (unrecoverable read error) for that block from an ECC-supporting drive. Under these circumstances (and there is an exception below), copying corrupt, or empty, data over good data would amount to insanity. In this situation the kernel should know which data is good and which is bad. According to the "It is 2010 and RAID5 still works …" article:

Consider this alternative, that I know to be used by at least a couple of array vendors. When a drive in a RAID volume reports a URE, the array controller increments a count and satisfies the I/O by rebuilding the block from parity. It then performs a rewrite on the disk that reported the URE (potentially with verify) and if the sector is bad, the microcode will remap and all will be well.

However, now for the exception: if a drive does not support ECC, a drive lies about data corruption, or the firmware is particularly dysfunctional, then a URE may not be reported, and corrupted data would be handed to the kernel. In the case of mismatching data: it seems that if you are using a 2-disk RAID1, or a RAID5, then the kernel can't know which data is correct, even in a non-degraded state, because there is only one parity block (or mirror copy) and no URE was reported. In a 3-disk RAID1 or a RAID6, a single corrupted, non-URE-flagged block would not match the redundant parity (in combination with the other associated blocks), so proper automatic recovery should be possible.
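With Linux MD this is exactly what the check/repair actions expose (md0 is just an example device):

    # read the whole array and count parity/copy mismatches
    echo check > /sys/block/md0/md/sync_action

    # mismatches found by the last check
    cat /sys/block/md0/md/mismatch_cnt

    # "repair" rewrites mismatched blocks, but on a 2-disk RAID1 or RAID5 md
    # cannot tell which copy is good; it only makes the copies consistent again
    echo repair > /sys/block/md0/md/sync_action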

The moral of the story is: use drives with ECC. Unfortunately, not all drives that support ECC advertise this feature. On the other hand, be careful: I know someone who used cheap SSDs in a 2-disk RAID1 (or a 2-copy RAID10). One of the drives returned random corrupted data on each read of a particular sector. The corrupted data was automatically copied over the correct data. If the SSD had used ECC and been functioning properly, the kernel should have taken proper corrective action.
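At the very least, keeping an eye on SMART data can catch drives that are starting to misbehave; a minimal check with smartmontools (the device name is an example) might look like:

    # overall health self-assessment
    smartctl -H /dev/sda

    # vendor attributes; Reallocated_Sector_Ct and Current_Pending_Sector
    # creeping upwards are warning signs
    smartctl -A /dev/sda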


For the protection you want, I'd go with RAID6 + the normal offsite backup in 2 locations.

I personally scrub once a week anyway, and back up nightly, weekly and monthly depending on the data's importance and how quickly it changes.
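As a sketch, a weekly md scrub can be scheduled from cron (md0 is a placeholder; Debian-based systems already ship a similar job that calls /usr/share/mdadm/checkarray):

    # /etc/cron.d/md-scrub: run a check every Sunday at 03:00
    0 3 * * 0  root  echo check > /sys/block/md0/md/sync_action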

Tags: raid, mdadm