How beneficial are self-healing filesystems for general usage?

Solution 1:

Yes, a functional checksummed filesystem is a very good thing. However, the real motivation is not the mythical "bitrot" which, while it does happen, is very rare. Rather, the main advantage is that such a filesystem provides an end-to-end data checksum, actively protecting you from erroneous disk behavior such as misdirected writes and data corruption caused by the disk's own private DRAM cache failing and/or misbehaving due to power supply problems.

I experienced that issue first hand, when a Linux RAID 1 array went bad due to a power supply issue. The cache of one disk started corrupting data, and the ECC embedded in the disk sectors themselves did not catch anything, simply because the data were already corrupted by the time they were written and the ECC was calculated on the corrupted data themselves.

Thanks to its checksummed journal, which detected something strange and suspended the filesystem, XFS limited the damage; however, some files/directories were irremediably corrupted. As this was a backup machine facing no immediate downtime pressure, I rebuilt it with ZFS. When the problem recurred, ZFS corrected the affected blocks during the first scrub by reading the good copies from the other disks. Result: no data loss and no downtime. These are two very good reasons to use a checksumming filesystem.

It's worth noting that data checksumming is so valuable that a device mapper target providing it (by emulating the T-10 DIF/DIX specs), called dm-integrity, was developed precisely to extend this protection to classical block devices (especially redundant ones such as RAID 1/5/6). By virtue of the Stratis project, it is going to be integrated into a comprehensive management CLI/API.
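As a rough sketch, a standalone dm-integrity device can be set up with the integritysetup tool from the cryptsetup project; the device path and mapping name below are placeholders:

```shell
# Format a block device with integrity metadata (DESTROYS existing data).
# /dev/sdX and "protected" are placeholders for your device and mapping name.
integritysetup format /dev/sdX
# Open it as a mapped device; reads that fail the checksum return I/O errors
# instead of silently handing back bad data.
integritysetup open /dev/sdX protected
# The checksummed device appears as /dev/mapper/protected and can be used
# as a RAID member or formatted directly.
mkfs.xfs /dev/mapper/protected
```

Used under md RAID, the resulting I/O error is what lets md know which mirror leg to distrust.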

However, you have a point that any potential advantage brought by such filesystems should be weighed against the disadvantages they inherit. ZFS's main problem is that it is not mainlined into the standard kernel, but it is otherwise very fast and stable. BTRFS, on the other hand, while mainlined, has many important issues and performance problems (the common suggestion for databases or VMs is to disable CoW, which in turn disables checksumming; frankly, that is not an acceptable answer). Rather than using BTRFS, I would use XFS and hope for the best, or use dm-integrity-protected devices.

Solution 2:

  1. I had a Seagate HDD that started failing checksums every time I ran zfs scrub. It failed after a few weeks. ZFS and Btrfs have checksums for both data and metadata; ext4 has only metadata checksums.

  2. Only CRC errors and metadata checksum errors are caught. Data corruption can still happen undetected.

  3. Bad sectors are not a problem: the entire disk will be marked "failed", and you still have the other disk, which is "fine". The problem is when the data has a correct CRC but is nevertheless corrupted. This is bound to happen occasionally simply because disks are so large.

Solution 3:

I have been using ZFS in production, for both servers and a home office NAS, under both Linux and FreeBSD, for over 6 years. I have found it to be stable, fast, and reliable, and I have personally seen it detect and (when able to) correct errors which a simple md device or ext4 filesystem would not have been able to detect.

However, I think I need to take a step back and try to understand whether these benefits outweigh the disadvantages (Btrfs bugs and unresolved issues, and ZFS availability and performance impact).

Regarding licensing, ZFS is open source; it's just released under the CDDL license, which is not legally compatible with the GPLv2 license that the Linux kernel is released under. Details here. This does not mean it's in a state of "licensing limbo for a while", nor does it mean there's any technical incompatibility. It simply means the mainline Linux kernel source doesn't include the modules, and they have to be retrieved separately. Note that some distros, like Debian, include ZFS in their distribution; installing ZFS on Debian / Ubuntu can normally be done with a single apt command.
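For reference, on Debian/Ubuntu the "single apt command" amounts to something like the following (package names as in recent releases; on Debian the contrib component must be enabled first):

```shell
# Install the ZFS userland tools; on Debian this pulls in zfs-dkms,
# which builds the kernel modules locally against your running kernel.
# On Ubuntu the modules ship with the kernel, so this is all you need.
apt install zfsutils-linux
```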

As for performance, given sufficient RAM, ZFS performance for me ranges from close to ext4 to surpassing it, depending on memory, available pool space, and compressibility of data. ZFS's biggest disadvantage in my opinion is memory usage: if you have less than 16 GiB of RAM for a production server, you may want to avoid ZFS. That is an overly simplified rule of thumb; there is much information online about memory requirements for ZFS. I personally run a 10TB pool and an 800GB pool, along with some backup pools, on a home office Linux system with 32GB RAM, and performance is great. This server also runs LXC and has multiple services running.
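If memory usage is a concern, the ZFS ARC's maximum size can be capped with a module parameter; the 8 GiB figure below is just an illustrative value:

```shell
# Cap the ZFS ARC at 8 GiB (8 * 2^30 bytes); adjust to your workload.
# Takes effect at module load; written to modprobe.d so it persists.
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
# Inspect the current ARC size and limit at runtime:
grep -E "^(size|c_max)" /proc/spl/kstat/zfs/arcstats
```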

ZFS features go well beyond the data checksumming and self-healing capabilities; its powerful snapshots are much better than LVM snapshots, and its inline lz4 compression can actually improve performance by reducing disk writes. I personally achieve a 1.55x savings on the 10TB pool (storing 9.76TiB of data in only 6.3TiB of space on disk).
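As a sketch, enabling lz4 and checking the resulting ratio looks like this; "tank/data" is a placeholder dataset name:

```shell
# Enable inline lz4 compression on a dataset (affects newly written data only).
zfs set compression=lz4 tank/data
# Check the achieved compression ratio for the dataset.
zfs get compressratio tank/data
```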

In my experience, ZFS performance craters when the pool reaches 75% or 80% usage, so as long as you stay below that point, performance should be more than sufficient for general home/SMB usage.

In the cases I have seen ZFS detect and correct bad data, the root cause was unclear but was likely a bad disk block. I also have ECC memory and use a UPS, so I don't believe the data was corrupted in RAM. (ECC RAM is often recommended with ZFS, but the checksums are beneficial even without it.) However, I have seen a handful (~10-15) of blocks fail checksums over the past 6 years. One major advantage of ZFS over an md RAID is that ZFS knows which files are affected by a checksum error. So in cases where a backup pool without redundancy had a checksum error, ZFS told me the exact files which were affected, allowing me to replace those files.
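The per-file error reporting comes from zpool status; "tank" below is a placeholder pool name:

```shell
# -v lists the full paths of files affected by permanent (uncorrectable) errors.
zpool status -v tank
# After restoring the affected files from backup, clear the error counters.
zpool clear tank
```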

Despite the license ZFS uses not being compatible with the Linux kernel's, installing the modules is very easy (at least on Debian) and, once familiar with the toolset, management is straightforward. Despite many people on the internet citing fear of total data loss with ZFS, I have never lost any data since making the move to ZFS, and the combination of ZFS snapshots and data checksums/redundancy has personally saved me from data loss multiple times. It's a clear win and I'll personally never go back to an md array.

Solution 4:

How likely am I to even encounter actual data corruption making files unreadable? How?

Given enough time, it's almost certain to happen. Coincidentally, it happened to me last week. My home file server developed some bad RAM that was causing periodic lockups. Eventually I decided to simply retire the machine (which was getting rather old) and moved the drives to an enclosure on a different machine. The post-import scrub found and repaired 15 blocks with checksum errors, out of an 8TB pool, which were presumably caused by the bad RAM and/or the lockups. The disks themselves had a clean bill of health from SMART, and tested fine on a subsequent scrub.
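A scrub like the post-import one described above is a one-liner; "tank" is a placeholder pool name:

```shell
# Read and verify every allocated block in the pool, repairing from
# redundancy (mirror/raidz copies) wherever possible.
zpool scrub tank
# Watch progress and see checksum errors found/repaired per device.
zpool status tank
```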

Can Ext4 or the system file manager already detect data errors on copy/move operations, making me at least aware of a problem?

No, not really. There might be application-level checksums in some file formats, but otherwise, nothing is keeping an eye out for the kind of corruption that happened in my case.
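For what it's worth, you can approximate this on ext4 yourself with a hand-rolled checksum manifest; the directory and file names below are made up for illustration:

```shell
# Record checksums at a point in time when the data is known good.
mkdir -p photos
echo "vacation 2023" > photos/img001.txt
find photos -type f -print0 | xargs -0 sha256sum > photos.manifest
# Later (e.g. after a copy/move, or periodically from cron), verify
# that every file still matches its recorded checksum.
sha256sum -c photos.manifest
```

This only detects corruption after the fact, of course; unlike ZFS it cannot repair anything, and it won't catch changes made between manifest runs.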

What happens if one of the mdadm RAID 1 drives holds different data due to one drive having bad sectors? Will I still be able to retrieve the correct file, or will the array be unable to decide which file is the correct one and lose it entirely?

If you know definitively that one drive is bad, you can fail that drive out of the array and serve all reads from the good drive (or, more sensibly, replace the bad drive, which will copy the data from the good drive onto the replacement). But if the data on the drives differs due to random bit flips on write (the kind of thing that happened to me and shodanshok) there is no definitive way to choose which of the two is correct without a checksum.
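The "no definitive way to choose" point can be illustrated with two plain files standing in for mirror legs; a checksum recorded at write time (which is what ZFS stores in the parent block pointer, and md lacks) is what breaks the tie. File names here are hypothetical:

```shell
# Two "mirror legs" holding the same block; one is silently corrupted after write.
echo "important data" > copy_a
cp copy_a copy_b
printf 'X' | dd of=copy_b bs=1 seek=3 count=1 conv=notrunc 2>/dev/null
# Without a stored checksum, md only knows the copies differ, not which is right.
# A checksum taken at write time identifies the good copy:
written_sum=$(sha256sum < copy_a | cut -d' ' -f1)   # stands in for the stored checksum
for f in copy_a copy_b; do
  if [ "$(sha256sum < "$f" | cut -d' ' -f1)" = "$written_sum" ]; then
    echo "$f matches the stored checksum"
  else
    echo "$f is corrupt"
  fi
done
```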

Also, md generally won't notice that two drives in a mirror are out of sync during normal operation — it will direct reads to one disk or the other in whatever way will get the fastest result. There is a 'check' function that will read both sides of a mirror pair and report mismatches, but only if you run it, or if your distribution is set up to run it periodically and report the results.
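The 'check' function is driven through sysfs (or via the checkarray helper script some distributions run from cron); "md0" is a placeholder array name:

```shell
# Kick off a manual consistency check of the mirror.
echo check > /sys/block/md0/md/sync_action
# After it completes, a nonzero mismatch_cnt means the legs disagreed
# somewhere; md cannot tell you which leg, or which files, were affected.
cat /sys/block/md0/md/mismatch_cnt
```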