Is bit rot on hard drives a real problem? What can be done about it?

Solution 1:

First off: your file system may not have checksums, but your hard drive itself has them -- every sector is stored with error-correcting codes, and S.M.A.R.T. exposes the resulting error counters. Once one bit too many has flipped, the error can no longer be corrected, of course. And if you're really unlucky, bits can change in such a way that the checksum still comes out valid; then the error won't even be detected. So, nasty things can happen; but the claim that a single random bit flip will instantly corrupt your data is bogus.

However, yes: when you put trillions of bits on a hard drive, they won't all stay that way forever; that is a real problem! ZFS can verify a checksum every time data is read; this is similar to what your hard drive already does itself, but it's another safeguard, one that costs a little space and buys you extra resilience against data corruption.
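To make the idea concrete, here is a minimal Python sketch of filesystem-style block checksumming. It is not ZFS's actual on-disk format or algorithm (ZFS defaults to fletcher4 and can use SHA-256); it only shows the principle of storing a checksum at write time and verifying it on every read:

    import hashlib

    def write_block(data: bytes) -> tuple[bytes, bytes]:
        """Return (data, checksum) as they would be stored together on disk."""
        return data, hashlib.sha256(data).digest()

    def read_block(data: bytes, stored_checksum: bytes) -> bytes:
        """Recompute the checksum on every read and refuse to return bad data."""
        if hashlib.sha256(data).digest() != stored_checksum:
            raise IOError("checksum mismatch: block is corrupt")
        return data

    block, checksum = write_block(b"important payroll data")
    print(read_block(block, checksum))          # reads back fine

    rotted = bytearray(block)
    rotted[3] ^= 0x01                           # simulate a single flipped bit
    try:
        read_block(bytes(rotted), checksum)
    except IOError as err:
        print("detected:", err)                 # the flip is caught, not silently returned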

With a good enough file system, the probability of an error going undetected becomes so low that you no longer have to worry about it, and you might decide that checksums built into your application's own data storage format are unnecessary.

Either way: no, it's not impossible to detect.

But a file system, by itself, can never guarantee that every failure can be recovered from; it's not a silver bullet. You still need backups and a plan for what to do when an error has been detected.

Solution 2:

Yes, it is a problem, mainly as drive sizes go up. Most SATA drives have a URE (unrecoverable read error) rate of 1 in 10^14 bits. In other words, for roughly every 12 TB of data read, the vendor statistically expects the drive to return a read failure (you can normally look the figure up on the drive spec sheet). The drive will continue to work just fine for all other parts of the platter. Enterprise FC and SCSI drives, along with a small number of SATA drives, are generally rated at 1 in 10^15 (roughly 120 TB), which helps reduce the risk.
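As a rough sanity check on those numbers, you can work out the chance of hitting at least one URE while reading a given amount of data, treating each bit read as an independent trial at the quoted rate (a simplification; real failures cluster and the vendor figure is a worst-case spec):

    import math

    def p_at_least_one_ure(bytes_read: float, ure_rate_bits: float = 1e14) -> float:
        """Chance of at least one unrecoverable read error while reading
        bytes_read bytes, with each bit an independent 1-in-ure_rate_bits trial."""
        bits = bytes_read * 8
        return -math.expm1(bits * math.log1p(-1.0 / ure_rate_bits))

    for tb in (1, 4, 12):
        p = p_at_least_one_ure(tb * 1e12)
        print(f"reading {tb:>2} TB: ~{p:.0%} chance of hitting a URE at 1 per 10^14 bits")

Reading a full 12 TB end to end at that rate comes out at roughly a 60% chance of at least one URE, which is why full-array rebuilds are the dangerous moment.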

I've never seen two disks stop rotating at the exact same time, but I have had a RAID 5 volume hit this issue (5 years ago, with 5400 RPM consumer PATA drives). One drive fails, it's marked dead, and a rebuild to the spare drive begins. The problem is that during the rebuild a second drive is unable to read one little block of data. Depending on who's doing the RAID, either the entire volume is declared dead or just that little block. Assuming only that one block is dead, reading it will return an error, but writing to it will make the drive remap the sector to another location.

There are multiple ways to protect against this:

- RAID 6 (or equivalent), which protects against double disk failure, is the best option.
- A URE-aware filesystem such as ZFS.
- Smaller RAID groups, so statistically you have a lower chance of hitting the drives' URE limits (mirror large drives, RAID 5 smaller drives).
- Disk scrubbing plus SMART monitoring (see the sketch below). This isn't really protection in itself, but is used in addition to one of the methods above.
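The scrubbing idea is simple enough to sketch. The following is a toy illustration of the concept only, not how any real controller, md, or ZFS implements it: walk every block while the array is still healthy, and when a read fails, reconstruct the block from redundancy and rewrite it so the drive remaps the sector.

    import random

    class FakeDisk:
        """Toy disk: a handful of blocks, one of which has silently rotted."""
        def __init__(self, nblocks: int, bad_block: int):
            self.blocks = [f"data-{i}".encode() for i in range(nblocks)]
            self.bad = {bad_block}

        def read(self, i: int) -> bytes:
            if i in self.bad:
                raise IOError(f"URE at block {i}")
            return self.blocks[i]

        def write(self, i: int, data: bytes) -> None:
            # A real drive would remap the failing sector on rewrite.
            self.blocks[i] = data
            self.bad.discard(i)

    def scrub(disk: FakeDisk, reconstruct) -> None:
        """Read every block; repair any that fail, using redundancy held elsewhere."""
        for i in range(len(disk.blocks)):
            try:
                disk.read(i)
            except IOError as err:
                print(f"scrub: {err}, rewriting block from parity/mirror")
                disk.write(i, reconstruct(i))

    disk = FakeDisk(nblocks=8, bad_block=random.randrange(8))
    rebuild_from_redundancy = lambda i: f"data-{i}".encode()   # stand-in for parity/mirror
    scrub(disk, rebuild_from_redundancy)   # finds and repairs the latent URE
    scrub(disk, rebuild_from_redundancy)   # second pass: clean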

I manage close to 3000 spindles in arrays, and the arrays are constantly scrubbing the drives looking for latent UREs. I receive a fairly constant stream of them (every time a scrub finds one, it fixes it ahead of a drive failure and alerts me). If I were using RAID 5 instead of RAID 6 and one of the drives went completely dead... I'd be in trouble if the URE hit certain locations.


Solution 3:

Hard drives do not generally encode data bits as single magnetic domains -- hard drive manufacturers have always been aware that magnetic domains could flip, and build error detection and correction into their drives.

If a bit flips, the drive contains enough redundant data that it can and will be corrected the next time that sector is read. You can see this if you check the SMART stats on the drive, as the 'Correctable error rate'.

Depending on the details of the drive, it should even be able to recover from more than one flipped bit in a sector. There will be a limit to the number of flipped bits that can be silently corrected, and probably another limit to the number of flipped bits that can be detected as an error (even if there is no longer enough reliable data to correct them).
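As a toy illustration of those two limits, here is an extended Hamming (SECDED) code in Python: it silently corrects any single flipped bit and detects, but cannot correct, two flipped bits. Real drives use far stronger codes (Reed-Solomon or LDPC over an entire sector), but the correct-some/detect-more behaviour is the same in spirit.

    def secded_encode(nibble: int) -> list:
        """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
        d = [(nibble >> i) & 1 for i in range(4)]
        c = [0] * 8                      # positions 1..7: Hamming(7,4); position 0: overall parity
        c[3], c[5], c[6], c[7] = d
        c[1] = c[3] ^ c[5] ^ c[7]
        c[2] = c[3] ^ c[6] ^ c[7]
        c[4] = c[5] ^ c[6] ^ c[7]
        c[0] = sum(c[1:]) % 2
        return c

    def secded_decode(c: list) -> str:
        """Return 'ok', 'corrected' (single bit fixed) or 'uncorrectable' (double bit)."""
        c = c[:]
        syndrome = 0
        for p in (1, 2, 4):              # each parity bit covers positions with bit p set
            if sum(c[i] for i in range(1, 8) if i & p) % 2:
                syndrome |= p
        overall = sum(c) % 2
        if syndrome == 0 and overall == 0:
            return "ok"
        if overall == 1:                 # odd number of flips: assume one, fix it
            c[syndrome if syndrome else 0] ^= 1
            return "corrected"
        return "uncorrectable"           # even number of flips with bad syndrome: detect only

    word = secded_encode(0b1011)
    one_flip = word[:];  one_flip[6] ^= 1
    two_flips = word[:]; two_flips[3] ^= 1; two_flips[6] ^= 1
    print(secded_decode(word))        # ok
    print(secded_decode(one_flip))    # corrected
    print(secded_decode(two_flips))   # uncorrectable (detected, but can't be fixed)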

This all adds up to the fact that hard drives can automatically correct most errors as they happen, and can reliably detect most of the rest. You would have to have a large number of bit errors in a single sector, all occurring before that sector was next read, and the errors would have to be such that the internal error-correction codes still see the sector as valid data, before you would ever get a silent failure. It's not impossible, and I'm sure that companies operating very large data centres do see it happen (or rather, it occurs and they don't see it happen), but it's certainly not as big a problem as you might think.


Solution 4:

Modern hard drives (since 199x) have not only checksums but also ECC, which can detect and correct quite a bit of "random" bit rot. See: http://en.wikipedia.org/wiki/S.M.A.R.T.

On the other hand, certain bugs in firmware and device drivers can also corrupt data on rare occasions (otherwise QA would catch them), which is hard to detect if you don't have higher-level checksums. Early device drivers for SATA and NICs corrupted data on both Linux and Solaris.

ZFS checksums mostly aim at bugs in lower-level software. Newer storage/database systems like Hypertable also have checksums for every update to guard against bugs in filesystems :)
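The end-to-end idea is easy to sketch. This is just an illustrative example, not Hypertable's actual record format: append a CRC32 to each record at write time and verify it when the record is read back, so corruption introduced anywhere underneath (driver, filesystem, drive) is still caught.

    import struct
    import zlib

    def pack_record(payload: bytes) -> bytes:
        """length (4 bytes) + payload + CRC32 of the payload (4 bytes)."""
        return struct.pack(">I", len(payload)) + payload + struct.pack(">I", zlib.crc32(payload))

    def unpack_record(blob: bytes) -> bytes:
        (length,) = struct.unpack(">I", blob[:4])
        payload = blob[4:4 + length]
        (crc,) = struct.unpack(">I", blob[4 + length:8 + length])
        if zlib.crc32(payload) != crc:
            raise ValueError("record failed CRC check: corrupted somewhere below us")
        return payload

    rec = pack_record(b"UPDATE accounts SET balance = 42")
    print(unpack_record(rec))              # round-trips cleanly

    damaged = bytearray(rec)
    damaged[10] ^= 0x20                    # corrupt one byte of the payload in "storage"
    try:
        unpack_record(bytes(damaged))
    except ValueError as err:
        print(err)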


Solution 5:

Theoretically, this is cause for concern. Practically speaking, this is part of the reason that we keep child/parent/grandparent backups. Annual backups need to be kept for at least 5 years, IMO, and if you've got a case of this going back farther than that, the file is obviously not that important.

Unless you're dealing with bits that could potentially liquefy someone's brain, I'm not sure the risk vs. reward is quite up to the point of changing file systems.