Silent disk errors and reliability of Linux swap

We trust the integrity of the data retrieved from swap because the storage hardware has checksums, CRCs, and such.

In one of the comments above, you say:

true, but it won't protect against bit flips outside of the disk itself

"It" meaning the disk's checksums here.

That is true, but SATA uses 32-bit CRCs for commands and data. Thus, you have a 1 in 4 billion chance of corrupting data undetectably between the disk and the SATA controller. That means that a continuous error source could introduce an error as often as every 125 MiB transferred, but a rare, random error source like cosmic rays would cause undetectable errors at a vanishingly small rate.

Realize also that if you've got a source that causes an undetected error at a rate anywhere near one per 125 MiB transferred, performance will be terrible because of the high number of detected errors requiring re-transfer. Monitoring and logging will probably alert you to the problem in time to avoid undetected corruption.
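If you want to keep an eye on this on an ordinary SATA setup, smartmontools is one hedged way to do it; the attribute names below vary by vendor and firmware, but a link-level CRC error counter is commonly exposed as something like UDMA_CRC_Error_Count:

    # Dump the drive's SMART attribute table (names vary by vendor/firmware).
    smartctl -A /dev/sda

    # A steadily rising interface CRC error counter (often attribute 199)
    # points at a bad cable, port, or controller long before undetectable
    # corruption becomes a realistic concern.
    smartctl -A /dev/sda | grep -i crc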

As for the storage medium's checksums, every SATA (and before it, PATA) disk uses per-sector checksums of some kind. One of the characteristic features of "enterprise" hard disks is larger sectors protected by additional data integrity features, greatly reducing the chance of an undetected error.

Without such measures, there would be no point to the spare sector pool in every hard drive: the drive itself could not detect a bad sector, so it could never swap fresh sectors in.

In another comment, you ask:

if SATA is so trustworthy, why are there checksummed file systems like ZFS, btrfs, ReFS?

Generally speaking, we aren't asking swap to store data long-term. The upper bound on how long anything lives in swap is the system's uptime, and most data in swap doesn't last nearly that long, since most of the data that goes through your system's virtual memory system belongs to much shorter-lived processes.

On top of that, uptimes have generally gotten shorter over the years, what with the increased frequency of kernel and libc updates, virtualization, cloud architectures, etc.

Furthermore, most data in swap is inherently disused in a well-managed system, being one that doesn't run itself out of main RAM. In such a system, the only things that end up in swap are pages that the program doesn't use often, if ever. This is more common than you might guess. Most dynamic libraries that your programs link to have routines in them that your program doesn't use, but they had to be loaded into RAM by the dynamic linker. When the OS sees that you aren't using all of the program text in the library, it swaps it out, making room for code and data that your programs are using. If such swapped-out memory pages are corrupted, who would ever know?
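If you want to see how much of each process's memory is sitting in swap on your own system, one rough, hedged way is to read the VmSwap field that the kernel exposes in /proc (a sketch; field availability varies a little by kernel version):

    # Report swapped-out memory per process, largest first.
    # VmSwap appears in /proc/<pid>/status on reasonably recent kernels.
    for status in /proc/[0-9]*/status; do
        awk '/^Name:/   { name = $2 }
             /^VmSwap:/ { if ($2 > 0) printf "%8d kB  %s\n", $2, name }' "$status" 2>/dev/null
    done | sort -rn | head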

Contrast this with the likes of ZFS where we expect the data to be durably and persistently stored, so that it lasts not only beyond the system's current uptime, but also beyond the life of the individual storage devices that comprise the storage system. ZFS and such are solving a problem with a time scale roughly two orders of magnitude longer than the problem solved by swap. We therefore have much higher corruption detection requirements for ZFS than for Linux swap.

ZFS and such differ from swap in another key way here: we don't RAID swap areas together. When multiple swap devices are in use on a single machine, it's a JBOD scheme, not like RAID-0 or higher. (e.g. macOS's chained swap files scheme, Linux's swapon, etc.) Since the swap devices are independent, rather than interdependent as with RAID, we don't need extensive checksumming because replacing a swap device doesn't involve looking at other interdependent swap devices for the data that should go on the replacement device. In ZFS terms, we don't resilver swap devices from redundant copies on other storage devices.
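For illustration, here is a hedged sketch of what that independence looks like on Linux; the device names are placeholders, and distinct priorities are used so the devices are filled in order rather than interleaved:

    # Two independent swap devices; each one stands alone, unlike RAID members.
    mkswap /dev/sdb2
    mkswap /dev/sdc2

    # Higher priority is used first; neither device holds data the other needs.
    swapon -p 10 /dev/sdb2
    swapon -p 5  /dev/sdc2

    # List the active swap areas and their priorities.
    swapon --show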

All of this does mean that you must use a reliable swap device. I once used a $20 external USB HDD enclosure to rescue an ailing ZFS pool, only to discover that the enclosure was itself unreliable, introducing errors of its own into the process. ZFS's strong checksumming saved me here. You can't get away with such cavalier treatment of storage media with a swap file. If the swap device is dying, and is thus approaching that worst case where it could inject an undetectable error every 125 MiB transferred, you simply have to replace it, ASAP.

The overall sense of paranoia in this question devolves to an instance of the Byzantine generals problem. Read up on that, ponder the 1982 date on the academic paper describing the problem to the computer science world, and then decide whether you, in 2019, have fresh thoughts to add to this problem. And if not, then perhaps you will just use the technology designed by three decades of CS graduates who all know about the Byzantine Generals Problem.

This is well-trod ground. You probably can't come up with an idea, objection, or solution that hasn't already been discussed to death in the computer science journals.

SATA is certainly not utterly reliable, but unless you are going to join academia or one of the kernel development teams, you are not going to be in a position to add materially to the state of the art here. These problems are already well in hand, as you've already noted: ZFS, btrfs, ReFS... As an OS user, you simply have to trust that the OS's creators are taking care of these problems for you, because they also know about the Byzantine Generals.

It is currently not practical to put your swap file on top of ZFS or Btrfs, but if the above doesn't reassure you, you could at least put it atop xfs or ext4. That would be better than using a dedicated swap partition.
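For reference, a minimal sketch of doing exactly that (the path and size are placeholders; on some filesystems you may need dd rather than fallocate so the file has no holes):

    # Create and enable a 4 GiB swap file on an existing ext4/xfs filesystem.
    fallocate -l 4G /swapfile        # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
    chmod 600 /swapfile              # a swap file must not be world-readable
    mkswap /swapfile
    swapon /swapfile

    # Make it persistent across reboots.
    echo '/swapfile none swap defaults 0 0' >> /etc/fstab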


Swap has ??? <--- this is my question

Swap is still not protected in Linux (but see UPD).

Well, of course, ZFS on Linux is capable of serving as swap storage, but it can still lock up under some circumstances, which effectively takes that option off the table.

Btrfs still can't handle swap files. The developers mention the possible use of a loopback device (sketched below), although it's noted to perform poorly. There are unclear indications that Linux 5 could finally have it…
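The loopback workaround, for what it's worth, looks roughly like this; it's a hedged sketch of the historically suggested approach, with placeholder paths, and the performance caveat above still applies:

    # Btrfs workaround: put the swap file behind a loop device.
    # The nocow attribute (+C) must be set while the file is still empty.
    touch /var/swap.img
    chattr +C /var/swap.img
    dd if=/dev/zero of=/var/swap.img bs=1M count=4096

    LOOP=$(losetup -f --show /var/swap.img)   # e.g. /dev/loop0
    mkswap "$LOOP"
    swapon "$LOOP"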

Patches to protect conventional swap itself with checksums never made it into the mainline kernel.

So, all in all: nope. Linux still has a gap there.

UPD.: As @sourcejedi points out, there is such a tool as dm-integrity. Since version 4.12, the Linux kernel has had a device-mapper target that can be used to provide checksums for any general block device, and devices used for swap are no exception. The tooling isn't broadly incorporated into the major distros yet, and most of them have no support for it in the udev subsystem, but eventually this should change. When paired with a redundancy provider, say with MD (aka Linux Software RAID) layered on top of it, it should be possible not only to detect bit rot but also to re-route the I/O request to healthy data, because dm-integrity would indicate that there's a problem and MD would handle it.
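A hedged sketch of what such a stack could look like, with placeholder device names; integritysetup comes from the cryptsetup project, and its non-default options are deliberately left out here, so check integritysetup(8) and the dm-integrity documentation before relying on anything like this:

    # Put dm-integrity under each RAID member, so that a checksum mismatch
    # surfaces as a read error which MD can then satisfy from the mirror.
    integritysetup format /dev/sdb2             # destroys existing data
    integritysetup format /dev/sdc2
    integritysetup open   /dev/sdb2 swap_int_a  # /dev/mapper/swap_int_a
    integritysetup open   /dev/sdc2 swap_int_b  # /dev/mapper/swap_int_b

    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          /dev/mapper/swap_int_a /dev/mapper/swap_int_b

    mkswap /dev/md0
    swapon /dev/md0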


dm-integrity

See: Documentation/device-mapper/dm-integrity.txt

dm-integrity would normally be used in journalling mode. In the case of swap, you could arrange to do without the journalling. This could significantly lower the performance overhead. I am not sure whether you would need to reformat the swap-over-integrity partition on each boot, to avoid catching errors after an unclean shutdown.
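If it does turn out that a fresh start on every boot is the safer choice, a hedged sketch of such a boot-time script might look like this (placeholder device; note that formatting wipes the data area, which can take a while on a large partition, and any journalling-related options should be checked against integritysetup(8) for your version):

    #!/bin/sh
    # Recreate the integrity-protected swap area from scratch on every boot,
    # so stale checksums left over from an unclean shutdown never get reported
    # as errors during normal operation.
    set -e
    DEV=/dev/sdb2                        # placeholder: partition dedicated to swap

    integritysetup format "$DEV"         # re-initializes all integrity tags
    integritysetup open   "$DEV" swap_int
    mkswap /dev/mapper/swap_int
    swapon /dev/mapper/swap_int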

In the initial announcement of dm-integrity, the author states a preference for "data integrity protection on the higher level" instead. In the case of swap, that would open the possibility of storing the checksums in RAM. However, that option would both require non-trivial modifications to the current swap code, and increase memory usage. (The current code tracks swap efficiently using extents, not individual pages / sectors).


DIF/DIX?

DIX support was added by Oracle in Linux 2.6.27 (2008).

Does using DIX provide end-to-end integrity?

You could consult your vendor. I don't know how you could tell if they are lying about it.

DIX is required to protect data in flight between the OS (operating system) and the HBA (host bus adapter).

DIF on its own increases protection for data in flight between the HBA and the storage device. (See also: presentation with some figures about the difference in error rates).

Precisely because the checksum in the guard field is standardized, it is technically possible to implement DIX commands without providing any protection for data at rest: just have the HBA (or storage device) regenerate the checksum at read time. This possibility was made quite clear by the original DIX project:

  • DIF/DIX are orthogonal to logical block checksums
    • We still love you, btrfs!
    • Logical block checksum errors are used for detection of corrupted data
    • Detection happens at READ time
    • ... which could be months later, original buffer is lost
    • Any redundant copies may also be bad if original buffer was garbled
  • DIF/DIX are about proactively preventing corruption
    • Preventing bad data from being stored on disk in the first place
    • ... and finding out about problems before the original buffer is erased from memory

-- lpc08-data-integrity.pdf from oss.oracle.com

One of their early postings about DIX mentions the possibility of using DIX between OS and HBA even when the drive does not support DIF.

Complete mendacity is relatively unlikely in "enterprise" contexts where DIX is currently used; people would notice it. Also, DIF was based on existing hardware that could be formatted with 520-byte sectors. The protocol for using DIF allegedly requires that you first reformat the drive; see e.g. the sg_format command.
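For example, reformatting a SCSI drive with protection information enabled looks roughly like the following; the device name is a placeholder, the operation is destructive, and the exact options and supported protection types should be checked against sg_format(8) and the drive's documentation:

    # Query the current capacity/format, including whether protection
    # information is already enabled on the drive.
    sg_readcap --long /dev/sdX

    # Reformat with protection information (DIF); --fmtpinfo selects the
    # protection type. All data on the drive is destroyed.
    sg_format --format --fmtpinfo=2 /dev/sdX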

What is more likely is an implementation that does not follow the true end-to-end principle. To give one example, a vendor is mentioned which supports a weaker checksum option for DIX to save CPU cycles, which is then replaced by a stronger checksum further down in the stack. This is useful, but it is not complete end-to-end protection.

Alternatively, an OS could generate its own checksums and store them in the application tag space. However there is no support for this in current Linux (v4.20). The comment, written in 2014, suggests this might be because "very few storage devices actually permit using the application tag space". (I am not certain whether this refers to the storage device itself, the HBA, or both).

What sort of DIX devices are available that work with Linux?

The separation of the data and integrity metadata buffers as well as the choice in checksums is referred to as the Data Integrity Extensions [DIX]. As these extensions are outside the scope of the protocol bodies (T10, T13), Oracle and its partners are trying to standardize them within the Storage Networking Industry Association.

-- v4.20/Documentation/block/data-integrity.txt
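That same kernel documentation also describes a sysfs interface which shows whether Linux has registered an integrity profile for a given block device; a hedged way to poke at it (attribute names as given in data-integrity.txt, device name is a placeholder):

    # If the block device has an integrity profile, this directory exists.
    ls /sys/block/sda/integrity/ 2>/dev/null || echo "no integrity profile"

    cat /sys/block/sda/integrity/format          # protection format in use, if any
    cat /sys/block/sda/integrity/read_verify     # 1 = verify protection info on read
    cat /sys/block/sda/integrity/write_generate  # 1 = generate protection info on write
    cat /sys/block/sda/integrity/tag_size        # bytes of tag space per sector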

Wikipedia tells me DIF is standardized in NVMe 1.2.1. For SCSI HBAs, it seems a bit difficult to pin this down if we don't have a standard to point to. At the moment it might be most precise to talk about "Linux DIX" support :-). There are devices available:

SCSI T10 DIF/DIX [sic] is fully supported in Red Hat Enterprise Linux 7.4, provided that the hardware vendor has qualified it and provides full support for the particular HBA and storage array configuration. DIF/DIX is not supported on other configurations, it is not supported for use on the boot device, and it is not supported on virtualized guests.

At the current time, the following vendors are known to provide this support...

-- RHEL 7.5 Release Notes, Chapter 16. Storage

All the hardware mentioned in the RHEL 7.5 release notes is Fibre Channel.

I don't know this market. It sounds like DIX might become more widely available in servers in future. I don't know any reason why it would become available for consumer SATA disks; as far as I know there isn't even a de facto standard for the command format. I'll be interested to see if it becomes available more widely on NVMe.