How to avoid downtime with Linux?

Solution 1:

There is an important distinction between making a service highly available and making an individual machine highly available.

In most cases the goal is to make the service highly available; availability of individual machines is only a means toward that goal. However, there is a limit to how far you can get by improving the availability of individual machines alone.

Even if you could eliminate all the downtime caused by software updates, individual machines would still not be 100% available. Thus, to raise the availability of the service above that of the individual machines, you have to design in redundancy at a higher level. The last sentence of your question shows that, at least in principle, you know this.

If you do design a service to be more available than individual machines can deliver, there is no longer pressure to achieve high availability on individual machines. Thus, for highly available services, there is no need to avoid reboots. Instead you can sacrifice some reliability of individual machines and put the savings toward other areas where you can get much larger gains in reliability.

Once the high-level system is designed to be reliable when individual hardware components fail, live patching of kernels changes from an advantage to a risk.

It's a risk because there can be subtle differences between the behavior of a machine which was live patched and one which was booted directly into the newest kernel version. This can introduce a latent bug that causes an outage the next time a machine is rebooted. The risk is amplified by the fact that rebooting to get a clean slate is itself a common method of mitigating outages.

One day you could have an outage where you think rebooting the machine might help, but as you reboot you are hit by the latent bug, which prevents the machine from coming back in the desired state. Live patching is not the only source of such latent bugs; they can just as easily arise from something as mundane as a service that was started manually but never configured to start at boot, or one configured to start so early that it fails to come up due to unsatisfied dependencies.
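To make the "started manually but not enabled at boot" case concrete, this kind of drift can be audited before it bites. Below is a minimal sketch: the function name `boot_config_drift` and the sample service names are my own invention; in practice the two input lists would come from something like `systemctl list-units --type=service --state=running` and `systemctl list-unit-files --state=enabled`.

```python
def boot_config_drift(running_services, enabled_services):
    """Return services that are running now but would not start at boot.

    Each such service is a latent bug: the machine works fine today,
    then silently comes back degraded after its next reboot.
    """
    return sorted(set(running_services) - set(enabled_services))


# Hypothetical snapshot of one machine's state:
running = ["sshd.service", "nginx.service", "redis.service"]
enabled = ["sshd.service", "nginx.service"]

print(boot_config_drift(running, enabled))  # ['redis.service']
```

Running a check like this regularly (and before any planned reboot) turns the latent bug into an ordinary ticket instead of an outage.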

For those reasons, a highly available service may actually be easier to achieve with regular reboots of individual machines, performed at a slow enough rate that you can detect problems and pause the sequence of reboots when they occur.
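The paced, pausable reboot sequence described above can be sketched as a small orchestration loop. This is only an illustration of the control flow, not a production tool: the `reboot` and `healthy` callbacks are hypothetical hooks the operator would supply (e.g. wrapping SSH and a service health check).

```python
import time


def rolling_reboot(hosts, reboot, healthy, check_timeout=600, poll=15):
    """Reboot hosts one at a time, stopping at the first host that
    fails to come back healthy within check_timeout seconds.

    The slow, serial pace is the point: it bounds the blast radius
    of a latent bug to a single machine.
    """
    done = []
    for host in hosts:
        reboot(host)
        deadline = time.monotonic() + check_timeout
        while time.monotonic() < deadline:
            if healthy(host):
                break
            time.sleep(poll)
        else:
            # Host never came back healthy: pause the sequence here
            # so a human can investigate before more machines go down.
            return done, host
        done.append(host)
    return done, None
```

With real callbacks wired in, a return value of `(done, stuck_host)` where `stuck_host` is not `None` is the signal to stop, investigate, and only then resume the remaining hosts.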

Solution 2:

To your question, "Are there Linux distributions/processes where upgrades/patches never require reboots?": I'm not aware of any, and I'm highly doubtful that there ever will be any which are truly reboot-free. In addition to Michael Hampton's comment about why live patching is not an out-of-the-box experience anywhere, live patching also doesn't achieve the same result as rebooting.

An anecdote to illustrate this: I recently investigated a problem where one particular utility had started segfaulting on a large number of machines. I tried looking at the shared libraries which it used to see if anything recently upgraded had broken it; ldd said it wasn't an executable (even though when I pulled the same binary down to my laptop, ldd could see the shared library dependencies just fine). I tried stepping through it in gdb; it segfaulted before it even got to the first instruction.

Looking at the timing of the fault, I found that a Ksplice patch had been applied shortly beforehand. I backed out the patch and the binary stopped segfaulting; I added it back in, and the segfaults returned. Rebooting onto an equivalently patched kernel worked fine. The culprit turned out to be a patch for 32-bit support which the Ksplice folks had not applied quite correctly. To their credit, they issued a fixed patch within a few hours, and our fleet was back to working correctly without further intervention.

Another example: the Meltdown/Spectre patches were so invasive that the Ubuntu kernel team decided that live patching was impractical and required people to reboot their systems into the fixed kernel before receiving live patches again.

We run a large fleet of physical and virtual servers at work, with a large number of both Ksplice and Canonical Livepatch systems. They've both been far more reliable than a lot of other software, but I would still rather see our services designed with a reboot-friendly architecture than rely on kernel live patching.