How do services with high uptime apply patches without rebooting?

Various operating systems provide utilities that allow hot-patching of running code. Examples are the kpatch and livepatch features of Linux, which allow patching the running kernel without interrupting its operation. Their capabilities are limited and they can only make trivial changes to the kernel, but this is often sufficient for mitigating a number of critical security issues until time can be found to do a proper fix. This kind of technique is generally called dynamic software updating.

I should point out, though, that sites with virtually no downtime (high availability) are reliable not because of live-patching but because of redundancy. Whenever one system goes down, a number of backups are in place that can immediately take over routing traffic or processing requests with no delay. There are many different techniques to accomplish this. The level of redundancy determines how much uptime is achieved, and uptime is measured in "nines". Three nines is 99.9% uptime, four nines is 99.99%, and so on. The "holy grail" is five nines, or 99.999% uptime. Many of the services you listed have five-nines availability due to their redundant backup systems spread throughout the world.
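To make the nines concrete, here is a small sketch of the arithmetic. The function names are made up for illustration, and the redundancy formula assumes replicas fail independently and that a single healthy replica is enough to serve traffic, which real systems only approximate.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_per_year(nines: int) -> float:
    """Allowed downtime in seconds per year for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001 (99.9% uptime)
    return SECONDS_PER_YEAR * unavailability


def combined_availability(single: float, replicas: int) -> float:
    """Availability of a group of replicas.

    Assumption: failures are independent and one healthy replica suffices,
    so the group is down only when every replica is down at once.
    """
    return 1 - (1 - single) ** replicas


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_per_year(n) / 60:.1f} minutes/year")
```

Three nines allows roughly 8.8 hours of downtime per year, four nines about 53 minutes, and five nines only about 5 minutes. Under the independence assumption, three replicas that are each only 99% available would already give about 99.9999% combined, which is why redundancy matters more than avoiding reboots.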

I watched a presentation at a security conference by a Netflix employee. They don't patch running instances at all. Instead, when a patch is required, they stand up new instances and then blow away the unpatched ones. They are doing this almost constantly. They call it red-black deployment.
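The replace-instead-of-patch idea can be sketched roughly as follows. This is a toy in-memory model, not Netflix's actual tooling: `Fleet`, `launch_fleet`, `is_healthy` and `red_black_deploy` are hypothetical names, and the launch and health-check functions stand in for whatever cloud API and monitoring a real setup would use.

```python
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """A group of identical instances built from one image version."""
    version: str
    instances: list[str] = field(default_factory=list)


def launch_fleet(version: str, size: int) -> Fleet:
    # Hypothetical stand-in for booting fresh instances from a patched image.
    return Fleet(version, [f"{version}-{i}" for i in range(size)])


def is_healthy(fleet: Fleet) -> bool:
    # Hypothetical stand-in for real health checks.
    return len(fleet.instances) > 0


def red_black_deploy(live: Fleet, new_version: str) -> Fleet:
    """Return the fleet that should receive traffic after the deploy."""
    candidate = launch_fleet(new_version, len(live.instances))
    if not is_healthy(candidate):
        candidate.instances.clear()  # discard the failed fleet
        return live                  # old fleet keeps serving
    live.instances.clear()           # terminate the unpatched instances
    return candidate
```

The old fleet keeps serving until the new one has passed its checks, so a bad build never takes traffic and no individual machine is ever patched in place.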

The short answer is:

They do reboot.

You seem to assume that Amazon and Google run on a single server, and if that is rebooted, the whole site/service is down. This is very far from the truth - large services typically run on many servers that work in parallel. For further reading, look at techniques like clustering, load balancing and failover.

Google, for example, has over a dozen data centers across the globe, and each holds a huge number of servers (estimates are 100,000-400,000 servers per center).

In such environments, updates (both feature and security updates) are typically installed as rolling deployments:

1. pick some subset of servers
2. install updates on the subset
3. reboot the subset; in the meantime the other servers take over
4. repeat with next subset :-)
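The steps above can be sketched as a small simulation. Everything here is illustrative: `Server` and `rolling_update` are made-up names, and the "install and reboot" step is reduced to setting a version field, where a real deployment would drain connections, run the package manager, reboot, and wait for health checks.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    version: str
    in_rotation: bool = True  # whether the load balancer sends it traffic


def rolling_update(servers: list[Server], new_version: str, batch_size: int) -> int:
    """Update servers batch by batch; return the fewest servers serving at once."""
    if batch_size < 1 or batch_size >= len(servers):
        raise ValueError("batch must leave at least one server in rotation")
    min_serving = len(servers)
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]  # pick some subset
        for s in batch:
            s.in_rotation = False                  # others take over
        serving = sum(s.in_rotation for s in servers)
        min_serving = min(min_serving, serving)
        for s in batch:
            s.version = new_version                # install updates + reboot
            s.in_rotation = True                   # back into rotation
    return min_serving
```

With six servers and a batch size of two, at least four servers are serving at every point, so the service as a whole never goes down even though each machine reboots.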

There are other options, such as hot patching, but they are not used as frequently in my experience, at least not on typical large websites. See forest's answer for details.
