Updating production Ubuntu boxes: the dos and don'ts
There is nothing special about patching Ubuntu vs. Windows, RHEL, CentOS, SUSE, Debian, etc.
The basic state of mind you need to be in when designing your patch procedure is to assume something will break.
Some of the basic guidelines I tend to use when designing a patch setup are:
- Always use a centralized system inside your network as the source the patches are installed from.
This may mean WSUS, or a mirror of
<your_os_here> on an internal patch management machine. Preferably one that can centrally query your individual machines and report the status of their installed patches.
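As a concrete sketch for Ubuntu (the hostname and release name here are placeholders, not anyone's real setup), clients can point their apt sources at an internal mirror built with something like apt-mirror or apt-cacher-ng:

```
# /etc/apt/sources.list - point clients at the internal mirror
# (apt.internal.example.com and "focal" are placeholder values)
deb http://apt.internal.example.com/ubuntu focal main restricted universe
deb http://apt.internal.example.com/ubuntu focal-security main restricted universe
deb http://apt.internal.example.com/ubuntu focal-updates main restricted universe
```

For the central status-reporting side, tools such as Canonical's Landscape, or apticron for simple per-host mail reports, can fill that role.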
- Pre-stage the installations - when possible - on the machines.
Where possible, have the central server copy patches down to the individual machines as they come out. This is really just a time saver: you don't have to wait for the machines to download AND install, you just kick off the install during your patch window.
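One way to pre-stage on Ubuntu (a sketch; the schedule and flags are assumptions) is a cron job that downloads, but does not install, all pending updates into apt's package cache:

```
# /etc/cron.d/patch-prestage - download pending updates nightly so the
# patch window only has to run the install step
0 1 * * * root apt-get -qq update && apt-get -qq -y --download-only dist-upgrade
```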
- Get an outage window to install the patches in: you might have to reboot, and something probably will break. Make sure the stakeholders for those systems are aware that patches are being deployed. Be prepared for the "this doesn't work" calls.
In keeping with my basic theory that patches break things, make sure your outage window is long enough to troubleshoot critical problems and possibly roll the patch back. You don't necessarily need people sitting there testing after patches. Personally, I rely heavily on my monitoring systems to tell me that everything is functioning at the very minimum level we can get away with. But also be prepared for little nagging issues to be called in as people get to work. You should always have someone scheduled to be ready to answer the phone - preferably not the person who was up till 3am patching the boxes.
- automate as much as possible
Like everything else in IT, script, script, then script some more. Script the package download, the installation start, the mirror sync. Basically you want to turn patch windows into a babysitting assignment that just needs a human there in case things break.
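As a minimal sketch of that babysitting script (the host names, root ssh access, and the DRY_RUN convention are all assumptions, not anyone's actual tooling):

```shell
#!/bin/sh
# Patch-window driver sketch: run the pre-staged install on each host.
# DRY_RUN defaults to on, so the plan can be reviewed before a real window.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"    # review mode: print instead of executing
    else
        "$@"
    fi
}

for host in web01 web02 db01; do
    run ssh "root@$host" apt-get -y dist-upgrade
done
```

Flip DRY_RUN=0 only once the printed plan looks right.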
- Have multiple windows each month
This gives you the ability to skip some servers if, for whatever reason, they can't be patched on "the appointed night". If you can't do them on night 1, require that they be free on night 2. It also keeps the number of servers patched at the same time sane.
Most importantly, keep up with the patches! If you don't, you'll find yourself having to do very large 10+ hour patch windows just to get caught up again. That introduces even more points where things could go wrong, and makes finding which patch caused an issue that much harder.
The other part of this problem is that while keeping up with patches is 'a good thing', patches are released almost daily. How many scheduled outages does one have to plan if there is a new security patch available every single day?
Patching a server once a month or once every other month is - IMHO - a very achievable and acceptable goal. Much more than that and you'll constantly be patching servers; much less and you start getting into situations where hundreds of patches need to be applied per server.
As far as how many windows you need a month? That depends on your environment. How many servers do you have? What is the required uptime for your servers?
Smaller 9x5 environments can probably get away with one patch window a month. Large 24x7 shops may need two. Very large 24x7x365 shops may need a rolling window every week, with a different set of servers patched each week.
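One hedged way to implement such a rolling schedule is to derive each host's week from a stable hash of its name, so the assignment needs no central state (the four-window count and the host names are assumptions):

```shell
#!/bin/sh
# Assign each host to one of four weekly patch windows using a stable
# hash: the sum of the hostname's byte values, modulo 4.
window_for() {
    # od prints the byte values; tr/sed turn them into a sum expression
    sum=$(( $(printf '%s' "$1" | od -An -tu1 | tr -s ' \n' '+' | sed 's/+$//') ))
    echo $(( (sum % 4) + 1 ))
}

for host in web01 web02 db01 db02; do
    echo "$host -> window $(window_for "$host")"
done
```

Because the hash depends only on the name, a host always lands in the same week, and adding a new host never reshuffles the others.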
Find a frequency that works for you and your environment.
One thing to keep in mind is that 100% up to date is an impossible goal to reach - don't let your security department tell you otherwise. Do your best, don't fall too far behind.
Things to Do:
- Take a Backup
- Make sure it's a restorable backup (though these first two are general points)
- Try to direct traffic away from the production box while you upgrade.
- Try to have an out-of-band access method in case it all goes wrong: KVM, serial console, local access, or remote-hands.
- Test on one server, make sure everything works, and only then deploy the updates to more servers
- Use Puppet if you can to ensure version numbers are the same across multiple servers. (You can also use it to force upgrades)
- On a test server, diff the current config files against the new ones installed by the update, and make sure nothing is going to break seriously. I seem to recall dpkg asking before installing new versions that differ from the currently installed ones.
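That dpkg prompt can be made non-interactive. A sketch of an apt configuration drop-in (the file name is an assumption) that keeps your locally edited conffiles and writes the packaged version alongside them, so they can be diffed afterwards:

```
# /etc/apt/apt.conf.d/90keep-conffiles - during upgrades, keep locally
# modified config files; dpkg saves the packaged version next to them
# as *.dpkg-dist so the two can be compared later
Dpkg::Options {
    "--force-confdef";
    "--force-confold";
};
```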
Things to avoid:
- Doing updates in the middle of the day, at 09:00 on a Monday morning, or at 5pm on a Friday afternoon! (thanks @3influence!)
- Upgrading MySQL on really big database servers (restart could take a long time)
- Doing all your servers at once (especially kernels)
- Doing anything that might change the network configuration under /etc/network (because you could lose connectivity)
- Automated updates that could do the above without you being there to check everything is OK.
Another point worth making: if you're used to Windows, you'll be surprised that most Linux updates do not require downtime or a reboot. Some do, such as kernel updates, but updates that require rebooting or downtime are usually flagged as such, and can be handled on a separate schedule.
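On Ubuntu that flagging is literal: packages whose postinst decides a reboot is needed (kernels, libc, and so on) drop a marker file. A small check, suitable for the end of a patch script:

```shell
#!/bin/sh
# Report whether the last round of updates requires a reboot.
# update-notifier-common creates /var/run/reboot-required on demand.
if [ -f /var/run/reboot-required ]; then
    status="reboot required"
    # the companion .pkgs file lists which packages asked for it
    cat /var/run/reboot-required.pkgs 2>/dev/null || true
else
    status="no reboot needed"
fi
echo "$status"
```

Monitoring systems can watch for the same file to catch hosts that quietly accumulated a pending reboot.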
Our Ubuntu machines are all running LTS releases.
We just automatically install all of the updates - sure it's not "best practice", but we're a relatively small shop and don't have a test/dev/production environment for every single service. The LTS updates are generally fairly well tested and minimally invasive anyway.
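On LTS the usual mechanism for this is the stock unattended-upgrades package. A sketch of the two apt drop-ins involved (the paths and origin pattern follow the package's defaults):

```
# /etc/apt/apt.conf.d/20auto-upgrades - run the unattended job daily
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";

# /etc/apt/apt.conf.d/50unattended-upgrades - restrict it to security
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};
```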
Upgrading to a new release is obviously a little more involved.
We deal with updates the following way for Ubuntu LTS systems:
- Maintain a suite of acceptance tests that check all the critical paths in our software
- Install security upgrades unattended at 4am every morning and immediately run the acceptance tests. If anything fails, an engineer is paged and has plenty of time to fix things or roll back before 9am. This has so far happened only twice in five years - LTS is well tested and stable.
- We automatically redeploy our entire infrastructure every week (on DigitalOcean) with blue/green deployments, which keeps all packages at their latest versions. If a new deploy fails the acceptance tests, it is put on hold until an engineer can debug the issue.
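Pinning the unattended run to a fixed early-morning slot can be done with a systemd timer override; a sketch, assuming a recent Ubuntu where unattended-upgrades is driven by apt-daily-upgrade.timer:

```
# /etc/systemd/system/apt-daily-upgrade.timer.d/override.conf
# pin the upgrade run to 04:00 so the acceptance tests that follow it
# finish well before business hours
[Timer]
OnCalendar=
OnCalendar=04:00
RandomizedDelaySec=0
```

The empty OnCalendar= line clears the timer's default schedule before setting the new one.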
The next logical step for us is to eliminate in-memory session information so we can simply redeploy the infrastructure every day, or even multiple times per day, without impacting customers, eliminating the second step above.
This approach is low-maintenance and avoids maintenance windows completely.