Something is burning in the server room; how can I quickly identify what it is?

Solution 1:

The general consensus seems to be that the answer to your question comes in two parts:

How do we find the source of the funny burning smell?

You've got the "How" pretty well nailed down:

  • The "Sniff Test"
  • Look for visible smoke/haze
  • Walk the room with a thermal (IR) camera to find hot spots
  • Check monitoring and device panels for alerts

You can improve your chances of finding the problem quickly in a number of ways - improved monitoring is often the easiest. Some questions to ask:

  • Do you get temperature and other health alerts from your equipment?
  • Are your UPS systems reporting faults to your monitoring system?
  • Do you get current-draw alarms from your power distribution equipment?
  • Are the room smoke detectors reporting to the monitoring system? (and can they?)
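
As a rough illustration of the questions above, here is a minimal sketch of how such alerts might be evaluated once the readings reach your monitoring system. The `Reading` type, the device names, and the thresholds in `LIMITS` are all assumptions made up for this example; real limits come from your equipment specifications and your monitoring tool's own configuration.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    source: str   # hypothetical device name, e.g. "pdu-a3"
    kind: str     # "temp_c", "amps", "ups_fault", or "smoke"
    value: float


# Illustrative thresholds only (assumed values, not vendor limits).
LIMITS = {"temp_c": 35.0, "amps": 16.0}


def alerts(readings):
    """Return (source, message) pairs for readings that need attention."""
    found = []
    for r in readings:
        if r.kind in LIMITS and r.value > LIMITS[r.kind]:
            # Numeric reading above its configured limit.
            found.append((r.source, f"{r.kind} {r.value} exceeds {LIMITS[r.kind]}"))
        elif r.kind in ("ups_fault", "smoke") and r.value:
            # Boolean-style flags: any non-zero value counts as an alarm.
            found.append((r.source, f"{r.kind} reported"))
    return found


sample = [
    Reading("rack-12-inlet", "temp_c", 41.5),
    Reading("pdu-a3", "amps", 9.2),
    Reading("ups-1", "ups_fault", 1),
    Reading("room-detector", "smoke", 0),
]
for source, message in alerts(sample):
    print(f"{source}: {message}")
```

An alert that names the device gives you a place to start looking before you resort to the sniff test.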

When should we troubleshoot versus hitting the Big Red Switch?

This is a more interesting question.
Hitting the big red switch can cost your company a huge amount of money in a hurry: Clean agent releases can run into the tens of thousands of dollars, and the outage / recovery costs after an emergency power off (EPO, "dropping the room") can be devastating.
You do not want to drop a datacenter because a capacitor in a power supply popped and made the room smell.

Conversely, a fire in a server room can cost your company its data/equipment, and more importantly your staff's lives.
Troubleshooting "that funny burning smell" should never take precedence over safety, so it's important to have some clear rules about troubleshooting "pre-fire" conditions.

The guidelines that follow are my personal limitations that I apply in absence of (or in addition to) any other clearly defined procedure/rules - they've served me well and they may help you, but they could just as easily get me killed or fired tomorrow, so apply them at your own risk.

  1. If you see smoke or fire, drop the room
    This should go without saying but let's say it anyway: If there is an active fire (or smoke indicating that there soon will be) you evacuate the room, cut the power, and discharge the fire suppression system.
    Exceptions may exist (exercise some common sense), but this is almost always the correct action.

  2. If you're proceeding to troubleshoot, always have at least one other person involved
    This is for two reasons. First, you do not want to be wandering around a datacenter when a rack goes up in the row you're walking down and nobody knows you're there. Second, the other person is your sanity check on troubleshooting versus dropping the room. Should you make the call to hit the Big Red Switch, you have a second person who concurs with the decision (which helps to avoid the career-limiting aspects of such a decision if someone questions it later).

  3. Exercise prudent safety measures while troubleshooting
    Make sure you always have an escape path (an open end of a row and a clear path to an exit).
    Keep someone stationed at the EPO / fire suppression release.
    Carry a fire extinguisher with you (Halon or other clean-agent, please).
    Remember rule #1 above.
    When in doubt, leave the room. Protect your breathing: use a respirator or an oxygen mask. This might save your health in case of a chemical fire.

  4. Set a limit and stick to it
    More accurately, set two limits:

    • Condition ("How much worse will I let this get?"), and
    • Time ("How long will I keep trying to find the problem before it's too risky?").

    The limits you set can also be used to let your team begin an orderly shutdown of the affected area, so that when you DO pull power you're not crashing a bunch of active machines and your recovery time will be much shorter. Remember, though, that if the orderly shutdown is taking too long you may have to let a few systems crash in the name of safety.

  5. Trust your gut
    If you are concerned about safety at any time, call the troubleshooting off and clear the room.
    You may or may not drop the room based on a gut feeling, but regrouping outside the room in (relative) safety is prudent.
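
The two limits from rule 4 can be written down as a simple check that the second person keeps an eye on. This is only a sketch: the function name `should_stop` and the numeric severity scale are hypothetical, and rules 1 and 5 override it at any time.

```python
def should_stop(elapsed_min, time_limit_min, condition, condition_limit):
    """Return True once either agreed limit is reached.

    condition and condition_limit are assumed severity scores
    (0 = faint smell, higher = worse) agreed on before you start.
    """
    return elapsed_min >= time_limit_min or condition >= condition_limit


# Limits agreed up front: 15 minutes, or severity 3.
print(should_stop(5, 15, 1, 3))    # within both limits
print(should_stop(16, 15, 1, 3))   # time limit exceeded
print(should_stop(5, 15, 3, 3))    # condition limit reached
```

The point of fixing the numbers in advance is that nobody has to argue about them while standing in the room.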

If there isn't imminent danger you may elect to bring in the local fire department before taking any drastic actions like an EPO or clean-agent release. (They may tell you to do so anyway: Their mandate is to protect people, then property, but they're obviously the experts in dealing with fires so you should do what they say!)

We've addressed this in comments, but it may as well get summarized in an answer too -- @DeerHunter, @Chris, @Sirex, and many others contributed to the discussion

Solution 2:

A thermal imaging camera could do the job by showing you where the overheating is. A device like this would also let you find the origin of a fire or burning in a smoke-filled room.


Solution 3:

Do none of the things that have been suggested. Leave the hazardous environment, because whatever is being pumped through the entire room is dangerous to your health and may really mess up your lungs. If there is an acrid smell of something burning in the room that you can't find, call (911|112|999|whatever emergency number fits your jurisdiction) and let the fire (company|department|brigade) sort it out while they're on bottled air.

Computer parts contain all sorts of interesting chemicals including mercury, cadmium, lead, and lots of plastics in casings. Even low-level exposures to these can cause lasting damage or even quick death. This is an environment that can be immediately dangerous to life and health.

... so really, if something is burning, don't spend hours sniffing the fumes. If you can't identify it and immediately act to contain it, get out.


Solution 4:

If you had proper monitoring on the UPS (usually via SNMP), the unit itself should have rung the bells on your monitoring system. If it didn't, talk to your vendor about that. It either malfunctioned or your monitoring system isn't properly configured.

If something active is actually burning, it should be complaining about it in some way, or simply be off the network, which should also cause an alarm.
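
The "off the network" case is the easier one to sketch: flag any device that has not checked in recently. The function name `silent_devices`, the device names, and the 120-second threshold below are assumptions for illustration; in practice your monitoring system's own host-down check does this job.

```python
def silent_devices(last_seen, now, max_silence_s=120):
    """Return names of devices not heard from within max_silence_s seconds."""
    return sorted(
        name for name, ts in last_seen.items() if now - ts > max_silence_s
    )


# Timestamps (in seconds) of the last successful poll per device.
last_seen = {"web-01": 1000.0, "db-02": 940.0, "pdu-a3": 700.0}
print(silent_devices(last_seen, now=1010.0))
```

A device that goes quiet at the same moment the smell appears is a good first candidate to inspect.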

If it's something like an actual power rail burning through insulation, and it's not on a smart PDU, then we're back to your original question, which is "how do I find a burning thing?" And I think the proper answer is "Hit the EPO and figure it out. Your production servers are probably not important enough to go risking lives."


Solution 5:

This is one of those situations where

[Image: XKCD Die Hard sysadmin]

doesn't apply; you should call a professional:

[Image: Firefighter in protective gear]

Anything else is just plain stupid.

Tags:

Hardware