How should I burn in hard drives?

Solution 1:

IMNSHO, you shouldn't be relying on a burn-in process to weed out bad drives and "protect" your data. Developing and implementing this procedure will take up time that could be better spent elsewhere, and even if a drive passes burn-in, it may still fail months later.

You should be using RAID and backups to protect your data. Once that is in place, let it worry about the drives. Good RAID controllers and storage subsystems will have 'scrubbing' processes that go over the data every so often and ensure everything is good.
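
For example, with Linux software RAID or ZFS you can kick off and monitor a scrub by hand. This is only a rough sketch; the md device name and pool name are placeholders:

    # Linux md RAID: start a scrub ("check") on md0 and watch its progress
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat

    # ZFS: scrub the pool and review the result (pool name is a placeholder)
    zpool scrub tank
    zpool status tank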

Once all that is taken care of, there's no need for a separate per-disk burn-in or manual scrubbing, though as others have mentioned it doesn't hurt to do a system load test to ensure that everything is working as you expect. I wouldn't worry about individual disks at all.


As has been mentioned in the comments, it doesn't make a lot of sense to use hard drives for your particular use case. Shipping them around is far more likely to cause data errors that weren't there when you did the burn-in.

Tape media is designed to be shipped around. You can get 250 MB/s (or up to 650 MB/s compressed) from a single IBM TS1140 drive, which should be faster than your hard drive. And bigger as well: a single cartridge can hold up to 4 TB (uncompressed).

If you don't want to use tape, use SSDs. They can be treated far rougher than HDDs and satisfy all the requirements you've given so far.


After all that, here are my answers to your questions:

  • How important is it to burn in a hard drive before you start using it?
    Not at all.
  • How do you implement a burn-in process?
    • How long do you burn in a hard drive?
      One or two runs.
    • What software do you use to burn in drives?
      A simple run of, say, shred and badblocks will do; check the SMART data afterwards (see the sketch just after this list).
  • How much stress is too much for a burn-in process?
    No stress is too much. You should be able to throw anything at a disk without it blowing up.
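
To make that concrete, here is a minimal sketch of the shred/badblocks/SMART run mentioned above. /dev/sdX is a placeholder for the drive under test, and everything here is destructive:

    # DESTRUCTIVE: wipes the whole drive. /dev/sdX is a placeholder.
    shred -v -n 1 /dev/sdX      # one pass of random data over the entire drive
    badblocks -sw /dev/sdX      # destructive write/read test with fixed patterns
    smartctl -H /dev/sdX        # overall SMART health verdict
    smartctl -A /dev/sdX        # look at reallocated/pending sector counts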

Solution 2:

How important is it to burn in a hard drive before you start using it?

If you have good backups and good high-availability systems, then not very important, since restoring from a failure should be pretty easy.

How do you implement a burn-in process? What software do you use to burn in drives? How much stress is too much for a burn-in process?

I will typically run badblocks against a drive or a new system when I get it, and whenever I resurrect a computer from the spares pile. A command like this (badblocks -c 2048 -sw /dev/sde) will write to every block four times, each time with a different pattern (0xaa, 0x55, 0xff, 0x00). This test does not exercise lots of random reads/writes, but it should prove that every block can be written to and read back.
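
If it helps, a rough sketch of that run with the bad-block list captured to a file and the SMART counters compared before and after; /dev/sde is the example device from above, and the log file name is just something I picked:

    # DESTRUCTIVE write-mode test; /dev/sde is the example device from above
    smartctl -A /dev/sde | grep -Ei 'Reallocated|Pending'    # note the counters first
    badblocks -c 2048 -swv -o badblocks.log /dev/sde         # four-pattern write/read pass
    smartctl -A /dev/sde | grep -Ei 'Reallocated|Pending'    # counters should not have grown
    [ -s badblocks.log ] && echo "badblocks found problems; consider rejecting this drive"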

You could also run bonnie++ or iometer, which are benchmarking tools. These should stress your drives a bit, and drives shouldn't fail even if you try to max them out, so you might as well see what they can do. I do not do this myself, though. That said, getting an I/O benchmark of your storage system right at install/setup time can be very useful later, when you are looking at performance issues.
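
If you do want that baseline, a minimal bonnie++ run might look something like this; the mount point, test size, and user are placeholders for your environment:

    # Mount point, size, and user are placeholders; -s should be at least
    # twice your RAM so the page cache doesn't mask the disks.
    bonnie++ -d /mnt/newarray -s 16g -u nobody -m "baseline-$(date +%F)" | tee bonnie-baseline.txt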

How long do you burn in a hard drive?

A single run of badblocks is enough in my opinion, but I believe I have a very strong backup system and my HA needs are not that high; I can afford some downtime to restore service on most of the systems I support. If you are so worried that you think a multi-pass setup may be required, then you probably should have RAID, good backups, and a good HA setup anyway.

If I am in a rush, I may skip the burn-in entirely; my backups and RAID should be fine.


Solution 3:

Given your clarification, it doesn't sound like any burn-in process would be of any use to you. Drives fail primarily because of mechanical factors, usually heat and vibration, not because of any sort of hidden time bomb. A "burn-in" process tests the installation environment as much as anything else; once you move the thing, you're back where you started.

But here are a few pointers that might help you:

Laptop drives are usually designed to withstand more jostling and vibration than desktop drives. My friends who work in data-recovery shops always ship data to clients on laptop drives for that reason. I've never tested this myself, but it seems to be "common knowledge" in select industries.

Flash drives (e.g. USB thumb drives) are about the most shock-resistant medium you'll find. You should be even less likely to lose data in transit if you use flash media.

If you ship a Winchester drive, do a surface scan before putting it into use. Or better yet, don't put it into use at all. Instead, you may want to designate certain drives as "shipping" drives, which see all the abuse but which you don't rely on for data integrity (i.e., copy data onto the drive for shipping, copy it off after shipping, verify checksums on both sides, that kind of thing).
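
For the checksum step, one possible sketch; the mount point and manifest path are placeholders:

    # Before shipping: build a checksum manifest of everything on the drive
    cd /mnt/shipping-drive
    find . -type f -print0 | xargs -0 sha256sum > /tmp/shipping-manifest.sha256
    # ...send the manifest via a separate channel (email, etc.)

    # After shipping: verify the drive against the manifest; prints only mismatches
    cd /mnt/shipping-drive
    sha256sum -c /tmp/shipping-manifest.sha256 | grep -v ': OK$'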


Solution 4:

Your process is wrong. You should use RAID arrays. Where I work we have built ruggedized RAID arrays that are designed to be transported around. It's not rocket science.

Shock-mounting the drives in oversize enclosures with big rubber vibration isolators will improve reliability hugely. (Seagate Constellation ES drives, for example, are rated for 300 G of shock but only 2 G of vibration, non-operating, so the shipping case needs to isolate the drive from vibration. See http://www.novibes.com/Products&productID=62 or http://www.novibes.com/Products&productId=49 [part #50178].)


However, if you really want to burn-in test hard drives, here it goes.

I've worked on systems like hard drives, and burn-in did find some problems, but...

For accelerated lifecycle testing of PCBs to bring out faults, nothing beats some hot/cold cycles. (Operating hot/cold cycles work even better, but they're harder for you to do, especially with banks of HDDs.)

Get yourself an environmental chamber big enough for the number of drives you acquire at a time. (These are pretty expensive; it'd be cheaper to ship RAID arrays around.) You can't skimp on the test chamber: you will need humidity control and programmable ramps.

Program in two repeating temperature ramps, down to the minimum storage temperature and up to the maximum storage temperature, and make the ramps steep enough to upset the application engineer from your hard drive manufacturer. Three cold/hot cycles in 12 hours should see the drives failing pretty quickly. Run the drives like this for at least 12 hours. If any still work afterwards, I'll be surprised.

I didn't think this up: at one place I worked, a production engineer did this to get more products shipped with the same test equipment. There was a huge surge in faults during testing, but the dead-on-arrival rate dropped to practically zero.


Solution 5:

I disagree with all the answers that basically say "Don't bother with burn-in, have good backups".

While you should always have backups, I spent 9 hours yesterday (on top of my usual 10-hour shift) restoring from backups because the system was running with drives that hadn't been burned in.

There were 6 drives in a RAIDZ2 config (ZFS equivalent to RAID-6) and we had 3 drives die over the course of 18 hours on a box that had been running for approximately 45 days.

The best solution I've found is to purchase drives from one particular manufacturer (don't mix-and-match), then run their provided tool for exercising the drives.

In our case we buy Western Digital and use their DOS-based drive diagnostics from a bootable ISO. We fire it up, run the option to write random garbage to the entire disk, then run the short SMART test followed by the long SMART test. That's usually enough to weed out all the bad sectors, read/write reallocations, etc...
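
If you ever need to do the SMART half of that from Linux instead of the DOS tool, smartmontools can run the same short and long self-tests. This is only an approximation of the vendor workflow, and /dev/sdX is a placeholder:

    smartctl -t short /dev/sdX      # quick self-test, a couple of minutes
    # wait for the short test to finish, then:
    smartctl -t long /dev/sdX       # extended self-test, full surface read; can take hours
    smartctl -l selftest /dev/sdX   # self-test log: look for "Completed without error"
    smartctl -A /dev/sdX            # reallocated / pending sector counts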

I'm still trying to find a decent way to 'batch' it so I can run it against 8 drives at a time. Might just use 'dd if=/dev/urandom of=/dev/whatever' in Linux or 'badblocks'.
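
Until something better turns up, a rough way to batch it from a live Linux environment might be to run badblocks against all eight drives in parallel; the /dev/sdb through /dev/sdi range is a placeholder, so double-check it doesn't include the boot disk:

    # DESTRUCTIVE: the device range is a placeholder -- verify it first!
    for dev in /dev/sd{b..i}; do
        badblocks -swv -o "badblocks-${dev##*/}.log" "$dev" &
    done
    wait    # block until every background badblocks run finishes
    for dev in /dev/sd{b..i}; do
        echo "=== $dev ==="
        smartctl -H -A "$dev" | grep -Ei 'result|Reallocated|Pending'
    done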

EDIT: I found a nicer way to 'batch' it. I finally got around to setting up a PXE boot server on our network to address a particular need, and noticed that the Ultimate Boot CD can be PXE booted. We now have a handful of junk machines sitting around that can be PXE booted to run drive diagnostics.