Best way to test new HDDs for a cheap storage server

Solution 1:

These are new disks. Either they're going to fail or they won't. You're already a huge step ahead by using the ZFS filesystem, which will give you great insight into your RAID and filesystem health...

I wouldn't do anything beyond just building the array. That's the point of the redundancy. You're not going to be able to induce a drive failure with the other listed methods.

Solution 2:

I had the same question 2 months ago. After I sent in a failed disk, the replacement failed in my NAS after 3 days, so I decided to test replacements before putting them into production. I do not test every new disk I buy, only 'refurbished' disks, which I do not completely trust.

If you decide you want to test these disks I would recommend running a badblocks scan and an extended SMART test on the brand new hard disk.

On a 2TB disk this takes up to 48 hours. The badblocks command fills the disk with a pattern, reads the blocks back to check that the pattern is actually there, and repeats this with 4 different patterns.
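The 48-hour figure is roughly what the pass count predicts: 4 patterns, each written and then read back, makes 8 full passes over the disk. Here is a rough sketch of that arithmetic. The 100 MB/s sustained throughput is my assumption (real disks vary and slow down toward the inner tracks), and the function name is made up for illustration.

```shell
# Rough estimate of how long `badblocks -w` takes, in whole hours.
# $1 = disk size in bytes, $2 = ASSUMED sustained throughput in MB/s.
badblocks_hours() {
    local size_bytes="$1"
    local mb_per_s="$2"
    local passes=8   # 4 patterns, each written once and read back once
    local seconds_per_pass=$((size_bytes / (mb_per_s * 1000000)))
    echo $((seconds_per_pass * passes / 3600))
}

badblocks_hours 2000000000000 100   # 2 TB at an assumed 100 MB/s
```

With those numbers it comes out at about 44 hours, so budget two days for a 2TB disk.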

This command will probably not report any bad blocks on a new disk, since modern disks reallocate bad blocks internally.

So before and after this I read the SMART attributes and check the reallocated and current pending sector counts. If either of these has gone up, your disk already has some bad blocks and so might prove untrustworthy.

After this I run an extended SMART test.

You might need to install smartmontools first, which provides smartctl.

Warning: the badblocks -w flag will overwrite all data on your disk. If you just want to do a read check without overwriting the disk, use badblocks -vs /dev/sdX

sudo smartctl -a /dev/sdX
# record these numbers
sudo badblocks -wvs /dev/sdX
# let it run for 48 hours
sudo smartctl -a /dev/sdX
# compare numbers
sudo smartctl -t long /dev/sdX
# this can take several hours on a large disk (smartctl prints an estimate when the test starts), check results periodically with
sudo smartctl -a /dev/sdX

If after this your SMART values seem OK, I would trust the disk.
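Comparing the numbers by eye is error-prone, so here is a minimal sketch that compares two saved reports instead. It assumes you saved the output of `smartctl -A` (attribute table only) to files before and after the badblocks run, and that the raw value is the 10th column, which is the usual layout for ATA disks; NVMe and SAS output looks different. The function and file names are made up for illustration.

```shell
# Print the raw value of one attribute from a saved `smartctl -A` report.
# ASSUMPTION: ATA-style table where RAW_VALUE is the 10th column.
smart_raw() {
    local report="$1"
    local attr="$2"
    awk -v name="$attr" '$2 == name { print $10 }' "$report"
}

# Return 0 if neither watched counter went up between the two reports,
# 1 if at least one of them did.
smart_counters_stable() {
    local before="$1"
    local after="$2"
    local attr old new grew=0
    for attr in Reallocated_Sector_Ct Current_Pending_Sector; do
        old=$(smart_raw "$before" "$attr")
        new=$(smart_raw "$after" "$attr")
        echo "$attr: ${old:-?} -> ${new:-?}"
        if [ "${new:-0}" -gt "${old:-0}" ]; then
            grew=1
        fi
    done
    return "$grew"
}

# Possible usage (before.txt and after.txt are hypothetical file names):
#   sudo smartctl -A /dev/sdX > before.txt
#   sudo badblocks -wvs /dev/sdX
#   sudo smartctl -A /dev/sdX > after.txt
#   smart_counters_stable before.txt after.txt || echo "counters went up"
```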

To learn what each SMART value means, you can start here:

http://en.wikipedia.org/wiki/Self-Monitoring,_Analysis,_and_Reporting_Technology


Solution 3:

You can use Bonnie++ for testing. It can closely emulate a file server's access pattern.

For example:

# bonnie++ -u nobody -d /home/tmp -n 100:150000:200:100 -x 300

The test will run as user 'nobody' and will create/rewrite/delete 100*1024 files, from 200 to 150000 bytes each, spread across 100 automatically created directories below /home/tmp. The -x 300 option repeats the test 300 times. You can play around with the file count, file sizes, and number of repeats.
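Before changing the numbers it helps to see what a given -n value expands to. Here is a small sketch that decodes the num:max:min:dirs form and prints a worst-case size, assuming every file lands at the maximum size (the real sizes are random between min and max, so actual usage will be lower). The function name is made up for illustration.

```shell
# Decode a bonnie++ -n spec of the form num:max:min:dirs.
# num counts files in multiples of 1024; max and min are sizes in bytes.
bonnie_n_summary() {
    local num max min dirs files
    IFS=: read -r num max min dirs <<EOF
$1
EOF
    files=$((num * 1024))
    echo "files=$files dirs=$dirs min_bytes=$min max_bytes=$max"
    # Upper bound only: assumes every file is created at the maximum size.
    echo "max_total_bytes=$((files * max))"
}

bonnie_n_summary 100:150000:200:100
```

For the example above that is 102400 files and an upper bound of about 15 GB, so make sure the target directory has that much free space.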


Solution 4:

I work for a company that does this sort of testing day in and day out, and yes, we test every single hard drive we buy. Our process starts with running the drives through a DOS-based program called HDAT2, which is free to download. It can access SMART and some other features of the drive that are inaccessible from a Windows environment. Depending on the results, we run them through one of several lines of specialized hardware, but at their core these mostly just run a SMART short self-test, a long test, a secure erase, and an all-read to verify the sectors.

My suggestion would be to run a secure erase of the full disk, then an all-read, then a SMART short self-test. This order is important: a short self-test may not find anything if run at the beginning of your testing, but after a full write and read of the disk it may pick something up. Hope this helps.
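If you don't have HDAT2 at hand, the same write / read / self-test order can be approximated on Linux. This is only a sketch under assumptions: a full overwrite with dd stands in for the real secure erase (which is a firmware command and not what dd does), /dev/sdX is a placeholder, and the function name is made up. Since the first step destroys everything on the disk, the function only prints the commands so you can check the device name before running them by hand.

```shell
# Print (do not run) a Linux approximation of the suggested order:
# full write, full read, then a SMART short self-test.
# ASSUMPTION: dd with zeros is a stand-in for the secure erase step.
print_disk_test_plan() {
    local dev="$1"
    cat <<EOF
dd if=/dev/zero of=$dev bs=1M status=progress
dd if=$dev of=/dev/null bs=1M status=progress
smartctl -t short $dev
smartctl -l selftest $dev
EOF
}

print_disk_test_plan /dev/sdX
```

The last command shows the self-test log; run it a few minutes after starting the short test.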


Solution 5:

I usually just do a full RAID init and, where applicable, begin to populate the file system while it runs, knowing all the while that a dead drive might cause a problem. This way I don't waste time on tests that are quite unreliable anyway, and I catch the really weak drives immediately. After that there may still be a somewhat elevated chance of drive failure due to "infant mortality", but there is no practical way to eliminate it.

In practice, none of the last few hundred disks I used in a RAID had any issues during the first year of operation.

Tags:

Storage