How can I do disk surface scanning, and fix/reallocate bad sectors in Linux from the command line?

This answer is about magnetic disks. SSDs are different. Also, this assumes a disk with no data (or no data you care to preserve) on it; see my answer to “Can I fix bad blocks on my hard disk with a single command” for what to do if the disk holds data you need to keep.

Disks made since at least the late 90s manage bad blocks themselves. In brief, a disk handles a bad block by transparently replacing it with a spare sector. It will do so (a) when, while reading, it discovers the block is "weak" but ECC is enough to recover the data; (b) when, while writing, it discovers the sector header is bad; (c) when writing to a sector that a previous read found bad and unrecoverable.

The disk firmware typically lets you monitor this process (the counts, at least) via SMART attributes. Usually there will be at least a reallocated sector count (attribute 5) and two pending counts (197, "current pending sector", and 198, "offline uncorrectable"): sectors discovered bad on read, where ECC failed, that have not yet been written to.
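For example, you can pull those counters out of the `smartctl -A` attribute table (part of smartmontools). The little helper below is hypothetical, but the attribute IDs and the table layout (raw value in column 10) are the conventional ones:

```shell
# Filter `smartctl -A` output down to the reallocated/pending counts.
# Attribute IDs: 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector,
# 198 = Offline_Uncorrectable. Field 10 of each table row is the raw value.
smart_counts() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2, $10 }'
}

# Usage (hypothetical device name):
#   smartctl -A /dev/sdX | smart_counts
```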

There are two ways to get the disk to notice bad sectors:

  1. Use smartctl -t offline /dev/sdX to tell the disk firmware to do an offline surface scan. You then just leave the disk alone (completely idle will be fastest) until it's done (check the "Offline data collection status" in smartctl -c /dev/sdX). This will typically update the "offline uncorrectable" count in SMART. (Note: drives can be configured to automatically run an offline check routinely.)

  2. Have Linux read the entire disk, e.g., badblocks -b 4096 -c 1024 -s /dev/sdX. This will typically update the "current pending sector" count in SMART.

Either of the above may also increase the reallocated sector count; that is case (a): the ECC recovered the data, so the drive could remap the sector right away.

Now, to recover the sectors, you just need to write to them. Normally that would be a simple pv -pterba /dev/zero > /dev/sdX (or plain cat, or dd), but you plan to make these disks part of a RAID array, and the RAID init will write to the entire disk anyway, so that's pointless. The only exception is the beginning and end of the disk: it's possible a few tens of megabytes will be missed (due to alignment, headers, etc.). So:

disk=/dev/sdX   # hypothetical device name; set to the disk you're preparing
end=$(echo "$(/sbin/blockdev --getsize64 "$disk")/4096-32768" | bc)
dd if=/dev/zero bs=4096             count=32768 of="$disk"   # first 128 MiB
dd if=/dev/zero bs=4096 seek="$end" count=32768 of="$disk"   # last 128 MiB

I think I managed to avoid the all-too-easy fencepost error1 above, so that should blank the first and last 128 MiB of the disk. Then let the mdadm RAID init write the rest. It's harmless (except for trivial wear, and wasting hours of time) to zero the whole disk if you'd like to, though.
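You can convince yourself the arithmetic really covers both ends with a quick check in the shell (the 4 TB size below is just an example stand-in for what blockdev --getsize64 would report):

```shell
# Sanity-check the dd ranges above against a hypothetical 4 TB disk.
size=4000787030016                  # bytes, as blockdev --getsize64 would report
blocks=$(( size / 4096 ))           # total 4 KiB blocks on the disk
end=$(( blocks - 32768 ))           # seek offset for the second dd
# The first dd covers blocks 0..32767; the second covers end..end+32767.
echo "last block written: $(( end + 32768 - 1 ))"
echo "last block on disk: $(( blocks - 1 ))"
```

If the two numbers printed match, the second dd ends exactly at the last block and nothing is double-counted or skipped.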

Another thing to do, if your disks support it: smartctl -l scterc,40,100 /dev/sdX (or whatever numbers) to tell the disk to give up on correcting read errors more quickly. The two numbers are the read and write error-recovery timeouts, in units of 100 ms, so 40 is 4 seconds. mdraid will easily correct read errors via parity (and write the failed sector back to the disk to let it reallocate); write errors, though, will fail the disk out of the array.
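Note that on most drives the SCT ERC setting does not survive a power cycle, so you'll want to reapply it at boot. One common approach is a udev rule; the file name, timeout values, and smartctl path below are assumptions to adjust for your system:

```
# /etc/udev/rules.d/60-scterc.rules (hypothetical file name)
# Reapply a 7-second read/write error-recovery timeout to whole disks as they appear.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"
```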

PS: Make sure to keep an eye on the reallocated sector count. That attribute reaching failed status is bad news, and so is a count that keeps climbing.
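Rather than checking by hand, smartd (from the same smartmontools package) can watch the attributes for you; a minimal /etc/smartd.conf sketch (the mail target is an assumption):

```
# Scan all disks; -a enables the standard checks (including tracking attribute
# changes such as reallocated/pending sector counts), -m root mails warnings,
# -M diminishing throttles repeated warnings about the same problem.
DEVICESCAN -a -m root -M diminishing
```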

PPS: Make sure your RAID arrays are scrubbed (every sector read and all the parity verified) routinely. Many distros already ship a script that does this monthly. Scrubbing detects and repairs new bad blocks; otherwise, seldom-read bad blocks can linger and ultimately cause a rebuild to fail.
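A scrub can also be kicked off by hand through the md sysfs interface. The helper below just wraps the standard echo-into-sysfs idiom; the second parameter is an assumption added only so the function can be exercised without a real array:

```shell
# Start a scrub ("check") of an md array via sysfs.
# $1 = array name (e.g. md0); $2 = sysfs root, defaulting to /sys.
scrub_md() {
    sysroot=${2:-/sys}
    echo check > "$sysroot/block/$1/md/sync_action"
}

# Usage: scrub_md md0
# Watch progress with:  cat /proc/mdstat
# A finished check reports mismatches in /sys/block/md0/md/mismatch_cnt.
```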

1 Fencepost error: a type of off-by-one error that comes from failing to count one of the ends. The name comes from the question: if a freestanding fence is 9 ft long with a post every 3 ft, how many posts does it have? The correct answer is 4; the fencepost error gives 3, from not counting the post at the beginning or at the end.


