What would happen if a hard drive failed while the Linux kernel was running?

Hardware failures always run some risk of crashing the kernel, since those code paths generally get much less testing, but normally a failed hard drive should not crash the kernel. What exactly happens depends on the nature of the failure. If only certain sectors become unreadable, making parts of the /home partition unreadable, the system will still be usable enough for a sysadmin to analyze the problem. If the root filesystem becomes unusable, the system is pretty much dead whether or not the kernel crashes, since even a simple shell won't be available. If a swap partition becomes unavailable, programs that are using swap will crash with a segmentation fault or similar fatal signal when it comes time to read in any swapped-out data. If the hard drive that failed is simply extra storage, it may have little effect besides some filesystems becoming unreadable.

It can also depend on what kind of errors the hard drive is throwing. I've seen a drive effectively disappear, and besides its filesystems disappearing, everything ran OK. I've also seen a hard drive continually hang the system and throw errors only after a long timeout, degrading the performance of the whole system. If you are using a layer like MD running RAID1/4/5, a severe error will normally just cause the kernel to mark the disk as failed and ignore it, relying on the remaining drives to keep the system running.
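To illustrate the MD case: when the kernel marks a member disk as failed, the device shows up in /proc/mdstat with an (F) flag after its name. The following is only a minimal sketch of spotting that flag. The function name `failed_members` and the sample text (including the device names) are made up for illustration, and the parsing assumes the usual one-line-per-array layout.

```python
import re

# Sample text in the layout of /proc/mdstat; the device names are invented.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1](F) sda1[0]
      1048512 blocks [2/1] [U_]

unused devices: <none>
"""


def failed_members(mdstat_text):
    """Map each md array name to the member devices flagged (F).

    Hypothetical helper: arrays with no failed members are left out.
    """
    failed = {}
    for line in mdstat_text.splitlines():
        # Array lines look like "md0 : active raid1 sdb1[1](F) sda1[0]".
        match = re.match(r"^(md\S+)\s*:\s*(.*)$", line)
        if not match:
            continue
        name, rest = match.groups()
        # A failed member is written as "<device>[<slot>](F)".
        bad = re.findall(r"(\S+?)\[\d+\]\(F\)", rest)
        if bad:
            failed[name] = bad
    return failed


if __name__ == "__main__":
    print(failed_members(SAMPLE_MDSTAT))
```

On a real system you would pass in the contents of /proc/mdstat instead of the sample string. For anything beyond a quick check, `mdadm --detail` on the array reports member state more reliably than scraping this file.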


On my PowerEdge 2500, when I first got it, the PERC (hardware RAID) controller's firmware was not at the latest revision. The effect was that the root disk would suddenly disappear and no longer be accessible, much as if it were a removable drive that had just been disconnected.

I couldn't load any new programs. Programs that were already loaded kept running, but got errors if they attempted to write to the disk. I still had the bash prompt I had logged into, and the network continued to function. It was surprisingly less catastrophic than I would have expected.

I think this was a "clean" failure though, because whatever driver was responsible for reading from and writing to the PERC seemed to be rejecting everything immediately with an error (I forget the exact one, but it was a SCSI sense error). It would be much worse if the drive weren't responding at all, were responding slowly, or if writes appeared to be working OK but really weren't.
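From a program's point of view, a "clean" failure like this typically surfaces as an immediate I/O error (EIO on Linux) from the read or write call. The nastier case, where writes seem to succeed but never reach the disk, can often only be noticed by forcing the data out with fsync and checking the result. Here is a rough sketch of that idea. The helper name `try_durable_write` is hypothetical, and this shows how an application might detect the problem, not how the kernel or the PERC driver handles it.

```python
import os


def try_durable_write(path, data):
    """Write data to path and fsync it.

    Returns None on success, or the errno of the first failure.
    Hypothetical helper for illustration only.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as exc:
        return exc.errno
    try:
        # os.write may write fewer bytes than asked, so loop until done.
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Without fsync the data may only sit in the page cache, and a
        # device error could go unnoticed by this process.
        os.fsync(fd)
        return None
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)


if __name__ == "__main__":
    err = try_durable_write("example.txt", b"hello\n")
    print("ok" if err is None else os.strerror(err))
```

On a disk that rejects everything immediately, a call like this should come back quickly with an error such as `errno.EIO`. On a drive that hangs, the same call would simply block, which is part of why that kind of failure is so much more painful.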