How to recover from a drive failure in a RAID 5 configuration?

Solution 1:

The system is running very slowly because any data that lived on the failed disk now has to be reconstructed from the remaining disks on every access, which costs additional CPU and I/O.
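
To make the reconstruction concrete, here is a minimal sketch of the XOR parity idea that RAID 5 is built on. It is an illustration only: the 4-disk layout, the block contents, and the helper name `xor_blocks` are assumptions made up for the example, and a real controller works on much larger stripes with rotating parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across a hypothetical 4-disk array: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the disk that held data[1]. Its contents can be rebuilt
# by XOR-ing every surviving block in the stripe, parity included.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Note that the rebuild needs every surviving block in the stripe, which is why a degraded array does extra I/O and why a second failure leaves nothing to rebuild from.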

If you have a missing disk in a RAID-5 configuration you have no redundancy left. If another disk goes down before the failed one is replaced and rebuilt, you will lose your data. Run, don't walk, to the nearest vendor from which you can get a compatible part covered by manufacturer's warranty shipped by a same-day urgent courier. If the vendor you bought the array from is already in the process of getting the part, get both parts and stash the other one away as a spare.

If you have a RAID-5 being used for a production system you should consider leaving a spare disk in the array as a hot spare.

Added - If your logs are not on a separate volume (physically separate disks) move them to a separate set of disks, even just a single mirrored pair. This will also be a performance win if your database has any significant load as contention on log volumes has a disproportionately bad effect on performance.

If this is possible you can also make your database more robust by doing the following:

  1. Shut down the database.
  2. Back up the database.
  3. Move the logs to a physically separate set of disks (make sure you reconfigure the database so it knows where the logs have been moved to).
  4. Restart the database and application.

If you have the logs on a separate volume you can restore and roll forward from the backup if and only if a disk failure does not compromise the logs. Database logs should be on a separate disk volume for (amongst others) the following reasons:

  • Log usage patterns are predominantly sequential, appending log entries onto the end of the file (the file is in effect a ring buffer). This means that a large number of log entries can be written out quickly as there is little disk head seek activity.

  • If they are sharing physical disks with a heavily random access workload (e.g. transactional tables and indexes) they will be slowed down disproportionately as the head seek activity disrupts the sequential writes.

  • Having the logs on a separate volume is almost always a performance win and only needs a single mirrored pair for logs to support quite a heavy workload. This means that the hardware to do it is quite cheap, so there is a small cost for a big performance and reliability win.

  • If your data array goes down the logs are not lost. If you have a proper backup strategy you can restore from the backup and roll forward from the logs. This means that a whole array can go down on the server without being a single point of failure. Both the log and data arrays have to fail simultaneously to cause data loss.
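
As a toy illustration of "restore from the backup and roll forward from the logs", the sketch below models the database as a Python dict and the log as a list of writes made since the backup. The names `roll_forward`, `backup` and `log_since_backup` are invented for the example and do not correspond to any particular database's API; real products have their own restore and log replay commands.

```python
def roll_forward(backup, log_entries):
    """Rebuild the current state from a backup plus the writes logged after it."""
    state = dict(backup)           # start from a copy of the restored backup
    for key, value in log_entries:
        state[key] = value         # replay each logged write in order
    return state

# Hypothetical state captured by last night's backup.
backup = {"alice": 100, "bob": 50}

# Writes recorded on the separate log volume after that backup was taken.
log_since_backup = [("alice", 80), ("carol", 20), ("bob", 70)]

# The data array has died; the backup and the logs are enough to recover.
recovered = roll_forward(backup, log_since_backup)
```

If the logs had been on the failed array as well, only the contents of `backup` would be recoverable and everything written since would be gone.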

Solution 2:

1) Backup.

Right now no data has been lost. If your backups are not up to date, back up now.

2) Read the manual, call the vendor etc.

Different RAID systems have different steps for replacing a disk, and done wrong you risk destroying the whole array. Without knowing what sort of RAID hardware/software you have we can only guess at the steps needed.

Also, the slow performance is because RAID 5 in a degraded state (i.e. one disk dead) has horrible read performance. How horrible depends on how the parity is stored and which disk died, but the "good" news is that slow performance with one disk gone is a known issue and not cause for panic.
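
For a rough feel of why degraded reads are slow, here is a simplified model, assuming single-block reads spread evenly across all disks and parity rotated evenly (neither of which is guaranteed for your array). The function names are made up for the example. A read aimed at a healthy disk costs one physical read; a read aimed at the dead disk has to read the matching block from every surviving disk.

```python
def physical_reads(target_disk, failed_disk, n_disks):
    """Physical reads needed to serve one logical block read."""
    if target_disk != failed_disk:
        return 1                # the block is still directly readable
    return n_disks - 1          # rebuild it from all surviving disks

def average_reads(n_disks, failed_disk=None):
    """Average physical reads per logical read, with reads spread evenly."""
    total = sum(physical_reads(d, failed_disk, n_disks) for d in range(n_disks))
    return total / n_disks

healthy = average_reads(4)                   # no failed disk
degraded = average_reads(4, failed_disk=0)   # one of four disks dead
```

Under these assumptions a 4-disk array goes from 1.0 to 1.5 physical reads per logical read, before counting the CPU cost of the parity calculation or any rebuild traffic once the replacement disk is in.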


Solution 3:

First I would read the manual for the hardware/software that you're using - the section for failure recovery :)

Should be a simple matter of replacing the disk and rebuilding the array though.

The most important point in such cases is that the disk should be replaced as soon as possible since if another disk fails you will probably lose data. Also you should address the cause of failure - was it because the disk was getting old? Should you replace the other ones too? Or was it because of a power surge, heat or vibration?