If a RAID5 system experiences a URE during rebuild, is all the data lost?

Solution 1:

It really depends on the specific RAID implementation:

  • most hardware RAID will abort the reconstruction and some will also mark the array as failed, bringing it down. The rationale is that if an URE happens during a RAID5 rebuild it means some data are lost, so it is better to completely stop the array rather that risking silent data corruption. Note: some hardware RAID (mainly LSI based) will instead puncture the array, allowing the rebuild to proceed while marking the affected sector as unreadable (similar to how Linux software RAID behaves).

  • linux software RAID can be instructed to a) stop the array rebuild (the only behavior of "ancient" MDRAID/kernels builds) or b) continue with the rebuild process marking some LBA as bad/inaccessible. The rationale is that it is better to let the user do his choice: after all, a single URE can be on free space, not affecting data at all (or affecting only unimportant files);

  • ZRAID will show some file as corrupted, but it will continue with the rebuild process (see here for an example). Again, the rationale is that it is better to continue and report back to the user, enabling him to make an informed choice.

Solution 2:

If URE will happen you'll experience some data corruption over the block which is typically 256KB-1MB in size, but this doesn't mean ALL the data on your volume would be lost. What's not so great about RAID5 is a totally different thing: Rebuild itself is stressful and there're high chances you'll get second disk failure in a row. In such a case all the data would be lost.

Solution 3:

I would explain it the other way around;

If the RAID controller don’t stop on URE, what could happen ?

I lived it on a server, the RAID never noticed the URE and after the rebuild a corruption started to build up on the entire RAID volume.

The disk started to get more bad sector after the rebuild and the data started to be corrupt.

The disk was never kicked off the RAID volume, the controller fail is job to protect the data integrity.

That example is wrote to make you think that a controller can’t thrust a volume with URE at all, its for the data integrity, as the volume is not meant to be a backup but a resiliance to a disk failure