If RAID5/6 are risky due to URE, are full backup/restore schemes at risk too?

Solution 1:

The real problem with UREs and RAID5 is that, upon encountering even a single URE, many hardware controllers simply abort the RAID reconstruction and declare the array dead, taking all your data offline. While this is the "safest bet" with respect to potential data corruption, it is not always the best thing to do. For example, consider a URE affecting a free sector that the filesystem has not allocated: it should be safe to ignore, yet the hardware controller will still take the entire array offline.

RAID6 is much less exposed to UREs, as the two-disk redundancy greatly lowers the chance of concurrent UREs hitting the very same sector/LBA on more than one disk.

At the same time, software RAID (e.g. mdadm) is generally much more flexible than hardware RAID, and can recover a degraded RAID5 array even when some UREs are found.

When restoring from backup, you generally have more flexible tools at hand; in the common case you can skip the broken/unreadable sectors and carry on recovering the rest of the data.
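To illustrate the "skip the unreadable sectors and carry on" idea, here is a minimal Python sketch in the spirit of tools such as GNU ddrescue. It is not how any particular backup tool works: `rescue_copy`, `read_at`, and the simulated `flaky_read` source are hypothetical names made up for this example, and a real tool would also retry and split bad blocks.

```python
import io

def rescue_copy(read_at, total_size, dst, block_size=4096):
    """Copy total_size bytes block by block, zero-filling unreadable blocks.

    read_at(offset, length) is a caller-supplied function (hypothetical)
    that returns bytes or raises OSError when the block cannot be read.
    Returns the list of offsets that could not be read.
    """
    bad_offsets = []
    for offset in range(0, total_size, block_size):
        want = min(block_size, total_size - offset)
        try:
            data = read_at(offset, want)
        except OSError:
            # Unreadable block: record it and keep going instead of aborting.
            data = b""
            bad_offsets.append(offset)
        # Pad with zeros so later blocks keep their original offsets.
        dst.write(data.ljust(want, b"\0"))
    return bad_offsets

# Simulated 16 KiB source with one unreadable 4 KiB block at offset 8192.
SOURCE = bytes(range(256)) * 64

def flaky_read(offset, length):
    if offset == 8192:
        raise OSError("simulated unreadable sector")
    return SOURCE[offset:offset + length]

out = io.BytesIO()
bad_blocks = rescue_copy(flaky_read, len(SOURCE), out)
print(bad_blocks)  # [8192]
```

Only the one bad block is lost; everything before and after it is copied intact, which is exactly the outcome a controller that aborts the whole rebuild denies you.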

Solution 2:

In principle, yes. But if you store your backup on a RAID6 (for example), you get the benefit of its redundancy, so the total error rate will be much lower, and with it the chance of a URE during recovery.

If you use a tape backup solution, the error rates are much lower to begin with (SAS: 1x10^-15 to 1x10^-16, LTO7: 1x10^-19).
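To put those figures in perspective, here is a rough Python sketch of the chance of hitting at least one URE during a full restore. It assumes the rates above are per bit read and that errors are independent, which is a simplification; the 12 TB restore size and the function name `p_at_least_one_ure` are made up for the example.

```python
import math

def p_at_least_one_ure(bytes_read, bit_error_rate):
    """Probability of at least one unrecoverable error while reading
    bytes_read bytes, assuming independent errors at a per-bit rate."""
    bits = bytes_read * 8
    # 1 - (1 - rate)^bits, computed with log1p/expm1 so that very small
    # rates such as 1e-19 do not vanish in floating-point rounding.
    return -math.expm1(bits * math.log1p(-bit_error_rate))

RESTORE_BYTES = 12 * 10**12  # hypothetical 12 TB restore

for label, rate in [("SAS  1e-15", 1e-15),
                    ("SAS  1e-16", 1e-16),
                    ("LTO7 1e-19", 1e-19)]:
    print(f"{label}: {p_at_least_one_ure(RESTORE_BYTES, rate):.6%}")
```

Under these assumptions, a 12 TB read at 1x10^-15 has roughly a 9% chance of hitting a URE, while at the LTO7 rate the chance is about one in a hundred thousand.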

Solution 3:

Anything on the volume is at risk

If your concern is a URE on a volume/LUN that has suffered a RAID 5/6 drive failure, then all of the data on that volume is at risk.

Make sure your backup is stored on a different volume/LUN than the data it protects. Best practice is to keep the backup on a completely different storage device from your production data.

UREs typically occur at the block level, so anything on that volume is at risk of corruption. The block layer is low in the stack: the filesystem (NTFS, VMFS, or any other) sits on top of it, and your data sits on top of the filesystem. Since the block level of the RAID volume lies below everything else, all data on it is affected by block-level problems.

I hope I'm addressing your question properly.