mdadm RAID5 random read errors. Dying disk?

Solution 1:

MD RAID is far too conservative about kicking out disks, in my opinion. I always watch for ATA exceptions in syslog/dmesg (I have rsyslog set up to notify me when they appear).
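For reference, a rule like the one below is roughly what I mean; a minimal rsyslog sketch (the filter string and file path are just examples), dropped into /etc/rsyslog.d/:

    # Example: /etc/rsyslog.d/30-ata-errors.conf
    # Match kernel lines such as "ataX.00: exception Emask ..." and log them to a separate file
    :msg, contains, "exception Emask" /var/log/ata-errors.log

You can then point whatever notification mechanism you like (logcheck, a mail script, etc.) at that file.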

I must say I am surprised that you get errors at the application level. RAID5 should use the parity information to detect errors (edit: apparently it doesn't; parity is only checked during verification). Having said that, whether this disk is the cause of your read errors or not, it is failing. Nearly 2000 reallocated sectors is really bad.
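If you want to trigger such a verification pass yourself, the md sysfs interface can do it; a quick sketch, assuming the array is /dev/md0:

    # Start a manual scrub of the array
    echo check > /sys/block/md0/md/sync_action
    # Watch progress
    cat /proc/mdstat
    # After it finishes, a non-zero value here means parity mismatches were found
    cat /sys/block/md0/md/mismatch_cnt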

The partitions on the new disk can be bigger (if they were smaller, you couldn't add them as spares at all), but to be sure everything matches you can clone partition tables using fdisk, sfdisk or gdisk. You have GPT, so let's use its backup feature. If you run gdisk /dev/sdX, you can use b to back the partition table up to a file. Then, on the new disk, run gdisk /dev/sdY, use r for the recovery options, then l to load the backup. After that you should have an identical partition table and all mdadm --manage --add commands should work. (You will need to remove the new disk from the array before changing its partition table.)
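Roughly, the sequence looks like this (the disk names, partition number and backup path are placeholders for your situation):

    # On a healthy array member: main menu option 'b' writes the GPT to a backup file
    gdisk /dev/sdX        # b -> e.g. /root/sdX-table.gpt
    # On the replacement disk: 'r' for recovery options, 'l' to load the backup, 'w' to write
    gdisk /dev/sdY        # r -> l -> /root/sdX-table.gpt -> w
    # Then add the matching partition back into the array
    mdadm --manage /dev/md0 --add /dev/sdY1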

I actually tend to keep those backup partition tables around on servers. It makes for fast disk replacements.
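sgdisk can do the same thing non-interactively, which is handy for keeping those backups around; a sketch with example paths:

    # Save the GPT of an array member somewhere off the disk itself
    sgdisk --backup=/root/gpt-backups/sda.gpt /dev/sda
    # Later, write it onto a replacement disk
    sgdisk --load-backup=/root/gpt-backups/sda.gpt /dev/sdY
    # Give the clone fresh GUIDs so it doesn't collide with the original
    sgdisk --randomize-guids /dev/sdY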

And a final piece of advice: don't use RAID5. RAID5 with disks this large is flaky. You should be able to add a disk and dynamically migrate to RAID6; I don't remember the exact procedure off the top of my head, but you can Google it.
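The migration generally has this shape (device names, the device count and the backup file are placeholders; double-check against the mdadm man page for your version before running anything):

    # Add the new disk as a spare first
    mdadm --manage /dev/md0 --add /dev/sdZ1
    # Reshape to RAID6; --raid-devices is the old count plus one,
    # and the backup file must live outside the array being reshaped
    mdadm --grow /dev/md0 --level=6 --raid-devices=4 --backup-file=/root/md0-grow.backup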

Solution 2:

It's pretty common to have a cron task that initiates parity mismatch checks. I'm pretty sure Debian 9 does this by default when the mdadm package is installed, so your system's logs should already have reports about it.
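On Debian that check is normally driven by the cron job shipped with the mdadm package, so something along these lines should show whether it ran and what it reported (exact paths can differ per release):

    # The packaged cron job calls the checkarray helper on a schedule
    cat /etc/cron.d/mdadm
    # Look for the results of past checks
    grep -i 'mismatch' /var/log/syslog
    grep -i 'md/raid' /var/log/kern.log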

Besides, if your system's RAM is failing, that could also be the primary cause.
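To rule that out, run a memory test; a quick sketch using memtester (a userspace tool, so it can't cover every byte of RAM the way a memtest86+ boot can):

    # Test 2 GiB of RAM for 3 passes; adjust the size to what's actually free on the machine
    memtester 2048M 3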