How to interpret this smartctl (smartmon) data

Solution 1:

For Seagate disks (and possibly some old WD drives too), the Seek_Error_Rate and Raw_Read_Error_Rate raw values are 48-bit numbers: the most significant 16 bits are an error count, and the low 32 bits are the number of operations performed.

% python
>>> 200009354607 & 0xFFFFFFFF
2440858991
>>> (200009354607 & 0xFFFF00000000) >> 32
46

So your disk has performed 2440858991 seeks, of which 46 failed. My experience with Seagate drives is that they tend to fail when the number of errors goes over 1000. YMMV.
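If you want to script this decoding, a small helper can do the split. This is a sketch assuming the 16/32-bit layout described above, which is Seagate-specific and not guaranteed for other vendors:

```python
def decode_seagate_raw(raw):
    """Split a Seagate 48-bit SMART raw value into (errors, operations).

    High 16 bits: error count; low 32 bits: total operations.
    This layout is assumed from the description above and is
    vendor-specific.
    """
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations

# The Seek_Error_Rate raw value from the question:
print(decode_seagate_raw(200009354607))  # (46, 2440858991)
```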

Solution 2:

The "seek error rate" and "raw read error rate" RAW_VALUEs are virtually meaningless to anyone but Seagate's support staff. As others have pointed out, raw values of attributes like "reallocated sector count", or entries in the drive's error log, are far better indicators of impending failure.

But you can take a look at the interpreted data in the VALUE, WORST and THRESH columns which are meant to be read as gauges:

  7 Seek_Error_Rate         0x000f   077   060   030

Meaning that your seek error rate is currently considered to be "77% good" and is reported as a problem by SMART when it reaches "30% good". It had been as low as "60% good" once, but has magically recovered since. Note that the interpreted values are calculated by the drive's SMART logic internally and the exact calculation may or may not be published by the manufacturer and typically cannot be tweaked by the user.
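To illustrate the gauge reading, here is a hypothetical parser for an attribute line like the one above. The field positions are assumed from the truncated sample line; real smartctl -A output has additional columns (TYPE, UPDATED, WHEN_FAILED, RAW_VALUE), so this is a sketch, not a general parser:

```python
def attribute_status(line):
    """Interpret a truncated SMART attribute line like the sample above.

    Compares the normalized VALUE against THRESH; a value at or below
    the threshold is what the drive itself reports as a failure.
    """
    fields = line.split()
    name = fields[1]
    value, worst, thresh = (int(f) for f in fields[3:6])
    failing = value <= thresh
    return name, value, worst, thresh, failing

line = "  7 Seek_Error_Rate         0x000f   077   060   030"
print(attribute_status(line))  # ('Seek_Error_Rate', 77, 60, 30, False)
```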

Personally, I consider a drive with error log entries to be "failing" and push for replacement as soon as they appear. But all in all, SMART data has turned out to be a rather weak predictor of failure, as Google's disk-failure study found.

Solution 3:

In my experience, Seagates report weird numbers for those two SMART attributes. When diagnosing a Seagate I tend to ignore them and look more closely at other fields like Reallocated Sector Count. Of course, when in doubt, replace the drive; but even brand-new Seagates will show high numbers for those attributes.

Solution 4:

I realize this discussion is a bit old, but I want to add my 2 cents. I have found SMART information to be quite a good pre-failure indicator. When a SMART threshold is tripped, replace the drive; that is what those thresholds are for.

Most of the time you will first start to see bad sectors, which is a sure sign the drive is failing. SMART has saved me many times. I use software RAID 1, and it's very helpful since you simply replace the failing drive and rebuild the array.

I also run short and long self-tests weekly:

smartctl -t short /dev/sda
smartctl -t long /dev/sda 

Or add entries to /etc/smartd.conf and have smartd email you when there are errors. The -s L/../../3/22 directive schedules a long self-test every Wednesday at 22:00 (the day-of-week field counts Monday as 1, so 7 is Sunday), and -I 194 tells smartd to ignore changes in attribute 194 (temperature):

/dev/sda -s L/../../3/22 -I 194 -m someemail@somedomain
/dev/sdb -s L/../../7/22 -I 194 -m someemail@somedomain

Make sure to install logwatch, redirect root's mail to a real email address, and check the daily logwatch emails. Tripped SMARTD flags will show up there, but they're of no help if nobody reads them regularly.

Solution 5:

Sorry to commit necromancy on this post, but in my experience the "Raw Read Error Rate" and "Hardware ECC Recovered" fields on a Seagate drive will go all over the place, incrementing constantly into the trillions and then wrapping back around to zero to start over. I have a Seagate ST9750420AS that has behaved this way since day one and still works great after several years and 3500+ hours of use.

I think those fields can be safely ignored if you're running a Seagate in your machine. Just make sure the two fields report the same number and stay in sync. If they don't... well, that might actually indicate a problem.