Deciphering continuing mpt2sas syslog messages

Solution 1:

The most likely cause is a hardware problem somewhere between your disks and your SAS RAID controller, the controller itself included. I recommend trying:

  1. Run any diagnostic tools from the vendor(s), if they are available.
  2. Check, re-seat, or replace the cables.
  3. Strip out and swap hardware components in the chain that connects the disks to your RAID controller, including the controller itself (i.e., for you, try something other than the motherboard's integrated RAID).

I had one of two identical Dell PowerEdge R515 servers giving very similar messages (logs periodically filling up with mpt2sas0 messages, though I do not have the exact numeric codes). Dell's own bootable diagnostic flagged these as "hardware errors", and replacing the RAID SAS backplane solved the issue.

When I was investigating, I could not find a comprehensive resource explaining what the various mpt2sas0 error codes mean. I suspect they may even be hardware-vendor-specific (someone who knows more about SAS needs to confirm or deny this). So your error codes could mean something quite different, but if SMART is clean it is hard to imagine other good reasons for mpt2sas0 to report error codes.
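Even without knowing what each code means, it can help to see which codes repeat and how often. Here is a minimal Python sketch that tallies them from syslog lines. It assumes the driver prints the code as `log_info(0x........)` after the `mpt2sas0:` prefix; the sample lines and the second code value are made up for illustration, not taken from a real log.

```python
import re
from collections import Counter

# Assumed message shape: "... mpt2sas0: log_info(0x31120303): ..."
LOG_INFO_RE = re.compile(r"(mpt2sas\d+):.*?log_info\((0x[0-9a-fA-F]{1,8})\)")


def tally_log_info(lines):
    """Count occurrences of each (controller, log_info code) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_INFO_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2).lower())] += 1
    return counts


# Made-up sample lines, only to show the expected input shape.
SAMPLE_LINES = [
    "Jan  1 00:00:01 host kernel: mpt2sas0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)",
    "Jan  1 00:00:02 host kernel: sd 0:0:3:0: [sdd] some unrelated message",
    "Jan  1 00:00:09 host kernel: mpt2sas0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)",
    "Jan  1 00:01:30 host kernel: mpt2sas0: log_info(0x31110D00): originator(PL), code(0x11), sub_code(0x0d00)",
]

for (controller, code), count in tally_log_info(SAMPLE_LINES).most_common():
    print(controller, code, count)
```

To run it against a real log, pass an open file instead of the sample list, e.g. `tally_log_info(open("/var/log/syslog"))`; the log path depends on your distribution.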

These errors can be very serious. My R515 seemingly worked OK with these messages for a week with a 12-disk Ubuntu Linux software RAID 6, but then suddenly ejected all 12 disks from the array as broken (!)

Also, in my case the SMART data for all disks was completely clean. A good check is a SMART self-diagnostic test: start it with smartctl -t long /dev/sdX, and then check the results about a day later with smartctl -l selftest /dev/sdX. If all is OK, the test status should say Completed and the LBA_first_err column should be empty.
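If you have many disks, checking each self-test log by eye gets tedious. Below is a rough Python sketch that reads the text printed by smartctl -l selftest and reports whether the most recent test passed. It makes several assumptions: entry rows start with `#` followed by the test number, columns are separated by two or more spaces, and entry 1 is the newest. The sample text is a hand-written approximation of the log layout, not captured output, and the layout differs between SAS and ATA disks and between smartctl versions.

```python
import re

# "Completed" is the status named above; "Completed without error" is the
# wording ATA disks commonly use (an assumption, adjust for your drives).
CLEAN_STATUSES = {"Completed", "Completed without error"}


def parse_selftest_log(text):
    """Return a list of (test_number, description, status) tuples."""
    entries = []
    for line in text.splitlines():
        match = re.match(r"^#\s*(\d+)\s+(.*)$", line)
        if not match:
            continue  # header or blank line
        # Columns are assumed to be separated by at least two spaces.
        fields = re.split(r"\s{2,}", match.group(2).strip())
        if len(fields) < 2:
            continue
        entries.append((int(match.group(1)), fields[0], fields[1]))
    return entries


def latest_selftest_clean(text):
    """True if the newest entry (assumed to be the lowest number) is clean."""
    entries = parse_selftest_log(text)
    if not entries:
        return False  # no test on record counts as "not verified"
    latest = min(entries, key=lambda entry: entry[0])
    return latest[2] in CLEAN_STATUSES


# Hand-written approximations of the log layout, for illustration only.
SAMPLE_CLEAN = """\
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -     123                 - [-   -    -]
"""

SAMPLE_FAILED = """\
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       3     150           1234567 [0x3 0x11 0x0]
# 2  Background long   Completed                   -     123                 - [-   -    -]
"""

print(latest_selftest_clean(SAMPLE_CLEAN))
print(latest_selftest_clean(SAMPLE_FAILED))
```

The function takes plain text on purpose, so you can save the smartctl output for each disk to a file and check the files without needing root when you run the script.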

Solution 2:

Wow, a tough one.

One source seems to indicate that 0x31120303 is a bus reset due to one of your devices being under heavy load. It also says you don't need to worry about it. (Haha, yeah right.)

Another source indicates that these log messages are happening because one of your devices is taking too long to respond to commands. A third says the same thing, and also indicates it occurs under heavy load.
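For what it's worth, the 32-bit value itself can at least be split into fields. The sketch below assumes the layout I believe the Linux mpt2sas driver uses when it prints these lines: the top 4 bits are the bus type (3 for SAS), the next 4 bits the originator (IOP, PL, or IR), then an 8-bit code and a 16-bit sub-code. It only splits the number; it does not tell you what a given code means.

```python
# Originator names as assumed from the driver's log output.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value):
    """Split a 32-bit log_info value into its assumed bit fields."""
    return {
        "bus_type": (value >> 28) & 0xF,    # 3 is assumed to mean SAS
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


fields = decode_log_info(0x31120303)
print(fields["bus_type"], fields["originator"],
      hex(fields["code"]), hex(fields["sub_code"]))
```

For 0x31120303 this gives bus type 3, originator PL, code 0x12, and sub-code 0x303, so under this layout the message would come from the controller's protocol layer (PL) rather than from the integrated RAID (IR) firmware.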

While this isn't a complete answer, it hopefully will point you in a useful direction.