How do I get my HP servers to email me when a drive fails?

Solution 1:

This depends slightly on the operating systems you're running on the servers, but in general, it is possible to obtain alerts from HP ProLiant servers and Smart Array RAID controllers.

The full driver and software support listing for your DL380 G5 systems is available here.

SNMP combined with a monitoring solution is the best approach, but you can augment that with some of HP's tools. HP offers HP Systems Insight Manager, which is available for download and also comes with the servers. It is ideal for collections of servers. If you're looking for one-off alerts without building a management or monitoring infrastructure, you can simply install the HP Management Agents (aka ProLiant Support Pack).

For standalone Linux systems, I'll have the agents send traps via email. I'll usually configure the support pack with defaults or a custom bundle, then edit /opt/hp/hp-snmp-agents/cma.conf and change the trapemail line to point to the recipient address:

########################################################################
# trapemail is used for configuring email command(s) which will be
# executed whenever a SNMP trap is generated.
# Multiple trapemail lines are allowed.
# Note: any command that reads standard input can be used. For example:
#             trapemail /usr/bin/logger
#       will log trap messages into system log (/var/log/messages).
########################################################################
trapemail /bin/mail -s 'HP Insight Management Agents Trap Alarm' admin@example.com
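Since trapemail accepts any command that reads standard input, you can also point it at a small wrapper instead of calling mail directly. Here is a minimal sketch, assuming a hypothetical helper function named format_trap that you would save in your own script. It only prefixes the trap text with the hostname and a timestamp; the function name and the header layout are my own choices, not something the HP agents provide.

```shell
#!/bin/bash
# Hypothetical helper for a trapemail wrapper script: reads the trap text
# on standard input and writes it back out with a short header in front.
format_trap() {
    local host
    host=$(uname -n)
    printf 'Host: %s\nReceived: %s\n\n' "$host" "$(date '+%Y-%m-%d %H:%M:%S')"
    cat    # pass the trap text through unchanged
}

# In the wrapper script itself, the last line would be something like:
#   format_trap | /bin/mail -s "HP trap on $(uname -n)" root
```

The trapemail line would then name the wrapper script rather than /bin/mail. Testing it by hand is just a matter of piping some text into it.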

If you're running Linux and don't want to install the full HP management suite, you can develop a script around the cciss_vol_status utility to query controller/disk status. Also see: Installing HP Agents on OpenFiler

Solution 2:

Check out HP Systems Insight Manager:

https://www.hpe.com/us/en/product-catalog/detail/pip.489496.html#

I believe it should work with your servers.


Solution 3:

I used the lightweight program that @ewwite mentioned in his answer: cciss_vol_status

If you follow the accompanying INSTALL instructions, the binary is installed as /usr/local/bin/cciss_vol_status.

Here is a wrapper script I use to grep the output of cciss_vol_status, and send an email if any array has a status of FAILED.

#!/bin/bash
#
# Check status of RAID volumes on HP Smart Array controllers.  Send an email
# alert if any volumes have a FAILED status.
#
status=$(/usr/local/bin/cciss_vol_status /dev/sd*)

# email lock file; its modification time records when the last alert was sent
lockfile=/tmp/raid.check.hp.smartarray.lock
# minimum time between emails (minutes)
_notification_freq=59
_host=$(hostname)
# To: email
_toemail=root

if printf '%s\n' "$status" | grep -q FAILED
then
    # send only if no alert has been sent yet, or the last one is older
    # than X minutes (a missing lock file must not delay the first alert)
    if [ ! -f "$lockfile" ] || [ -n "$(find "$lockfile" -mmin +"$_notification_freq")" ]
    then
        printf '%s\n' "$status" | /bin/mail -s "System Alert! RAID failure on ${_host}" "$_toemail"

        # update lock file mod time
        /bin/touch "$lockfile"
    fi
fi
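If you want to try the throttling without waiting for an array to fail, the lock file check can be pulled out into a function. This is only a sketch: should_notify is a hypothetical name, the stamp file comes from mktemp rather than the real lock file, and the touch -d syntax assumes GNU coreutils. It treats a missing stamp file as "never notified".

```shell
#!/bin/bash
# Succeeds (exit 0) when an alert may be sent: either no alert has been
# recorded yet, or the last one is older than the given number of minutes.
should_notify() {
    local stamp=$1 minutes=$2
    [ ! -e "$stamp" ] && return 0
    [ -n "$(find "$stamp" -mmin +"$minutes")" ]
}

# Demonstration with a throwaway stamp file (path comes from mktemp)
stamp=$(mktemp)
rm -f "$stamp"
should_notify "$stamp" 59 && echo "no stamp yet: notify"
touch "$stamp"
should_notify "$stamp" 59 || echo "stamp is fresh: stay quiet"
touch -d '2 hours ago' "$stamp"    # GNU touch syntax
should_notify "$stamp" 59 && echo "stamp is stale: notify again"
rm -f "$stamp"
```

The stamp file's modification time is the only state, so there is nothing to clean up apart from the file itself.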

Save the script as /usr/local/bin/raid.check.hp.smartarray.sh, make it executable, and call it from cron. I run the check every two minutes:

*/2 * * * * /usr/local/bin/raid.check.hp.smartarray.sh

We do use HP Systems Insight Manager to check whether our HP servers are up and running, but nothing beyond that. I found the Linux agent to be overkill for us, since we have other monitoring solutions in place, so the script above serves its specific purpose well.

UPDATE

Just a troubleshooting tip in case you run into this. This script proved helpful this morning when I got an email about a failed array with:

Cache dirty limit reached

The device went read-only and was not visible in /proc/partitions. I rebooted the server and saw these messages on boot:

Logical drive(s) disabled due to possible data loss.
Select "F1" to continue with logical drive(s) disabled
Select "F2" to accept data loss and to re-enable logical drive(s)

I selected F2 and the RAID was fine and mounted on boot.
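To check for the "not visible in /proc/partitions" symptom from a script, something like the following could work. It is a rough sketch: device_visible is a hypothetical name, and the sample table is made up for illustration, not captured from a real server. The device name you look for depends on your driver, so sda here is just an assumption.

```shell
#!/bin/bash
# Succeeds when a block device name appears in the "name" column of a
# partitions table (default: /proc/partitions). The first two lines of
# that file are a header and a blank line, so they are skipped.
device_visible() {
    local name=$1 table=${2:-/proc/partitions}
    awk -v n="$name" 'NR > 2 && $4 == n { found = 1 } END { exit !found }' "$table"
}

# Demonstration against an illustrative table written to a temp file
sample=$(mktemp)
cat > "$sample" <<'EOF'
major minor  #blocks  name

   8        0  143338560 sda
   8        1     512000 sda1
EOF
device_visible sda "$sample" && echo "sda is visible"
device_visible sdb "$sample" || echo "sdb is missing"
rm -f "$sample"
```

On a live system you would call it with just the device name, so that it reads /proc/partitions itself.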