What Warning and Critical values to use for check_load?
Though it's an old post, I'm replying now because I know check_load threshold values are a big headache for newbies. ;)
The command below raises a warning alert if CPU load is 70% over 5 min, 60% over 10 min, or 50% over 15 min, and a critical alert if it is 90% over 5 min, 80% over 10 min, or 70% over 15 min.
*command[check_load]=/usr/local/nagios/libexec/check_load -w 0.7,0.6,0.5 -c 0.9,0.8,0.7*
All my findings about CPU load:
What's meant by "the load"? Wikipedia says:
All Unix and Unix-like systems generate a metric of three "load average" numbers in the kernel. Users can easily query the current result from a Unix shell by running the uptime command:
$ uptime
14:34:03 up 10:43, 4 users, load average: 0.06, 0.11, 0.09
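From the above output, a load average of 0.06, 0.11, 0.09 means (on a single-CPU system):
- during the last minute, the CPU was loaded about 6% on average (idle roughly 94% of the time)
- during the last 5 minutes, the CPU was loaded about 11% on average (idle roughly 89% of the time)
- during the last 15 minutes, the CPU was loaded about 9% on average (idle roughly 91% of the time)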
$ uptime
14:34:03 up 10:43, 4 users, load average: 1.73, 0.50, 7.98
The above load average of 1.73, 0.50, 7.98 on a single-CPU system means:
- during the last minute, the CPU was overloaded by 73% (1 CPU with 1.73 runnable processes, so that 0.73 processes had to wait for a turn)
- during the last 5 minutes, the CPU was idle 50% of the time (no processes had to wait for a turn)
- during the last 15 minutes, the CPU was overloaded by 698% (1 CPU with 7.98 runnable processes, so that 6.98 processes had to wait for a turn)
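The reading above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the list; the function name `describe_load` and its wording are made up for this example and are not part of Nagios or any plugin.

```python
def describe_load(load, cpus=1):
    """Describe one load-average figure relative to the number of CPUs.

    Hypothetical helper for illustration; it mirrors the single-CPU
    reading of the uptime output given above.
    """
    percent = round(load / cpus * 100)
    if load > cpus:
        # More runnable processes than CPUs: the excess had to wait.
        waiting = round(load - cpus, 2)
        return f"overloaded by {percent - 100}% ({waiting} processes waiting on average)"
    return f"{percent}% busy, {100 - percent}% idle (no processes waiting)"


for value in (1.73, 0.50, 7.98):
    print(value, "->", describe_load(value))
```

On a live Unix system the three figures could be taken from Python's os.getloadavg() instead of being typed in by hand.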
Nagios threshold value calculation:
For Nagios CPU Load setup, which includes warning and critical:
y = c * p / 100
y = nagios value
c = number of cores
p = desired load in percent
For a 4-core system:
time       5 min   10 min   15 min
warning:   90%     70%      50%
critical:  100%    80%      60%
command[check_load]=/usr/local/nagios/libexec/check_load -w 3.6,2.8,2.0 -c 4.0,3.2,2.4
For a single core system:
y = p / 100
y = nagios value
p = desired load in percent
time       5 min   10 min   15 min
warning:   70%     60%      50%
critical:  90%     80%      70%
command[check_load]=/usr/local/nagios/libexec/check_load -w 0.7,0.6,0.5 -c 0.9,0.8,0.7
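If you don't want to do the multiplication by hand for every server, the formula y = c * p / 100 is easy to script. The following is a minimal sketch; the function names `load_thresholds` and `check_load_args` are invented for this example, and it only builds the -w/-c argument string, it does not call check_load.

```python
def load_thresholds(cores, percents):
    """Apply y = c * p / 100 to each percentage and join with commas."""
    return ",".join(str(round(cores * p / 100, 2)) for p in percents)


def check_load_args(cores, warning, critical):
    """Build the -w/-c part of a check_load command line (illustrative helper)."""
    return f"-w {load_thresholds(cores, warning)} -c {load_thresholds(cores, critical)}"


# The two cases worked out above: 4 cores and a single core.
print(check_load_args(4, (90, 70, 50), (100, 80, 60)))
print(check_load_args(1, (70, 60, 50), (90, 80, 70)))
```

The two printed lines match the -w/-c values used in the command definitions above.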
A great white paper about CPU load analysis by Dr. Gunther: http://www.teamquest.com/pdfs/whitepaper/ldavg1.pdf
In this online article, Dr. Gunther digs down into the UNIX kernel to find out how load averages (the “LA Triplets”) are calculated and how appropriate they are as capacity planning metrics.
Linux load is actually simple. Each of the load average numbers is the sum of the average loads of all cores, i.e.:
1 min load avg = load_core_1 + load_core_2 + ... + load_core_n
5 min load avg = load_core_1 + load_core_2 + ... + load_core_n
15 min load avg = load_core_1 + load_core_2 + ... + load_core_n
0 <= avg load < infinity.
So if a load is 1 on a 4 core server, then it either means each core is used 25% or one core is 100% under load. A load of 4 means all 4 cores are under 100% load. A load of >4 means the server needs more cores.
check_load now has
-r, --percpu Divide the load averages by the number of CPUs (when possible)
which means that, when it is used, you can think of your server as having just one core and write the percent fractions directly, without thinking about the number of cores. With -r, the warning and critical intervals become 0 <= load avg <= 1, i.e. you don't have to modify your warning and critical values from server to server.
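To make the idea concrete, here is a rough Python sketch of what dividing by the number of CPUs does to the comparison. It is not check_load's actual code: the function name `per_cpu_status` is hypothetical, and it assumes a threshold counts as breached only when the per-CPU load is strictly greater than it.

```python
def per_cpu_status(loads, cores, warning, critical):
    """Normalise load averages by core count, then compare to thresholds.

    Assumption for this sketch: a threshold is breached when the
    per-CPU load is strictly greater than it.
    """
    per_cpu = [load / cores for load in loads]
    if any(value > limit for value, limit in zip(per_cpu, critical)):
        return "CRITICAL"
    if any(value > limit for value, limit in zip(per_cpu, warning)):
        return "WARNING"
    return "OK"


warning = (0.7, 0.6, 0.5)
critical = (0.9, 0.8, 0.7)

# The same thresholds work for a 4-core and a 16-core server.
print(per_cpu_status((3.0, 2.0, 1.0), 4, warning, critical))
print(per_cpu_status((3.0, 2.0, 1.0), 16, warning, critical))
```

The same raw load of 3.0 is a warning on the 4-core box (0.75 per CPU) but fine on the 16-core box, with no change to the thresholds.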
The OP has 5, 10, 15 for the intervals. That is wrong; it is 1, 5, 15.
Unless the servers in question have an asynchronous workload where queue depth is the important service metric to manage, it's honestly not even worth monitoring load average. It's just a distraction from the metrics that matter, like service time (service time, and service time).
A good complement to Nagios is a tool like Munin or Cacti; these will graph the different kinds of workload your server is experiencing, be it load average, CPU usage, disk I/O or something else.
Using this information it is easier to set good threshold values in Nagios.