OOM killer killing things with plenty(?) of free RAM

Solution 1:

I just looked at the oom log dump, and I question the accuracy of that graph. Notice the first 'Node 0 DMA32' line. It says free:3376kB, min:3448kB, and low:4308kB. Whenever the free value drops below low, kswapd is supposed to start reclaiming (swapping out) pages until free gets back above the high value. Whenever free drops below min, the system essentially freezes allocations until the kernel gets it back above min. The message also shows that swap was completely used: Free swap = 0kB.
So kswapd triggered, but swap was full so it couldn't do anything, and pages_free was still below pages_min, so the only option left was to start killing processes until pages_free recovered.
You definitely ran out of memory.

http://web.archive.org/web/20080419012851/http://people.redhat.com/dduval/kernel/min_free_kbytes.html has a really good explanation of how that works. See the 'Implementation' section at the bottom.
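As a rough sketch, you can inspect those watermarks yourself on a live system (zone names and values will differ per machine):

```shell
# Per-zone watermarks; "pages free", "min", "low" and "high" are counted
# in 4 KiB pages, so multiply by 4 to compare with the kB figures above.
grep -A5 '^Node' /proc/zoneinfo

# All of the watermarks are scaled from this single tunable (in kB):
cat /proc/sys/vm/min_free_kbytes
```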

Solution 2:

Get rid of the drop_caches script. In addition, you should post the relevant portions of your dmesg and /var/log/messages output showing the OOM messages.
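For reference, such a script usually amounts to something like the fragment below (an assumption about its contents, since the script itself wasn't posted). Whatever form it takes, it only discards clean page cache, so it doesn't create any memory the kernel couldn't have reclaimed on its own:

```shell
# Typical drop_caches cron job (assumed form) -- this is what to remove.
# Writing to this file requires root.
sync
echo 3 > /proc/sys/vm/drop_caches   # 1 = page cache, 2 = dentries/inodes, 3 = both
```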

To stop this behavior, though, I'd recommend trying the sysctl tunable below. This is a RHEL/CentOS 6 system and it is clearly running on constrained resources. Is it a virtual machine?

Try modifying /proc/sys/vm/nr_hugepages and see if the issue persists. This could be a memory fragmentation problem, but see if this setting makes a difference. To make the change permanent, add vm.nr_hugepages = value to your /etc/sysctl.conf and run sysctl -p to reread the config file.
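As a sketch, assuming an arbitrary example value of 128 (tune it for your workload; the writes require root):

```shell
# Current value:
cat /proc/sys/vm/nr_hugepages

# Set it at runtime (as root); 128 is only an example value:
sysctl -w vm.nr_hugepages=128

# Persist it across reboots:
echo 'vm.nr_hugepages = 128' >> /etc/sysctl.conf
sysctl -p
```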

Also see: Interpreting cryptic kernel "page allocation failure" messages

Solution 3:

There is no data on the graph from when the OOM killer starts until it ends. I believe that during the gap in the graph, memory consumption did in fact spike and no memory was available anymore; otherwise the OOM killer would not have been invoked. If you watch the free memory graph after the OOM killer has finished, you can see it comes back down from a higher value than before. At least it did its job properly, freeing up memory.

Note that your swap space is almost fully utilized until the reboot. That is almost never a good thing and a sure sign there is little free memory left.
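You can confirm the swap situation straight from the kernel's own counters:

```shell
# Swap totals; SwapTotal vs. SwapFree tells you how close to 0 you are:
grep -i '^Swap' /proc/meminfo

# Per-device swap usage:
cat /proc/swaps
```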

The reason there is no data for that particular time frame is that the system is too busy with other things. "Funny" values in your process list may just be a result of that, not a cause. It's not unheard of.

Check /var/log/kern.log and /var/log/messages: what information can you find there?
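A quick way to fish the OOM killer's traces out of those logs (file names vary a bit by distribution, hence the readability guard):

```shell
# Search the usual log files for OOM killer activity:
for f in /var/log/kern.log /var/log/messages; do
    [ -r "$f" ] && grep -iE 'out of memory|oom-killer|killed process' "$f"
done

# The kernel ring buffer may still hold the messages too:
dmesg | grep -iE 'oom|out of memory' || true
```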

If logging also stopped, then try other approaches: dump the process list to a file every second or so, and do the same with the system performance information. Run it at high priority so it can (hopefully) still do its job when the load spikes. If you don't have a preemptible kernel (sometimes labeled a "server" kernel), though, you may be out of luck in that regard.
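A minimal sketch of such a watchdog, with assumed paths and a bounded duration so the example terminates (in practice you would loop indefinitely and start it well before the problem window):

```shell
#!/bin/sh
# Snapshot the process list and memory stats once per second.
# OUT and DURATION are assumptions; adjust both to taste.
OUT="${OUT:-/tmp/oom-watch}"
DURATION="${DURATION:-2}"     # seconds; use a very large value in practice
mkdir -p "$OUT"
i=0
while [ "$i" -lt "$DURATION" ]; do
    ts=$(date +%s)
    ps aux            > "$OUT/ps.$ts"       # who is using CPU/memory
    cat /proc/meminfo > "$OUT/meminfo.$ts"  # overall memory state
    i=$((i + 1))
    sleep 1
done
```

Launch it at high priority, e.g. `nice -n -19 sh oom-watch.sh` (a negative niceness requires root), so the scheduler keeps it running when the load spikes.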

I think you will find that the process(es) using the most CPU% around the time your problems start are the cause. I have never seen rsyslogd or mysql behave that way. More likely culprits are Java apps and GUI-driven apps such as a browser.