How to diagnose causes of oom-killer killing processes

Solution 1:

No, the algorithm is not that simplistic. You can find more information in:

If you want to track memory usage, I'd recommend running a command like:

ps -e -o pid,user,cpu,size,rss,cmd --sort -size,-rss | head

It will give you a list of the processes using the most memory (and probably causing the OOM situation). Remove the | head if you'd prefer to check all the processes.

Put this in your cron, run it every 5 minutes, and save the output to a file. Keep at least a couple of days of history so you can check later what happened.
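A minimal sketch of that logging step, assuming a hypothetical log path of /tmp/memtop.log (you'd probably want somewhere under /var/log on a real system):

```shell
#!/bin/sh
# Append a timestamped snapshot of the top memory consumers to a log file.
# LOG path is hypothetical; change it to suit your setup.
LOG=${LOG:-/tmp/memtop.log}
{
  date
  ps -e -o pid,user,size,rss,cmd --sort -size,-rss | head
  echo
} >> "$LOG"
```

If you save that as, say, /usr/local/bin/memtop.sh (a hypothetical name), the matching crontab entry to run it every 5 minutes would be along the lines of: */5 * * * * /usr/local/bin/memtop.sh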

For critical services like ssh, I'd recommend using monit to restart them automatically in such a situation. It might save you from losing access to the machine if you don't have a remote console to it.
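A monit stanza for sshd might look roughly like this; the pidfile location and start/stop commands are assumptions that vary by distribution, so treat it as a sketch rather than a drop-in config:

```
# Hypothetical fragment for /etc/monit/conf.d/ — adjust paths for your distro.
check process sshd with pidfile /var/run/sshd.pid
  start program = "/etc/init.d/ssh start"
  stop  program = "/etc/init.d/ssh stop"
  if failed port 22 protocol ssh then restart
```

The port check means monit restarts sshd not only when the process disappears (e.g. killed by the OOM killer) but also when it stops answering on port 22.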

Best of luck,
João Miguel Neves

Solution 2:

I had a hard time with that recently, because the process(es) that the oom-killer stomps on aren't necessarily the ones that have gone awry. While trying to diagnose that, I learned about one of my now-favorite tools, atop.

This utility is like top on steroids. Over a pre-set time interval, it profiles system information. You can then play it back to see what's going on. It highlights processes that are 80%+ in blue and 90%+ in red. The most useful view is a memory usage table showing how much memory was allocated in the last time period. That's the one that helped me the most.

Fantastic tool -- can't say enough about it.

atop performance monitor
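The record-and-playback workflow described above looks something like this (the raw-file path is hypothetical, and 600 seconds is just an example interval):

```shell
# Record a sample every 600 seconds to a raw log file (path is hypothetical).
atop -w /tmp/atop.raw 600

# Later, replay the recording to see what happened around the OOM event.
# Inside the interactive view, 't' steps forward through samples and
# 'm' switches the process list to memory-related columns.
atop -r /tmp/atop.raw
```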