Bash script that automatically kills processes when CPU/memory usage gets too high

I'm guessing the problem you want to solve is that you have some process running on your box which sometimes misbehaves, and sits forever pegging a core.

The first thing you want to do is to attempt to fix the program that goes crazy. That is by far the best solution. I'm going to assume that isn't possible, or that you need a quick kluge to keep your box running until it's fixed.

At a minimum, you want to limit your script to hit only the one program you're concerned about. Ideally, permissions would enforce this (e.g., your script runs as user X, and the only other thing running as X is the program).

Even better would be to use something like ulimit -t to limit the amount of total CPU time that the program can use. Similarly, if it consumes all memory, check ulimit -v. The kernel enforces these limits; see the bash manpage (it's a shell built-in) and the setrlimit(2) manpage for details.
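As a minimal sketch of the ulimit approach, assuming bash: run_limited is a hypothetical helper name, and the limit values in the example call are placeholders, not recommendations. The limits are set inside a subshell so they apply only to the command being started.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with a CPU-time cap (in seconds) and a
# virtual-memory cap (in KiB). The subshell keeps the limits away from the
# calling shell; exec replaces the subshell with the command itself.
run_limited() {
    local cpu_seconds=$1 vmem_kib=$2
    shift 2
    (
        ulimit -t "$cpu_seconds"   # kernel signals the process once the CPU time is used up
        ulimit -v "$vmem_kib"      # allocations beyond this address-space size fail
        exec "$@"
    )
}

# Example: a busy loop that would otherwise spin forever is stopped by the
# kernel after about one second of CPU time.
run_limited 1 1048576 bash -c 'while :; do :; done' || echo "stopped with status $?"
```

Note that ulimit -t counts CPU time, not wall-clock time, so a process that mostly sleeps is not affected.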

If the problem isn't a process running amok, but is instead just too many processes running, then implement some form of locking to prevent more than X from running (or, and this should be getting familiar, ulimit -u). You may also consider lowering the scheduler priority of those processes (using nice or renice) or, more drastically, using sched_setscheduler to change the policy to SCHED_IDLE.
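One way the locking idea could look, as a sketch only: it assumes flock(1) from util-linux is installed, and run_with_slot, the lock directory, and the return code 75 are all made up for illustration. Each of the X slots is a lock file; a job starts only if it can grab a free slot, and it runs at the lowest priority via nice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command only if one of max_jobs lock slots is free.
# Usage (placeholder paths): run_with_slot /tmp/myjob.locks 3 ./myjob
run_with_slot() {
    local lock_dir=$1 max_jobs=$2 slot status
    shift 2
    mkdir -p "$lock_dir"
    for ((slot = 0; slot < max_jobs; slot++)); do
        # Open this slot's lock file on fd 9 and try to lock it without blocking.
        exec 9>"$lock_dir/slot.$slot"
        if flock -n 9; then
            # Got a slot: run the command at the lowest scheduling priority.
            nice -n 19 "$@"
            status=$?
            exec 9>&-            # closing the fd releases the lock
            return "$status"
        fi
        exec 9>&-
    done
    echo "all $max_jobs slots busy, not starting: $*" >&2
    return 75                    # arbitrary "try again later" code
}
```

The lock belongs to the open file, so it disappears automatically if the job crashes; there are no stale lock files to clean up.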

If you need even more control, take a look at control groups (cgroups). Depending on the kernel you're running, you can actually limit the amount of CPU time, memory, I/O, etc. that a whole group of processes together consume. Control groups are quite flexible; they can likely do whatever you're trying to do, without any fragile kluges. The Arch Linux Wiki has an intro to cgroups that's worth reading, as is Neil Brown's cgroups series at LWN.


Issues:

  • When sorting numeric fields you probably want to use the -n option: sort -nrk 2. Otherwise a line with a %CPU value of 5.0 will end up higher than one with a value of 12.0.
  • Depending on your ps implementation you might want to use the --no-headers option to get rid of the grep -v. That also keeps you from accidentally discarding processes whose command line contains the string PID.
  • I guess instead of echo CPU USAGE is at $CPU_LOAD, you meant echo CPU USAGE is at $CPU_USAGE.
  • I guess you forgot to remove the exit 0 that you inserted during debugging(?).
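To see the sorting issue from the first point in isolation, here is a small demonstration on made-up data in "PID %CPU COMMAND" layout (the PIDs and command names are invented):

```shell
#!/usr/bin/env bash
# Made-up sample lines: PID, %CPU, command.
sample=$'101 5.0 foo\n102 12.0 bar\n103 0.3 baz'

# Lexical reverse sort on field 2: "5.0" lands above "12.0",
# because the character 5 compares greater than the character 1.
lexical_top=$(printf '%s\n' "$sample" | sort -rk 2 | head -n 1)

# Numeric reverse sort: the line with 12.0 correctly comes first.
numeric_top=$(printf '%s\n' "$sample" | sort -nrk 2 | head -n 1)

echo "without -n: $lexical_top"
echo "with -n:    $numeric_top"
```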

Style:

  • You might want to move the CPU_USAGE_THRESHOLD=800 line to the beginning of the file, as this is the most informative setting and the one most likely to be changed even after your script is stable.
  • You are repeating the -e option: ps -eo pid -eo pcpu -eo command is the same as ps -eo pid -o pcpu -o command (as is ps -eo pid,pcpu,command).
  • There is an empty else clause. That always looks as if it should be handled, but was not for some unknown reason.
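Putting the points above together, the reporting part of the script might look roughly like this. This is a guess at the script's shape, not the original: report_top_cpu is a hypothetical helper, the message format is invented, and --no-headers assumes the procps version of ps. It only reports and does not kill anything.

```shell
#!/usr/bin/env bash
# Threshold first: it is the value most likely to be tuned later.
CPU_USAGE_THRESHOLD=800

# Hypothetical helper: read "PID %CPU COMMAND" lines on stdin and report
# the busiest process if it is above the threshold.
report_top_cpu() {
    local top_line pid cpu_usage command cpu_int
    # Numeric, reversed sort on the %CPU column; keep only the top line.
    top_line=$(sort -nrk 2 | head -n 1)
    read -r pid cpu_usage command <<<"$top_line"
    # %CPU is fractional (e.g. 12.5); bash arithmetic needs the integer part.
    cpu_int=${cpu_usage%.*}
    if (( ${cpu_int:-0} > CPU_USAGE_THRESHOLD )); then
        echo "CPU USAGE is at $cpu_usage (PID $pid: $command)"
    fi
}

# One -e and a single comma-separated -o list; --no-headers (procps ps)
# removes the need for a "grep -v PID" filter.
ps -eo pid,pcpu,command --no-headers | report_top_cpu
```

Reading from stdin keeps the helper easy to try out on canned input before pointing it at real ps output.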

Killing off the processes that use the most CPU or memory is asking for trouble: just look at what they are right now on your machine (here, currently: firefox, systemd (init), Xorg, gnome-terminal, a set of kernel threads, xemacs; none of which is dispensable). Look instead at how to tweak Linux's OOM killer.
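One such tweak, sketched under the assumption of a Linux system with /proc mounted: each process has an oom_score_adj value between -1000 and 1000, and a higher value makes the OOM killer more likely to pick that process. Raising your own value needs no special privileges (lowering it does). run_oom_first is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start a command as the OOM killer's preferred victim.
run_oom_first() {
    (
        # echo is a builtin, so /proc/self here is the subshell itself.
        # The setting survives the exec and is inherited by child processes.
        echo 1000 > /proc/self/oom_score_adj
        exec "$@"
    )
}

# Example: show the value the started command actually runs with.
run_oom_first cat /proc/self/oom_score_adj
```

That way the kernel, not a polling script, decides when memory is really exhausted, and the known troublemaker goes first instead of Xorg or your terminal.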

Also note that "memory used by the process" is a nebulous concept, since shared libraries, executables, and even parts of data areas are shared between processes. One can come up with some number by charging each user a fraction of the shared space, but even adding those numbers up doesn't really give "memory used" (and even less "memory freed if the process goes away", since the shared parts stay behind).