Error on SLURM cluster - "Detected 1 oom-kill event(s)": how can I improve my jobs so they run?

Here OOM stands for "Out of Memory". When Linux runs low on memory, it will "oom-kill" a process to keep critical processes running. It looks like slurmstepd detected that your process was oom-killed. Oracle has a nice explanation of this mechanism.

If you had requested more memory than your allocation allows, the job would never have been scheduled on a node and the computation would not have started. Since it did start and was then killed, your job used more memory than it requested: you need to request more memory.
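To confirm that memory was the issue and see how much your job actually used, you can query SLURM's accounting database once the job has finished (sacct is standard SLURM; the exact fields available and whether the OUT_OF_MEMORY state is reported depend on your cluster's version and configuration):

sacct -j <jobid> --format=JobID,JobName,State,ReqMem,MaxRSS,Elapsed

Compare MaxRSS (the peak memory a step actually used) with ReqMem (what was requested); if MaxRSS is at or near ReqMem, or the State column shows OUT_OF_MEMORY, the job needs a larger memory request.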


The accepted answer is correct, but to be more precise, the error

slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup.

indicates that the job ran out of system RAM (CPU memory) on the node.

If, for instance, you were running a computation on a GPU and tried to allocate more GPU memory than is available, you would instead get an error like this (example from PyTorch):

RuntimeError: CUDA out of memory. Tried to allocate 8.94 GiB (GPU 0; 15.90 GiB total capacity; 8.94 GiB already allocated; 6.34 GiB free; 0 bytes cached)

Check out the explanation in this article for more details.
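If you are unsure which kind of memory you are exhausting, you can watch GPU memory on the node while the job is running (assuming nvidia-smi is available there, e.g. from an interactive shell inside the allocation):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv

If GPU memory is far from full while the job still gets oom-killed by slurmstepd, the problem is host RAM and the --mem-per-cpu fix below applies.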

Solution: increase the --mem-per-cpu parameter in your script, or add it if it is not set.

1) If you run your script with sbatch (sbatch your_script.sh), add the following line to it (see the full example script after this list):

#SBATCH --mem-per-cpu=<value bigger than you've requested before>

2) If you run your script with srun (srun python3 your_script.py), add the parameter on the command line:

srun --mem-per-cpu=<value bigger than you've requested before> python3 your_script.py
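For example, a minimal batch script with an increased memory request might look like the sketch below (the 8G value, job name, and other directives are placeholders; adjust them to your cluster and workload). Note that SLURM interprets a bare number as megabytes, so use a K, M, G, or T suffix to be explicit:

#!/bin/bash
# Request one task with one CPU; that CPU gets 8 GB of RAM.
#SBATCH --job-name=my_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8G

python3 your_script.py

The equivalent srun call would be srun --cpus-per-task=1 --mem-per-cpu=8G python3 your_script.py. Keep in mind that --mem-per-cpu is multiplied by the number of CPUs you request, so raising --cpus-per-task also raises the total memory allocated to the job.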