Please help interpret OOM-Killer

Solution 1:

Out of memory.

Dec 18 23:24:59 ip-10-0-3-36 kernel: [ 775.566936] Out of memory: Kill process 4973 (java) score 0 or sacrifice child

From the task dump (the ps-like process list) in the same log:

[ 775.561798] [ 4973] 500 4973 4295425981 2435 71 5 0 0 java

The total_vm column (4295425981) is counted in pages, not kB. With 4 KiB pages that is 17181703924 kB, which matches the total-vm:17181703924kB line: around 17 TB of virtual memory.
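As a quick sanity check on the units, here is a minimal sketch of the conversion. It assumes the total_vm column of the task dump is in pages and that the page size is 4 KiB (the x86-64 default); `pages_to_kb` is just an illustrative helper name.

```python
PAGE_SIZE_KB = 4  # assumption: 4 KiB pages, the x86-64 default


def pages_to_kb(pages: int) -> int:
    """Convert a page count from the OOM task dump to kB."""
    return pages * PAGE_SIZE_KB


total_vm_pages = 4295425981  # total_vm column for PID 4973 in the dump above
total_vm_kb = pages_to_kb(total_vm_pages)

# Compare this figure with the total-vm value in the kill message.
print(total_vm_kb, "kB")
print(round(total_vm_kb / 1e9, 2), "TB (decimal)")
print(round(total_vm_kb / 2**30, 2), "TiB (binary)")
```

The same conversion applies to the rss column next to it, which is also reported in pages.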

Can you debug your memory allocation routine? It looks to me as if your application has a runaway loop somewhere that consumes all the available memory, and all the available swap too.

Solution 2:

Dec 18 23:24:59 ip-10-0-3-36 kernel: [  775.214705]  shmem_fallocate+0x32d/0x440
Dec 18 23:24:59 ip-10-0-3-36 kernel: [  775.217182]  vfs_fallocate+0x13f/0x260
Dec 18 23:24:59 ip-10-0-3-36 kernel: [  775.219525]  SyS_fallocate+0x43/0x80
Dec 18 23:24:59 ip-10-0-3-36 kernel: [  775.221657]  do_syscall_64+0x67/0x100

Your application process is trying to invoke fallocate on a shmem filesystem. From quick googling it looks like ZGC uses fallocate to grab its initial heap memory from the shm filesystem and then keeps using fallocate to expand the heap. Such use of the fallocate syscall is rather unusual, so either this is a ZGC bug (as you already suspected) or something else is leaking lots of memory, which causes heap expansion to fail.

I suggest that you configure ZGC to avoid additional runtime allocations (set -Xms and -Xmx to the same value). This might not solve your problem if the memory leak is caused by something unrelated, but at least you would have a better chance of finding the real culprit.
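One way to confirm that a launch command really pins the heap is to compare the two flags. The following is only a sketch: `heap_flag_bytes` and `heap_is_fixed` are hypothetical helper names, the example command line (including app.jar) is made up for illustration, and only the k/m/g size suffixes are handled.

```python
import re

# Multipliers for the size suffixes handled by this sketch.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def heap_flag_bytes(args, flag):
    """Return the size in bytes of the last occurrence of flag, or None."""
    pattern = re.compile(re.escape(flag) + r"(\d+)([kKmMgG]?)")
    value = None
    for arg in args:
        match = pattern.fullmatch(arg)
        if match:
            # Keep the last occurrence found on the command line.
            value = int(match.group(1)) * _UNITS[match.group(2).lower()]
    return value


def heap_is_fixed(args):
    """True when both -Xms and -Xmx are present and equal."""
    xms = heap_flag_bytes(args, "-Xms")
    xmx = heap_flag_bytes(args, "-Xmx")
    return xms is not None and xms == xmx


# Hypothetical command line, for illustration only.
cmd = ["java", "-XX:+UseZGC", "-Xms190g", "-Xmx190g", "-jar", "app.jar"]
print(heap_is_fixed(cmd))
```

With equal values the JVM should not need to grow the heap at runtime, so any later fallocate failure points away from ordinary heap expansion.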

Note that your overall setup is somewhat dangerous: ZGC apparently likes to have a lot of contiguous memory, but with a 190G heap on a machine with 240G of RAM, there might not be a sufficiently large contiguous region to fallocate from. In that case ZGC will fall back to picking up small memory regions with further fallocate calls (see the description of the linked bug report), and the issue will get obscured again. Enable hugepages support in the JVM (normal hugepages, not transparent hugepages!) and preallocate the hugepages during boot (with a kernel argument). Using hugepages is advisable for heap sizes like yours anyway.
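To size the hugepage pool, a rough calculation like the one below may help. It is a sketch under stated assumptions: 2 MiB huge pages (the common x86-64 size), a pool that covers the 190G heap and nothing else, and a hypothetical helper name `hugepages_needed`. The JVM may need some headroom beyond the heap, so treat the result as a lower bound.

```python
def hugepages_needed(heap_gib: int, hugepage_mib: int = 2) -> int:
    """Number of huge pages required to back a heap of heap_gib GiB."""
    heap_mib = heap_gib * 1024
    # Ceiling division, so the pool is never smaller than the heap.
    return -(-heap_mib // hugepage_mib)


heap_gib = 190  # heap size mentioned above
count = hugepages_needed(heap_gib)

# Kernel command-line fragment for boot-time preallocation; check which
# huge page sizes your CPU and kernel support before using it.
print(f"hugepagesz=2M hugepages={count}")
```

On the JVM side, explicit huge pages are requested with -XX:+UseLargePages, as opposed to -XX:+UseTransparentHugePages for the transparent variant.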