Possible Memory Leak in Ignite DataStreamer

TLDR

Set walSegmentSize to 64MB (or simply remove the setting and use the default) AND set -XX:MaxDirectMemorySize=<walSegmentSize * 4>.

Explanation

One thing people often forget when calculating Ignite's memory needs is the size of direct memory buffers.

Direct memory buffers are JVM-managed buffers allocated from a separate space in the Java process - neither the Java heap, nor an Ignite data region, nor the Ignite checkpoint buffer.

Direct memory buffers are the normal way of interacting with non-heap memory in Java. Many things use them (from the JVM's internal code to applications), but on Ignite servers the main user of the direct memory pool is the write-ahead log (WAL).
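
To make the term concrete, here is a minimal Scala sketch (plain NIO, not Ignite code) of what a direct buffer is:

    import java.nio.ByteBuffer

    // The 1MB of storage below lives outside the Java heap,
    // in the JVM's direct memory pool.
    val buf = ByteBuffer.allocateDirect(1024 * 1024)

    println(buf.isDirect)   // true
    println(buf.capacity()) // 1048576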

By default, Ignite writes to the WAL through a memory-mapped file, which works via a direct memory buffer. The size of that buffer is the size of a WAL segment. And here we get to the fun part.
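
For illustration, a sketch of what "memory-mapped file" means at the NIO level - the file name and size here are made up, but the mechanism is the same one the WAL relies on, and the resulting buffer is itself a direct buffer sized like the segment:

    import java.nio.channels.FileChannel
    import java.nio.file.{Paths, StandardOpenOption}

    // Map a (hypothetical) segment-sized file into memory.
    val ch = FileChannel.open(Paths.get("wal-segment-0.bin"),
      StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)
    val segment = ch.map(FileChannel.MapMode.READ_WRITE, 0, 64L * 1024 * 1024) // 64MB

    println(segment.isDirect) // true: the mapping lives outside the heap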

Your WAL segments are huge! 2GB is A LOT. The default is 64MB, and I've rarely seen an environment that needs more than that. For some specific workloads and some specific disks we would recommend setting 256MB.

So, you have 2GB buffers being created in the direct memory pool. The maximum size of direct memory defaults to the value of -Xmx - in your case, 24GB. I can see a scenario where your direct memory pool bloats to 24GB (from not-yet-freed old buffers), making the total size of your application at least 20 + 2 + 24 + 24 = 70GB!
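
If you want to see that cap in action, here is a hypothetical stand-alone snippet (not from your cluster) that keeps allocating direct buffers until the JVM refuses; run it with a small limit such as -XX:MaxDirectMemorySize=256m:

    import java.nio.ByteBuffer
    import scala.collection.mutable.ArrayBuffer

    object DirectPoolCap extends App {
      // Keep references so the buffers can't be garbage-collected.
      val held = ArrayBuffer.empty[ByteBuffer]
      try {
        while (true) held += ByteBuffer.allocateDirect(64 * 1024 * 1024) // 64MB each
      } catch {
        case oom: OutOfMemoryError =>
          // The pool is bounded by -XX:MaxDirectMemorySize (roughly -Xmx when unset).
          println(s"Direct memory exhausted after ${held.size} x 64MB buffers: $oom")
      }
    }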

That scenario explains the 40GB of internal JVM memory (I think that's the data region + direct memory). It also explains why you don't see the issue when persistence is off - there is no WAL in that case.

What to do

  1. Choose a sane walSegmentSize. I don't know the reason behind the 2GB choice, but I would recommend going either with the default of 64MB, or with 256MB if you're sure you've had issues with small WAL segments.

  2. Set a limit on the JVM's direct memory pool via -XX:MaxDirectMemorySize=<size>. I find it a safe choice to set it to walSegmentSize * 4, i.e. somewhere in the range of 256MB-1GB. A configuration sketch follows this list.
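
Putting both together, a sketch of what the change could look like with Ignite's programmatic configuration (the values are the ones suggested above; adapt them to your setup):

    import org.apache.ignite.Ignition
    import org.apache.ignite.configuration.{DataStorageConfiguration, IgniteConfiguration}

    // walSegmentSize is set in bytes; 64 * 1024 * 1024 restores the 64MB default.
    val storageCfg = new DataStorageConfiguration()
      .setWalSegmentSize(64 * 1024 * 1024)

    val cfg = new IgniteConfiguration()
      .setDataStorageConfiguration(storageCfg)

    val ignite = Ignition.start(cfg)

Then start the node's JVM with the matching flag, e.g. -XX:MaxDirectMemorySize=256m (walSegmentSize * 4).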

Even if you still see memory consumption issues after making the above changes, keep them anyway - they are the right choice for 99% of clusters.


The memory leak seems to be triggered by the @QueryTextField annotation on the value object in my cache model, which enables Lucene full-text queries in Ignite.

Originally: case class Value(@(QueryTextField@field) theta: String)

Changing this line to: case class Value(theta: String) seems to solve the problem. I don't have an explanation as to why this works, but maybe somebody with a good understanding of the Ignite code base can explain why.
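
For completeness, the two variants side by side, assuming the standard imports for the Ignite annotation and Scala's @field meta-annotation (the second class is renamed here only so both compile in one file):

    import org.apache.ignite.cache.query.annotations.QueryTextField
    import scala.annotation.meta.field

    // Original: the annotation tells Ignite to index `theta` with Lucene
    // for full-text queries - this is the variant that showed the leak.
    case class Value(@(QueryTextField @field) theta: String)

    // Changed: no full-text indexing, and the leak no longer appears.
    case class PlainValue(theta: String)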