Deleting large hashmaps with millions of strings on one thread affects performance on another thread

It might be worthwhile to store just a single std::string for all your data combined, and use std::string_view in the map. This eliminates mutex contention as there's only one memory allocation needed. string_view has a trivial destructor so you don't need a thread for that.

I've successfully used this technique before to speed up a program by 2500%, but that was also because this technique reduced the total memory usage.
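A minimal sketch of this idea, assuming C++17 and int values (the class name InternedKeys and its members are hypothetical, not from the question). The buffer is reserved up front because any later reallocation would invalidate every string_view:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical container: one std::string owns every character,
// the map only stores non-owning views into that buffer.
class InternedKeys {
public:
    explicit InternedKeys(const std::vector<std::pair<std::string, int>>& items) {
        // Size the buffer once so it never reallocates while views exist.
        std::size_t total = 0;
        for (const auto& item : items) total += item.first.size();
        storage_.reserve(total);
        map_.reserve(items.size());

        for (const auto& [key, value] : items) {
            const std::size_t offset = storage_.size();
            storage_.append(key);
            map_.emplace(std::string_view(storage_.data() + offset, key.size()), value);
        }
    }

    // Copying or moving the buffer could leave the views dangling, so forbid both.
    InternedKeys(const InternedKeys&) = delete;
    InternedKeys& operator=(const InternedKeys&) = delete;

    // Returns a pointer to the value, or nullptr if the key is absent.
    const int* find(std::string_view key) const {
        const auto it = map_.find(key);
        return it == map_.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return map_.size(); }

private:
    std::string storage_;                            // single allocation for all characters
    std::unordered_map<std::string_view, int> map_;  // views into storage_
};
```

Note that the map itself still allocates its own nodes; what goes away is the per-string allocation and the per-string destructor work.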


You can try using a std::vector for storing the memory. std::vector elements are stored contiguously, so it will reduce cache misses (see What is a "cache-friendly" code?)

So instead of a map<???,std::string> you will have a map<???,size_t>. You will have one more indirection to get your string (which means an extra run-time cost), but it allows you to iterate over all the strings with far fewer cache misses.
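Something along these lines, as a rough sketch. The key type is left open above, so the int key here is purely an assumption, as are the names IndexedStrings and total_length:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical layout: strings live contiguously in a vector,
// the map only stores the position of each string.
struct IndexedStrings {
    std::vector<std::string> strings;            // contiguous string objects
    std::unordered_map<int, std::size_t> index;  // assumed int key -> position in `strings`

    void add(int key, std::string value) {
        const auto [it, inserted] = index.try_emplace(key, strings.size());
        if (inserted) {
            strings.push_back(std::move(value));
        } else {
            strings[it->second] = std::move(value);  // overwrite existing entry
        }
    }

    // One extra indirection: key -> index -> string.
    const std::string* find(int key) const {
        const auto it = index.find(key);
        return it == index.end() ? nullptr : &strings[it->second];
    }
};

// Iterating over all strings walks the vector only and never touches the map.
std::size_t total_length(const IndexedStrings& data) {
    std::size_t total = 0;
    for (const auto& s : data.strings) total += s.size();
    return total;
}
```

Keep in mind that only the std::string objects are contiguous here; strings too long for the small-string optimization still keep their characters in separate heap blocks.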


It would be great if you could reproduce the problem you are encountering with an MVCE and post it: quite often, the problem you think you have... is not the actual problem.

How can I find for sure the above 2 memory issues are the cause (any tools/metrics?)

Given the information here, I would suggest using a profiler: gprof (compile with -g -pg) is the basic one. If you have the Intel compiler available, you can use vtune.

There is a free version of vtune but I have personally used the commercial version only.

Besides this, you can insert timings in your code: from the textual description, it's not clear whether the time to populate the map is comparable to the time needed to erase it, or whether it grows consistently when run concurrently. I would start with that. Note that the current version of malloc() is heavily optimized for concurrency too (is this Linux? Please add a tag to the question).
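For the timings, a small helper around std::chrono::steady_clock is usually enough. This is only a sketch: the map type is assumed to be map<string,set<int>>, and build_index / erase_index are hypothetical stand-ins for your real populate and erase steps:

```cpp
#include <chrono>
#include <map>
#include <set>
#include <string>

using Index = std::map<std::string, std::set<int>>;  // assumed map type

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <typename F>
double seconds_taken(F work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Hypothetical populate step: one key per iteration, one int per key.
Index build_index(int keys) {
    Index index;
    for (int i = 0; i < keys; ++i) {
        index["key_" + std::to_string(i)].insert(i);
    }
    return index;
}

// Hypothetical erase step: swapping with an empty map destroys all nodes here.
void erase_index(Index& index) {
    Index().swap(index);
}
```

Measure seconds_taken([&] { index = build_index(n); }) and seconds_taken([&] { erase_index(index); }) first in isolation and then while the other thread is running; comparing the two pairs of numbers tells you whether the slowdown really comes from the concurrent erase.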

Certainly, when you erase the map, millions of free() calls are made by the std::string destructors, but you need to verify whether this is actually the problem. If it is, you can use a better approach (many are mentioned in the answers/comments) or a custom allocator backed by a huge memory block that you create/destroy as a single unit.
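One way to get such an allocator without writing it yourself is the C++17 std::pmr facilities. The sketch below assumes a compiler and standard library with <memory_resource> support; ArenaIndex and its add method are hypothetical names:

```cpp
#include <cstddef>
#include <map>
#include <memory_resource>
#include <set>
#include <string>
#include <string_view>

// Hypothetical wrapper: every string, map node and set node is carved out of
// one arena that is released as a whole when the object is destroyed.
struct ArenaIndex {
    // Declared first so it outlives `data` during destruction.
    std::pmr::monotonic_buffer_resource arena;
    std::pmr::map<std::pmr::string, std::pmr::set<int>> data;

    explicit ArenaIndex(std::size_t initial_bytes)
        : arena(initial_bytes), data(&arena) {}

    void add(std::string_view key, int value) {
        // The key is built directly in the arena; the set created by
        // operator[] picks up the same allocator from the map.
        data[std::pmr::string(key.data(), key.size(), &arena)].insert(value);
    }
};
```

With a monotonic resource, deallocate is a no-op and the memory goes back upstream in a few large blocks when the arena is destroyed. The map destructor still walks its nodes, but the millions of individual free() calls are gone. The trade-off is that memory is never reused while the arena is alive, so this fits a build-once, destroy-once structure.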

If you provide an MVCE as a starting point, I or others will be able to provide a consistent answer (this is not an answer, yet - but too long to be a comment)

Just to clarify, the program deliberately never allocates stuff and frees others at the same time, and it only has 2 threads, one dedicated for just deletion.

Keep in mind that each string in the map needs one (or more) new and one delete (based on malloc() and free() respectively), whether the strings are in the keys or in the values.

What do you have in the "values" of the map?

Since you have a map<string,set<int>>, you have many allocations: every time you perform a map[string].insert(val) with a new key, your code implicitly calls malloc() for both the string and the set. Even if the key is already in the map, a new int in the set requires a new node in the set to be allocated.

So you have a great many allocations while building the structure: on one side your memory is very fragmented, and on the other your code seems really "malloc intensive", which could in principle cause the memory calls to starve.

Multithreaded memory allocations/deallocations

One peculiarity of modern memory subsystems is that they are optimized for multi-core systems: when one thread allocates memory on one core, there is no global lock, but rather a thread-local or core-local lock for a thread-local pool.

This means that when one thread needs to free the memory allocated by another one, there is a non-local (slower) lock involved.

This means that the best approach is for each thread to allocate and deallocate its own memory. That said, you can in principle optimize your code a lot with data structures that require fewer malloc/free calls; beyond that, your code will be more local with respect to memory allocations if you let each thread:

  • get one block of data
  • build the map<string,set<int>>
  • free it

And you have two threads that repeatedly perform this task.
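The scheme above could be sketched roughly like this. make_block is a hypothetical stand-in for "get one block of data", and the key/value layout is invented for illustration only:

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using Index = std::map<std::string, std::set<int>>;

// Hypothetical data source: `n` (key, value) pairs, two values per key.
std::vector<std::pair<std::string, int>> make_block(int block_id, int n) {
    std::vector<std::pair<std::string, int>> block;
    block.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        block.emplace_back(
            "b" + std::to_string(block_id) + "_k" + std::to_string(i / 2), i);
    }
    return block;
}

// Get a block, build the map, free it: everything happens on the calling
// thread, so memory is released by the same thread that allocated it.
std::size_t process_block(int block_id, int n) {
    Index index;
    for (const auto& [key, value] : make_block(block_id, n)) {
        index[key].insert(value);
    }
    return index.size();  // stand-in for whatever you compute from the map
}  // `index` is destroyed here, on the building thread

std::vector<std::size_t> run_workers(int threads, int blocks_per_thread, int n) {
    std::vector<std::size_t> results(static_cast<std::size_t>(threads), 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&results, t, blocks_per_thread, n] {
            for (int b = 0; b < blocks_per_thread; ++b) {
                results[t] += process_block(t * blocks_per_thread + b, n);
            }
        });
    }
    for (auto& worker : pool) worker.join();
    return results;
}
```

Each worker writes only to its own slot in results, so no locking is needed there.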

NOTE: you need enough RAM to handle concurrent evaluators, but you are already using 2 of them, concurrently loaded with a double buffering scheme (one filling, one cleaning). Are you sure your system is not swapping because of RAM exhaustion?

Furthermore, this approach is scalable: you can use as many threads as you want. In your approach you were limited to 2 threads - one building the structure, one destroying it.

Optimizing

Without an MVCE it's hard to give directions. Here are just some ideas; only you can tell whether they can be applied right now:

  • replace the set with sorted vector, reserved at creation time
  • replace the map keys with a flat vector of equally spaced, sorted strings
  • store the string keys sequentially in a flat vector, add hashes to keep track of the keys of the map. Add a hash map to keep track of the ordering of the strings in the vector.
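As an illustration of the first idea (a sorted vector in place of the set), here is a minimal sketch. SortedIntSet is a hypothetical name, and the expected size passed to the constructor is something you would have to estimate from your data:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical replacement for std::set<int>: one contiguous allocation
// instead of one heap node per element.
class SortedIntSet {
public:
    explicit SortedIntSet(std::size_t expected) { values_.reserve(expected); }

    // Inserts `value` at its sorted position; returns false if already present.
    bool insert(int value) {
        const auto pos = std::lower_bound(values_.begin(), values_.end(), value);
        if (pos != values_.end() && *pos == value) return false;
        values_.insert(pos, value);
        return true;
    }

    bool contains(int value) const {
        return std::binary_search(values_.begin(), values_.end(), value);
    }

    std::size_t size() const { return values_.size(); }

private:
    std::vector<int> values_;  // kept sorted and duplicate-free
};
```

Insertion in the middle is O(n) because elements have to shift, so this pays off mainly when the sets are small or the values arrive mostly in order. Destroying it is a single deallocation.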