Why does my Linux system stutter unless I continuously drop caches?

It sounds like you've already tried many of the things I would have suggested at first (tweaking swap configuration, changing I/O schedulers, etc.).

Beyond that, I would suggest looking into the somewhat brain-dead defaults for the VM writeback behavior. It is managed by the following six sysctl values (a quick way to inspect them is sketched right after the list):

  • vm.dirty_ratio: Controls how much dirty data can be outstanding before writeback is forced. Handles foreground (per-process) writeback, and is expressed as an integer percentage of RAM. Defaults to 20% of RAM.
  • vm.dirty_background_ratio: Controls how much dirty data can be outstanding before writeback starts. Handles background (system-wide) writeback, and is expressed as an integer percentage of RAM. Defaults to 10% of RAM.
  • vm.dirty_bytes: Same as vm.dirty_ratio, except expressed as a total number of bytes. Either this or vm.dirty_ratio will be used, whichever was written to last.
  • vm.dirty_background_bytes: Same as vm.dirty_background_ratio, except expressed as a total number of bytes. Either this or vm.dirty_background_ratio will be used, whichever was written to last.
  • vm.dirty_expire_centisecs: How long (in hundredths of a second) data may stay dirty before the periodic writeback threads will write it out, even when the above four values would not otherwise trigger writeback. Defaults to 3000 (thirty seconds).
  • vm.dirty_writeback_centisecs: How often (in hundredths of a second) the kernel's writeback threads wake up to check for dirty pages to write out. Defaults to 500 (five seconds).
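
If you want to check what your system is actually using, the current values can be read with sysctl (or straight from the matching files under /proc/sys/vm). A minimal sketch:

    # Print the current values of the six writeback-related sysctls.
    sysctl vm.dirty_ratio vm.dirty_background_ratio \
           vm.dirty_bytes vm.dirty_background_bytes \
           vm.dirty_expire_centisecs vm.dirty_writeback_centisecs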

So, with the default values, the kernel will do the following:

  • Every five seconds, the writeback threads wake up and write out any pages that were last modified more than thirty seconds ago.
  • Once the total amount of modified memory that hasn't been written out exceeds 10% of RAM, start writing it out in the background.
  • Once it exceeds 20% of RAM, force the processes that are dirtying pages to write the data out themselves, effectively stalling them until the writes finish.

So, it should be pretty easy to see why the default values may be causing issues for you: on a machine with a decent amount of RAM, the kernel will happily let several gigabytes of dirty data pile up and then try to push it all out to persistent storage in one burst, and everything else that needs the disk stalls behind that flush.
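
You can watch this happening for yourself: the Dirty and Writeback lines in /proc/meminfo show how much modified data is currently waiting to be (or being) written out. Something along these lines works; the 2-second interval is arbitrary:

    # Watch dirty data accumulate and then get flushed in bursts.
    watch -n 2 'grep -E "^(Dirty|Writeback):" /proc/meminfo'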

The general consensus these days is to lower these dramatically, for example setting vm.dirty_background_ratio to 1% of RAM and vm.dirty_ratio to 2%, which for systems with less than about 64GB of RAM results in behavior much closer to what the original defaults were intended to provide.
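
Applying that persistently is just a drop-in sysctl file; a minimal sketch (the file name is an arbitrary choice, and the numbers are starting points to tune rather than gospel):

    # /etc/sysctl.d/90-writeback.conf -- example values, tune for your workload.
    # Start background writeback once dirty data reaches 1% of RAM,
    # and throttle writers once it reaches 2%.
    vm.dirty_background_ratio = 1
    vm.dirty_ratio = 2

Reload with sysctl --system (or reboot). On machines with very large amounts of RAM, the *_bytes variants described above let you express the same limits as absolute sizes instead.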

Some other things to look into (a combined sketch of these settings follows the list):

  • Try increasing the vm.vfs_cache_pressure sysctl a bit. This controls how aggressively the kernel reclaims the dentry and inode caches when it needs RAM. The default is 100; don't lower it below 50 (you will get really bad behavior, up to and including OOM conditions), and don't raise it much above about 200 (beyond that, the kernel will waste time trying to reclaim memory it really can't). I've found that bumping it up to 150 actually visibly improves responsiveness if you have reasonably fast storage.
  • Try changing the memory overcommit mode. This can be done by altering the value of the vm.overcommit_memory sysctl. By default, the kernel will use a heuristic approach to try and predict how much RAM it can actually afford to commit. Setting this to 1 disables the heuristic and tells the kernel to act like it has infinite memory. Setting this to 2 tells the kernel to not commit to more memory than the total amount of swap space on the system plus a percentage of actual RAM (controlled by vm.overcommit_ratio).
  • Try tweaking the vm.page-cluster sysctl. This controls how many pages get swapped in or out at a time (it's a base-2 logarithmic value, so the default of 3 translates to 8 pages). If you're actually swapping, this can help improve the performance of swapping pages in and out.
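
Here is a combined sketch of those three knobs. The values are purely illustrative starting points, and the overcommit and page-cluster lines are commented out because the right choice depends entirely on your workload:

    # /etc/sysctl.d/90-vm-tuning.conf -- illustrative starting points, not recommendations.

    # Reclaim dentry/inode cache entries a bit more aggressively (default: 100).
    vm.vfs_cache_pressure = 150

    # Overcommit handling: 0 = heuristic (default), 1 = always allow,
    # 2 = never commit more than swap + vm.overcommit_ratio percent of RAM.
    # vm.overcommit_memory = 2
    # vm.overcommit_ratio = 50

    # Pages swapped per I/O is 2^value; the kernel default of 3 means 8 pages.
    # vm.page-cluster = 3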

The issue has been found!

It turns out it's a performance issue in Linux's memory reclaimer when there are a large number of containers/memory cgroups. (Disclaimer: my explanation might be flawed; I'm not a kernel dev.) The issue has been fixed in 4.19-rc1+ by this patch set:

This patchset solves the problem of slow shrink_slab() occurring on machines with many shrinkers and memory cgroups (i.e., with many containers). The problem is that the complexity of shrink_slab() is O(n^2), so it grows far too quickly as the number of containers grows.

Let us have 200 containers, and every container has 10 mounts and 10 cgroups. All container tasks are isolated, and they don't touch other containers' mounts.

In the case of global reclaim, a task has to iterate over all the memcgs and call all the memcg-aware shrinkers for each of them. This means the task has to visit 200 * 10 = 2000 shrinkers for every memcg, and since there are 200 * 10 = 2000 memcgs, the total number of do_shrink_slab() calls is 2000 * 2000 = 4,000,000.

My system was hit particularly hard, as I run a good number of containers, which is likely what was causing the issue to appear.
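
If you want a rough feel for whether your own system is in the same territory, counting memory cgroups and mounts is enough. This sketch assumes cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory (adjust the path if you're on cgroup v2):

    # Rough count of memory cgroups (each directory under the memory controller is one memcg).
    find /sys/fs/cgroup/memory -type d | wc -l

    # Rough count of mounts; each mounted filesystem brings its own superblock shrinker.
    wc -l < /proc/self/mountinfo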

My troubleshooting steps, in case they are helpful to anyone facing similar issues:

  1. Notice kswapd0 using a ton of CPU when my computer stutters
  2. Try stopping Docker containers and filling memory again → the computer doesn't stutter!
  3. Run ftrace (following Julia Evans's magnificent blog post explaining it) to get a trace, and see that kswapd0 tends to get stuck in shrink_slab, super_cache_count, and list_lru_count_one (a rough sketch of this kind of profiling follows the list).
  4. Google shrink_slab lru slow, find the patchset!
  5. Switch to Linux 4.19-rc3 and verify the issue is fixed.
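
For step 3, Julia Evans's post covers the full ftrace workflow far better than I could here, but as a rough sketch of the general idea, sampling kswapd0 with perf while the stutter is happening will also show where it spends its time (the 30-second window is arbitrary):

    # Sample kswapd0's call stacks for 30 seconds while the stutter is happening.
    perf record -g -p "$(pgrep -x kswapd0)" -- sleep 30

    # Summarise where the time went; expect shrink_slab / super_cache_count near the top.
    perf report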