Does Linux perform "opportunistic swapping", or is it a myth?

Linux does not do "opportunistic swapping" as defined in this question.


The following primary references do not mention the concept at all:

  1. Understanding the Linux Virtual Memory Manager. An online book by Mel Gorman. Written in 2003, just before the release of Linux 2.6.0.
  2. Documentation/admin-guide/sysctl/vm.rst. This is the primary documentation of the tunable settings of Linux virtual memory management.

More specifically:

10.6 Pageout Daemon (kswapd)

Historically kswapd used to wake up every 10 seconds but now it is only woken by the physical page allocator when the pages_low number of free pages in a zone is reached. [...] Under extreme memory pressure, processes will do the work of kswapd synchronously. [...] kswapd keeps freeing pages until the pages_high watermark is reached.

Based on the above, we would not expect any swapping when the number of free pages is higher than the "high watermark".

Secondly, this tells us the purpose of kswapd is to make more free pages.

When kswapd writes a memory page to swap, it immediately frees the memory page. kswapd does not keep a copy of the swapped page in memory.

Linux 2.6 uses reverse mapping ("rmap") to find every user of a page, so the page can be freed immediately. In Linux 2.4, the story was more complex: when a page was shared by multiple processes, kswapd was not able to free it immediately. But this is ancient history; all of the linked posts are about Linux 2.6 or above.

swappiness

This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase aggressiveness, lower values decrease the amount of swap. A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.

This quote describes a special case: configuring the swappiness value to 0. In that case, we should additionally not expect any swapping until the number of cache pages has fallen to the high watermark. In other words, the kernel will try to discard almost all file cache before it starts swapping. (This might cause massive slowdowns. You need to have some file cache! The file cache is used to hold the code of all your running programs :-)
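One way to see why swappiness works like this: in the reclaim code, swappiness sets the relative scan pressure applied to anonymous (swap-backed) pages versus file-backed pages. A minimal sketch of the historical logic in get_scan_count() — simplified, since the real code also weighs recent scan/rotate statistics, and newer kernels allow values up to 200:

```python
# Hedged sketch of how get_scan_count() historically used swappiness
# to bias reclaim between anonymous (swap-backed) and file-backed pages.
# (Simplified: the real code also factors in recent scan/rotate ratios.)
def scan_priorities(swappiness):
    anon_prio = swappiness          # pressure on anonymous pages (swap)
    file_prio = 200 - swappiness    # pressure on file cache
    return anon_prio, file_prio

print(scan_priorities(60))   # default: file cache reclaimed in preference to swap
print(scan_priorities(0))    # zero anon pressure: swap only as a last resort
```

With swappiness at 0 the anonymous priority is zero, which matches the quoted behaviour: the kernel prefers to discard file cache rather than swap.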

What are the watermarks?

The above quotes raise the question: How large are the "watermark" memory reservations on my system? Answer: on a "small" system, the default zone watermarks might be as high as 3% of memory. This is due to the calculation of the "min" watermark. On larger systems the watermarks will be a smaller proportion, approaching 0.3% of memory.

So if the question is about a system with more than 10% free memory, the exact details of this watermark logic are not significant.
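For reference, the "min" watermark is derived from vm.min_free_kbytes, which by default scales with the square root of low memory. A rough sketch, modeled on init_per_zone_wmark_min() in mm/page_alloc.c (the exact clamp bounds have varied between kernel versions; older kernels used 128..65536):

```python
# Hedged sketch of the default vm.min_free_kbytes calculation
# (based on init_per_zone_wmark_min() in mm/page_alloc.c; the
# clamp bounds have changed between kernel versions).
import math

def min_free_kbytes(lowmem_kbytes):
    # min_free_kbytes scales with the square root of low memory:
    # sqrt(lowmem_kbytes * 16), clamped here to [128, 65536].
    new = math.isqrt(lowmem_kbytes * 16)
    return max(128, min(new, 65536))

for mem_mib in (16, 128, 4096):
    kb = mem_mib * 1024
    mfk = min_free_kbytes(kb)
    print(f"{mem_mib:5d} MiB -> min_free_kbytes={mfk:5d} ({100 * mfk / kb:.2f}% of memory)")
```

Because of the square root, the reservation is a large fraction of a small machine (about 3% of a 16 MiB zone) but a small fraction of a big one, which is the "small system" effect described above.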

The watermarks for each individual "zone" are shown in /proc/zoneinfo, as documented in proc(5). An extract from my zoneinfo:

Node 0, zone    DMA32
  pages free     304988
        min      7250
        low      9062
        high     10874
        spanned  1044480
        present  888973
        managed  872457
        protection: (0, 0, 4424, 4424, 4424)
...
Node 0, zone   Normal
  pages free     11977
        min      9611
        low      12013
        high     14415
        spanned  1173504
        present  1173504
        managed  1134236
        protection: (0, 0, 0, 0, 0)

The current "watermarks" are min, low, and high. If a program ever asks for enough memory to reduce free below min, the program enters "direct reclaim". The program is made to wait while the kernel frees up memory.

We want to avoid direct reclaim if possible. So if free would dip below the low watermark, the kernel wakes kswapd. kswapd frees memory by swapping and/or dropping caches, until free is above high again.
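The min/low/high behaviour described above can be summarised in a few lines. This is a simplified sketch, fed with the Normal-zone numbers from the /proc/zoneinfo extract above:

```python
# Hedged sketch of the watermark logic described above: min triggers
# direct reclaim, low wakes kswapd, and kswapd reclaims until high.
def watermark_state(free, min_wm, low_wm, high_wm):
    if free <= min_wm:
        return "direct reclaim: the allocating process must free pages itself"
    if free <= low_wm:
        return "wake kswapd: reclaim in the background until free > high"
    if free <= high_wm:
        return "kswapd (if already awake) keeps reclaiming"
    return "no reclaim needed"

# Numbers from the Node 0 "Normal" zone in the /proc/zoneinfo extract:
print(watermark_state(free=11977, min_wm=9611, low_wm=12013, high_wm=14415))
```

Note that in the extract above, the Normal zone's free count (11977) is already below its low watermark (12013), so kswapd would be woken on that system.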


Additional qualification: kswapd will also run to protect the full lowmem_reserve amount, for kernel lowmem and DMA usage. The default lowmem_reserve is about 1/256 of the first 4GiB of RAM (DMA32 zone), so it is usually around 16MiB.
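To illustrate the 1/256 figure — a hedged sketch, not kernel code: each low zone keeps back roughly the managed pages of the higher zones divided by the corresponding entry of vm.lowmem_reserve_ratio (default 256 for DMA/DMA32):

```python
# Hedged sketch of the lowmem_reserve calculation: a low zone reserves
# about (managed pages of higher zones) / lowmem_reserve_ratio pages
# against allocations that could have been satisfied from higher zones.
# The default ratio for DMA/DMA32 is 256 (see vm.lowmem_reserve_ratio).
PAGE_SIZE = 4096

def lowmem_reserve_pages(higher_zone_pages, ratio=256):
    return higher_zone_pages // ratio

# A 4 GiB "higher" region is 1048576 4-KiB pages:
reserve = lowmem_reserve_pages(4 * 1024**3 // PAGE_SIZE)
print(reserve, "pages =", reserve * PAGE_SIZE // 1024**2, "MiB")
```

For the first 4 GiB that works out to 4096 pages, i.e. the "around 16MiB" mentioned above. The per-zone values actually in effect appear in the "protection:" lines of /proc/zoneinfo.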

Linux code commits

mm: scale kswapd watermarks in proportion to memory

[...]

watermark_scale_factor:

This factor controls the aggressiveness of kswapd. It defines the amount of memory left in a node/system before kswapd is woken up and how much memory needs to be free before kswapd goes back to sleep.

The unit is in fractions of 10,000. The default value of 10 means the distances between watermarks are 0.1% of the available memory in the node/system. The maximum value is 1000, or 10% of memory.

A high rate of threads entering direct reclaim (allocstall) or kswapd going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate that the number of free pages kswapd maintains for latency reasons is too small for the allocation bursts occurring in the system. This knob can then be used to tune kswapd aggressiveness accordingly.
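We can check this against the zoneinfo extract earlier. A hedged sketch of the watermark calculation in __setup_per_zone_wmarks(): the gap between watermarks is the larger of min/4 and the scaled fraction of managed pages (details differ slightly between kernel versions):

```python
# Hedged sketch of how the low/high watermarks are derived from "min"
# and watermark_scale_factor (based on __setup_per_zone_wmarks() in
# mm/page_alloc.c; details vary slightly between kernel versions).
def zone_watermarks(min_wm, managed_pages, watermark_scale_factor=10):
    # The gap between watermarks is the larger of min/4 and
    # managed_pages * watermark_scale_factor / 10000.
    gap = max(min_wm >> 2, managed_pages * watermark_scale_factor // 10000)
    return min_wm, min_wm + gap, min_wm + 2 * gap

# Numbers from the Node 0 "Normal" zone in the /proc/zoneinfo extract:
print(zone_watermarks(min_wm=9611, managed_pages=1134236))
```

With the Normal zone's min of 9611 pages and 1134236 managed pages, this reproduces the low (12013) and high (14415) values shown in the extract; the DMA32 zone's 9062/10874 come out the same way.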

proc: meminfo: estimate available memory more conservatively

The MemAvailable item in /proc/meminfo is to give users a hint of how much memory is allocatable without causing swapping, so it excludes the zones' low watermarks as unavailable to userspace.

However, for a userspace allocation, kswapd will actually reclaim until the free pages hit a combination of the high watermark and the page allocator's lowmem protection that keeps a certain amount of DMA and DMA32 memory from userspace as well.

Subtract the full amount we know to be unavailable to userspace from the number of free pages when calculating MemAvailable.
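Putting the commit's description into a rough formula — a sketch modeled on si_mem_available(), simplified, with all values in pages; the inputs below are illustrative, not taken from a real system:

```python
# Hedged sketch of the MemAvailable estimate described in this commit
# (modeled on si_mem_available(); simplified, units are pages, and the
# example inputs are hypothetical).
def mem_available(free, total_reserve, pagecache, reclaimable_slab, wmark_low):
    available = free - total_reserve  # watermarks + lowmem protection are off-limits
    # At most half the page cache (and no more than the low watermark's
    # worth) is assumed too costly to reclaim; same for reclaimable slab.
    available += pagecache - min(pagecache // 2, wmark_low)
    available += reclaimable_slab - min(reclaimable_slab // 2, wmark_low)
    return max(available, 0)

print(mem_available(free=304988, total_reserve=30000,
                    pagecache=400000, reclaimable_slab=50000, wmark_low=9062))
```

The point of the commit is the first line: the full reserve (high watermarks plus lowmem protection), not just the low watermarks, is subtracted from free pages.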

Linux code

It is sometimes claimed that changing swappiness to 0 will effectively disable "opportunistic swapping". This provides an interesting avenue of investigation. If there is something called "opportunistic swapping", and it can be tuned by swappiness, then we could chase it down by finding all the call-chains that read vm_swappiness. Note that we can reduce our search space by assuming CONFIG_MEMCG is not set (i.e. "memory cgroups" are disabled). The call chain goes:

  • vm_swappiness
  • mem_cgroup_swappiness
  • get_scan_count
  • shrink_node_memcg
  • shrink_node

shrink_node_memcg is commented "This is a basic per-node page freer. Used by both kswapd and direct reclaim". That is, this function increases the number of free pages. It is not trying to duplicate pages to swap so that they can be freed at a much later time. But even if we discount that:

The above chain is called from three different functions, shown below. As expected, we can divide the call-sites into direct reclaim vs. kswapd. It would not make sense to perform "opportunistic swapping" in direct reclaim.

  1. /*
     * This is the direct reclaim path, for page-allocating processes.  We only
     * try to reclaim pages from zones which will satisfy the caller's allocation
     * request.
     *
     * If a zone is deemed to be full of pinned pages then just give it a light
     * scan then give up on it.
     */
    static void shrink_zones
    
  2. /*
     * kswapd shrinks a node of pages that are at or below the highest usable
     * zone that is currently unbalanced.
     *
     * Returns true if kswapd scanned at least the requested number of pages to
     * reclaim or if the lack of progress was due to pages under writeback.
     * This is used to determine if the scanning priority needs to be raised.
     */
    static bool kswapd_shrink_node
    
  3. /*
     * For kswapd, balance_pgdat() will reclaim pages across a node from zones
     * that are eligible for use by the caller until at least one zone is
     * balanced.
     *
     * Returns the order kswapd finished reclaiming at.
     *
     * kswapd scans the zones in the highmem->normal->dma direction.  It skips
     * zones which have free_pages > high_wmark_pages(zone), but once a zone is
     * found to have free_pages <= high_wmark_pages(zone), any page in that zone
     * or lower is eligible for reclaim until at least one usable zone is
     * balanced.
     */
    static int balance_pgdat
    

So, presumably the claim is that kswapd is woken up somehow, even when all memory allocations are being satisfied immediately from free memory. I looked through the uses of wake_up_interruptible(&pgdat->kswapd_wait), and I did not see any such wakeup.


No, there is no such thing as opportunistic swapping in Linux. I've spent some time looking at the issue and all the sources (textbooks, emails on kernel developers' mail lists, Linux source code and commit comments, and some Twitter exchanges with Mel Gorman) are telling me the same thing: Linux only reclaims memory in response to some form of memory pressure (with the obvious exception of hibernation).

All the popular misconceptions on the subject probably stem from the simple fact that Linux can't afford to wait until the last byte of free memory is gone before it starts swapping. It needs some sort of cushion to protect it from extreme forms of memory depletion, and there are some tunables that can affect the size of that cushion (e.g. vm.min_free_kbytes). But that is not the same as "swapping because there's nothing better to do".

Unfortunately the page frame reclamation algorithm has grown much more complex since 2.6 (when it was described in detail in Mel Gorman's book), but the basic idea is more or less the same: page reclamation is triggered by failed allocations, which then either wake up kswapd or try to free pages synchronously (depending on memory pressure, allocation flags and other factors).

The most obvious reason why page allocations may start failing with enough free memory remaining is that they may be asking for contiguous memory while in reality the memory may be too fragmented to satisfy the request. Historically, Linux kernel developers went to great lengths to avoid the need for contiguous allocations. Nevertheless, some device drivers still require them -- either because they can't do multipage memory I/O (scatter-gather DMA), or it could just be sloppy coding by driver developers. The advent of Transparent Huge Pages (THP) provided another reason for allocating memory in physically contiguous chunks.
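A toy illustration of the fragmentation problem: half of the pages below are free, yet no aligned higher-order block can be carved out, so a contiguous allocation would fail (and trigger reclaim or compaction) despite plenty of free memory:

```python
# Toy illustration of fragmentation: many free pages, but no naturally
# aligned free block large enough for a higher-order (contiguous) request.
def has_free_block(free_bitmap, order):
    size = 1 << order
    # Buddy-style allocations must be size-aligned runs of free pages.
    return any(all(free_bitmap[i + j] for j in range(size))
               for i in range(0, len(free_bitmap) - size + 1, size))

# 16 pages, half of them free, but every other page is in use:
fragmented = [i % 2 == 0 for i in range(16)]
print(sum(fragmented), "free pages; order-2 (4-page) block available:",
      has_free_block(fragmented, 2))
```

Here 8 of 16 pages are free, yet not even an order-2 (4-page) allocation can succeed without moving pages around.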

Zone compaction, which was introduced around the same time frame, is supposed to help with the memory fragmentation problem, but it doesn't always produce the expected effect.

There are various vmscan tracepoints that can help you understand what exactly is going on in your specific case -- it's always easier to find what you need in Linux kernel code when you have specific call stacks, rather than just scanning everything that looks remotely relevant.

Tags:

Linux

Memory

Swap