How to flush the CPU cache for a region of address space in Linux?

This is for ARM.

GCC provides __builtin___clear_cache, which on ARM should end up issuing the cacheflush syscall. However, it may have its caveats.
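As a minimal sketch of how the builtin is typically called: it takes the begin and (exclusive) end address of the range. Note that it is meant for keeping the instruction cache coherent after writing code, not for forcing data out to RAM. The helper name copy_code_and_sync is hypothetical, chosen only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy `len` bytes (e.g. freshly generated machine code) into `dst`,
 * then ask the compiler to make the instruction stream coherent with
 * the data that was just written.
 * copy_code_and_sync is a hypothetical helper, not a library function.
 */
static void copy_code_and_sync(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);

    /* GCC/Clang builtin: arguments are the begin and exclusive end of
     * the range. On targets with coherent instruction caches (such as
     * x86) it may expand to nothing at all. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

What the builtin expands to depends on the target, so it is worth checking the generated assembly for your particular ARM toolchain.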

The important point is that Linux provides an ARM-specific system call to flush caches. You can check Android/Bionic's flushcache for how to use this system call. However, I'm not sure what guarantees Linux gives when you call it, or how it is implemented internally.
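For illustration, here is a rough sketch of calling the ARM syscall directly through syscall(2). It assumes a 32-bit ARM Linux build where the kernel headers define __ARM_NR_cacheflush; on any other target the wrapper just reports ENOSYS. The wrapper name arm_cacheflush is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for syscall() under strict -std modes */
#endif
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#if defined(__arm__)
#include <asm/unistd.h>        /* __ARM_NR_cacheflush on 32-bit ARM */
#endif

/*
 * arm_cacheflush is a hypothetical wrapper name.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int arm_cacheflush(void *start, void *end)
{
#if defined(__arm__) && defined(__ARM_NR_cacheflush)
    /* Arguments: start address, end address, flags (must be 0). */
    return (int)syscall(__ARM_NR_cacheflush,
                        (unsigned long)(uintptr_t)start,
                        (unsigned long)(uintptr_t)end,
                        0);
#else
    (void)start;
    (void)end;
    errno = ENOSYS;            /* not a 32-bit ARM Linux build */
    return -1;
#endif
}
```

A typical call would be arm_cacheflush(buf, buf + len) right after writing code into buf.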

This blog post Caches and Self-Modifying Code may help further.


Check this page for a list of the flushing methods available in the Linux kernel: https://www.kernel.org/doc/Documentation/cachetlb.txt

Cache and TLB Flushing Under Linux. David S. Miller

There is a set of range-flushing functions:

2) flush_cache_range(vma, start, end);
   change_range_of_page_tables(mm, start, end);
   flush_tlb_range(vma, start, end);

3) void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)

Here we are flushing a specific range of (user) virtual
addresses from the cache.  After running, there will be no
entries in the cache for 'vma->vm_mm' for virtual addresses in
the range 'start' to 'end-1'.

You can also check implementation of the function - http://lxr.free-electrons.com/ident?a=sh;i=flush_cache_range

For example, in arm - http://lxr.free-electrons.com/source/arch/arm/mm/flush.c?a=sh&v=3.13#L67

void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
{
        if (cache_is_vivt()) {
                vivt_flush_cache_range(vma, start, end);
                return;
        }

        if (cache_is_vipt_aliasing()) {
                asm(    "mcr    p15, 0, %0, c7, c14, 0\n"
                "       mcr     p15, 0, %0, c7, c10, 4"
                    :
                    : "r" (0)
                    : "cc");
        }

        if (vma->vm_flags & VM_EXEC)
                __flush_icache_all();
}

In the x86 version of Linux you can also find the function void clflush_cache_range(void *vaddr, unsigned int size), which is used to flush a cache range. This function relies on the CLFLUSH or CLFLUSHOPT instructions. I would recommend checking that your processor actually supports them, because in theory they are optional.
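One way to do that check from user space is CPUID. The sketch below assumes GCC or Clang (for <cpuid.h>) and reads the feature bits as documented by Intel: leaf 1 EDX bit 19 for CLFLUSH, leaf 7 subleaf 0 EBX bit 23 for CLFLUSHOPT and bit 24 for CLWB. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>             /* GCC/Clang CPUID helpers */
#endif

/* Hypothetical container for the detected features. */
struct flush_features {
    bool clflush;
    bool clflushopt;
    bool clwb;
    unsigned line_size;        /* CLFLUSH line size in bytes, 0 if unknown */
};

static struct flush_features detect_flush_features(void)
{
    struct flush_features f = { false, false, false, 0 };
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;

    /* Leaf 1: EDX bit 19 = CLFLUSH, EBX[15:8] = line size in 8-byte units. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.clflush = (edx >> 19) & 1u;
        if (f.clflush)
            f.line_size = ((ebx >> 8) & 0xffu) * 8u;
    }

    /* Leaf 7, subleaf 0: EBX bit 23 = CLFLUSHOPT, bit 24 = CLWB. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.clflushopt = (ebx >> 23) & 1u;
        f.clwb = (ebx >> 24) & 1u;
    }
#endif
    return f;                  /* all false on non-x86 builds */
}
```

On Linux the same information is also visible as the clflush, clflushopt and clwb flags in /proc/cpuinfo.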

CLFLUSHOPT is weakly ordered. CLFLUSH was originally specified as ordered only by MFENCE, but all CPUs that implement it do so with strong ordering wrt. writes and other CLFLUSH instructions. Intel decided to add a new instruction (CLFLUSHOPT) instead of changing the behaviour of CLFLUSH, and to update the manual to guarantee that future CPUs will implement CLFLUSH as strongly ordered. For this use, you should MFENCE after using either, to make sure that the flushing is done before any loads from your benchmark (not just stores).

x86 actually provides one more instruction that could be useful: CLWB. CLWB writes data back from cache to memory without (necessarily) evicting it, leaving the line clean but still cached. On SKX, however, clwb does evict the line, like clflushopt.

Note also that these instructions are cache coherent. Their execution will affect all caches of all processors (processor cores) in the system.

All three of these instructions are available in user mode. Thus, you can use assembly (or intrinsics such as _mm_clflushopt) to write your own void clflush_cache_range(void *vaddr, unsigned int size) in a user-space application (but do not forget to check that the instructions are available before actually using them).
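A user-space version could look roughly like the following sketch. It uses the plain CLFLUSH intrinsic (_mm_clflush, available with SSE2) so that it builds without extra compiler flags, fences with MFENCE before and after the loop, and assumes a 64-byte cache line instead of querying CPUID. The function name user_clflush_range and the ASSUMED_LINE_SIZE constant are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>         /* _mm_clflush, _mm_mfence */
#endif

/* Assumption: 64-byte lines. Real code should read the size from CPUID. */
#define ASSUMED_LINE_SIZE 64u

/*
 * Flush every cache line overlapping [vaddr, vaddr + size).
 * Returns 0 on success, -1 if built for a target without SSE2/CLFLUSH.
 * user_clflush_range is a hypothetical name.
 */
static int user_clflush_range(const void *vaddr, size_t size)
{
#if defined(__SSE2__)
    /* Round the start down to a line boundary. */
    uintptr_t p = (uintptr_t)vaddr & ~(uintptr_t)(ASSUMED_LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)vaddr + size;

    _mm_mfence();              /* order earlier stores before the flushes */
    for (; p < end; p += ASSUMED_LINE_SIZE)
        _mm_clflush((const void *)p);
    _mm_mfence();              /* flushes complete before later loads */
    return 0;
#else
    (void)vaddr;
    (void)size;
    return -1;
#endif
}
```

Swapping in _mm_clflushopt would follow the same pattern, but it needs -mclflushopt (or an equivalent target attribute) and a CPU that reports the feature.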


If I understand correctly, it is much more difficult to reason about ARM in this regard. The ARM processor family is much less consistent than the IA-32 family. You can have one ARM core with full-featured caches and another with no caches at all. Furthermore, many manufacturers use customized MMUs and MPUs. So it is better to reason about a particular ARM processor model.

Unfortunately, it looks almost impossible to make any reasonable estimate of the time required to flush some data. This time is affected by too many factors, including the number of cache lines flushed, out-of-order execution of instructions, the state of the TLB (because the instruction takes a virtual address as an argument, but caches use physical addresses), the number of CPUs in the system, the actual load in terms of memory operations on the other processors in the system, how many lines from the range are actually cached by the processors, and finally the performance of the CPU, memory, memory controller and memory bus. As a result, I think the execution time will vary significantly across environments and loads. The only reasonable way is to measure the flush time on the target system, or on a similar one under a similar load.
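A measurement harness can be quite small. The sketch below times an arbitrary flush routine with clock_gettime(CLOCK_MONOTONIC), dirtying the buffer before each round so there is something to write back. All names here (flush_fn, builtin_flush, avg_flush_ns) are hypothetical, and builtin_flush is only a portable placeholder: on x86 it may compile to nothing, so substitute the flush routine you actually intend to use.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime under strict -std modes */
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical callback type: flush `len` bytes starting at `addr`. */
typedef void (*flush_fn)(void *addr, size_t len);

/* Placeholder flush routine; replace with the one under test. */
static void builtin_flush(void *addr, size_t len)
{
    __builtin___clear_cache((char *)addr, (char *)addr + len);
}

/* Average wall-clock time of one flush call, in nanoseconds. */
static double avg_flush_ns(flush_fn flush, void *buf, size_t len, int rounds)
{
    struct timespec t0, t1;
    double total = 0.0;

    for (int i = 0; i < rounds; i++) {
        memset(buf, i, len);   /* dirty the lines before each flush */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        flush(buf, len);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total += (double)(t1.tv_sec - t0.tv_sec) * 1e9
               + (double)(t1.tv_nsec - t0.tv_nsec);
    }
    return rounds > 0 ? total / rounds : 0.0;
}
```

For meaningful numbers, run it with the buffer sizes and background load you expect in production, and look at the spread across runs, not only at the average.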


A final note: do not confuse memory caches with the TLB. Both are caches, but they are organized differently and serve different purposes. The TLB caches only the most recently used translations between virtual and physical addresses, not the data those addresses point to.

Also, the TLB is not coherent, in contrast to memory caches. Be careful: flushing TLB entries does not flush the corresponding data from the memory cache.