For-loop efficiency: merging loops

If I had to hazard a guess I would say that what you are seeing is the result of more frequent memory cache misses in the first function.

myFunc1() is essentially performing 10e8 memory writes in a random-access manner.

myFunc2() is performing 10x sequential memory writes of 10e7 words.

On a modern memory architecture I'd expect the second to be more efficient.

There are three important things here:

1) Benchmarking without optimization is meaningless. It turns out that there's a real effect under this which doesn't go away with optimization. In fact, an anti-optimized debug build was hiding a lot of the difference under the extra cost of storing loop counters in memory (limiting loops to 1 per 6 clocks vs. 1 per clock), plus not auto-vectorizing the store loops.

If you didn't already know the asm + CPU microarchitectural details of why there's a speed diff, it wasn't safe or useful to measure it with optimization disabled.

2) Cache conflict misses (if the arrays are all aligned the same relative to a page boundary). Skewing the arrays relative to each other could help a lot. This can happen naturally depending on how they're allocated, even if their sizes aren't large powers of 2.

The arrays are all large and were allocated separately with new, so they're probably all page-aligned (or offset by 16B from a page boundary in implementations that put some info (like a size) before the object). On Linux, glibc malloc/new typically handles large allocations by allocating fresh pages from the OS with mmap() (and using the first 16 bytes for bookkeeping for that block), rather than moving the brk().

4k aliasing means they all go to the same set in a typical L1d cache, which is 8-way associative on typical x86 CPUs. Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? explains why it's not a coincidence that 64 sets * 64B/line = 4096B page-size (times 8-way = 32kiB), because that makes VIPT L1d cache work like a PIPT without homonym/synonym problems. See also Which cache mapping technique is used in intel core i7 processor?

The 9th store will evict the cache line from the 1st store, so lines will be evicted once per each store, not fully written like in the contiguous case. (Unless the compiler auto-vectorizes and does a whole cache-line full of stores to one array before moving on.) x86's strongly-ordered memory model requires committing stores from the store buffer to L1d in program order, so it can't merge non-adjacent stores to the same line into one entry before commit, or commit multiple outstanding stores when a line does come in if they're not consecutive.

(The replacement policy is pseudo-LRU, not true LRU, so you might sometimes find that a line is still hot after 8 or 9 evictions in the same set.)

Reminder: the above only applies if all the arrays have the same alignment relative to a page. Over-allocating and doing ptr = 128 + malloc(128 + size) for one of the pointers can skew it relative to the others, and this is sometimes worth doing.

You say you have a PC, so I'm guessing an Intel CPU. (Ryzen's L1d has the same geometry, but Bulldozer-family doesn't.)

(Intel's optimization manual section 3.6.10 Write Combining recommends loop fission for loops that write more than 4 output streams This advice is in a section about NT stores and WC memory; it may only be intended to apply to that case. Either way 4 isn't the right number for modern Intel, unless you're being conservative to account for the other hyperthread.

(Intel's) Assembly/Compiler Coding Rule 58. (H impact, L generality) If an inner loop writes to more than four arrays (four distinct cache lines), apply loop fission to break up the body of the loop such that only four arrays are being written to in each iteration of each of the resulting loops.

TL:DR: for NT stores (cache bypassing), up to 12 output streams seems ok on Skylake and newer, or 10 on Broadwell/Haswell and older. (Or fewer if you're reading any memory at the same time). That's the number of LFBs (Line Fill Buffers) on those CPUs. Earlier CPUs (before Nehalem) had fewer than 10, and maybe couldn't use all of them for NT stores. (Where is the Write-Combining Buffer located? x86) LFBs are used for all transfers of lines to/from L1d, so e.g. a pending load miss needs an LFB allocated to be waiting for that line from L2.

(With hyperthreading, keep in mind that the other hyperthread is competing for LFBs on the same physical core, so don't depend on using all 12 LFBs unless you can disable HT.)

But you're not doing NT stores.

The conventional wisdom was that this 4-output efficiency limit applied to normal (non-NT) stores to WB memory as well, but that is not the case on modern Intel. It was a coincidence that performance for normal (WB = write-back) stores fell off at about the same number of output streams as for NT stores. That mechanical sympathy article takes some guesses at the reason, but we're pretty sure they don't sound right.

See https://github.com/Kobzol/hardware-effects/issues/1 for some microbenchmarks. (And see discussion between myself, BeeOnRope, and Hadi Brais about LFBs where this 4-output guideline came up: https://chat.stackoverflow.com/transcript/message/45474939#45474939 which was previously in comments under Size of store buffers on Intel hardware? What exactly is a store buffer?

@BeeOnRope also posted a bar graph for regular (non-NT) stores interleaved to 1 to 15 output streams on Skylake. Performance is somewhat constant for any number of streams up to about 6 on Skylake, then it starts to get worse at 7 and 8 (maybe from L1d conflict misses if the arrays were all aligned the same way), and more significantly from 9 and up until getting close to a plateau at 13 to 15. (At about 1/3rd the performance of the 1 to 6 stream good case).

Again, with Hyperthreading, the other logical core will almost certainly be generating some memory traffic if it's running at all, so a conservative limit like 4 output streams is not a bad plan. But performance doesn't fall off a cliff at 7 or 8, so don't necessarily fission your loops if that costs more total work.

See also Enhanced REP MOVSB for memcpy for more about regular RFO stores vs. no-RFO NT stores, and lots of x86 memory bandwidth issues. (Especially that memory/L3 cache latency limits single-core bandwidth on most CPUs, but it's worse on many-core Xeons: they surprisingly have lower single-core memory bandwidth than a quad-core desktop. With enough cores busy, you can saturate their high aggregate bandwidth from quad or 6 channel memory controllers; that's the situation they're optimized for.)

2.5) DRAM page locality: write-back to memory happens when data is eventually evicted from L3 (Last level cache). The dirty cache lines get sent to the memory controller which can buffer and batch them up into groups, but there will still be a mix of stores (and RFO loads) to all 10 arrays. A dual channel memory controller can't have 10 DRAM pages open at once. (I think only 1 per channel, but I'm not an expert on DRAM timings. See Ulrich Drepper's What Every Programmer Should Know About Memory which does have some details.) https://pubweb.eng.utah.edu/~cs6810/pres/12-6810-15c.pdf mentions DRAM open/closed page policies for streaming vs. scattered stores.

The bottom line here is that even if cache could handle many output streams, DRAM is probably happier with fewer. Note that a DRAM "page" is not the same size as a virtual memory page (4k) or hugepage (2M).

Speaking of virtual memory, the TLB should be fine with 10 output streams: modern x86 CPUs have many more than 10 L1dTLB entries. Hopefully they're associative enough, or the entries don't all alias, so we don't get a TLB-miss on every store!

3) Compile-time alias analysis

@RichardHodges spotted this one)

Your big combined loop doesn't auto-vectorize with gcc or clang. They can't prove that list1[10] isn't also list4[9] or something, so they can't store list1[8..11] with a single 16-byte store.

But the single-array loops can easily auto-vectorize with SSE or AVX. (Surprisingly not to a wmemset call or something, just with the regular auto-vectorizer only at gcc -O3, or clang -O2. That might switch to NT stores for large sizes, which would help most if multiple cores are competing for memory bandwidth. memset pattern-recognition is / would be useful even without auto-vectorization.)

The only alias analysis required here is to prove that list1[i] = 2 doesn't modify the list1 pointer value itself (because the function reads the global inside the loop, instead of copying the value to a local). Type-based aliasing analysis (-fstrict-aliasing is on by default) allows the compiler to prove that, and/or the fact that if list1 was pointing to itself, there would be undefined behaviour from accessing outside the object in later loop iterations.

Smart compilers can and do check for overlap before auto-vectorizing in some cases (e.g. of output arrays against input arrays) when you fail to use the __restrict keyword (borrowed by several compilers from C's restrict). If there is overlap, they fall back to a safe scalar loop.

But that doesn't happen in this case: gcc and clang don't generate a vectorized loop at all, they just do scalar in myFunc1. If each store causes a conflict miss in L1d, this makes this 4x worse than if you'd given the compiler enough info to do its job. (Or 8x with AVX for 32-byte stores). Normally the difference between 16B vs. 32B stores is minor when main memory bandwidth is the bottleneck (not L1d cache), but here it could a big deal because 10 output streams breaks the write-combining effect of L1d if they all alias.

BTW, making the global variables static int *__restrict line1 and so on does allow gcc to auto-vectorize the stores in myFunc1. It doesn't fission the loop, though. (It would be allowed to, but I guess it's not looking for that optimization. It's up to the programmer to do that.)

// global modifier allows auto-vec of myFunc1
#define GLOBAL_MODIFIER  __restrict
#define LOCAL_MODIFIER  __restrict  // inside myFunc1

static int *GLOBAL_MODIFIER list1, *GLOBAL_MODIFIER list2,
       *GLOBAL_MODIFIER list3, *GLOBAL_MODIFIER list4,
       *GLOBAL_MODIFIER list5, *GLOBAL_MODIFIER list6,
       *GLOBAL_MODIFIER list7, *GLOBAL_MODIFIER list8,
       *GLOBAL_MODIFIER list9, *GLOBAL_MODIFIER list10;

I put your code on the Godbolt compiler explorer with gcc8.1 and clang6.0, with that change + a function that reads from one of the arrays to stop them from optimizing away entirely (which they would because I made them static.)

Then we get this inner loop which should probably run 4x faster than the scalar loop doing the same thing.

.L12:    # myFunc1 inner loop from gcc8.1 -O3  with __restrict pointers
    movups  XMMWORD PTR [rbp+0+rax], xmm9       # MEM[base: l1_16, index: ivtmp.87_52, offset: 0B], tmp108
    movups  XMMWORD PTR [rbx+rax], xmm8 # MEM[base: l2_17, index: ivtmp.87_52, offset: 0B], tmp109
    movups  XMMWORD PTR [r11+rax], xmm7 # MEM[base: l3_18, index: ivtmp.87_52, offset: 0B], tmp110
    movups  XMMWORD PTR [r10+rax], xmm6 # MEM[base: l4_19, index: ivtmp.87_52, offset: 0B], tmp111
    movups  XMMWORD PTR [r9+rax], xmm5  # MEM[base: l5_20, index: ivtmp.87_52, offset: 0B], tmp112
    movups  XMMWORD PTR [r8+rax], xmm4  # MEM[base: l6_21, index: ivtmp.87_52, offset: 0B], tmp113
    movups  XMMWORD PTR [rdi+rax], xmm3 # MEM[base: l7_22, index: ivtmp.87_52, offset: 0B], tmp114
    movups  XMMWORD PTR [rsi+rax], xmm2 # MEM[base: l8_23, index: ivtmp.87_52, offset: 0B], tmp115
    movups  XMMWORD PTR [rcx+rax], xmm1 # MEM[base: l9_24, index: ivtmp.87_52, offset: 0B], tmp116
    movups  XMMWORD PTR [rdx+rax], xmm0 # MEM[base: l10_25, index: ivtmp.87_52, offset: 0B], tmp117
    add     rax, 16   # ivtmp.87,
    cmp     rax, 40000000     # ivtmp.87,
    jne     .L12      #,

(This is compiling for x86-64, of course. x86 32-bit doesn't have enough registers to keep all the pointers in regs, so you'd have a few loads. But those would hit in L1d cache and not actually be much of a throughput bottleneck: at a 1 store per clock bottleneck, there's plenty of throughput to get some more work done in this case where you're just storing constants.)

This optimization is like unrolling the loop 4x and the re-arranging to group 4 stores to each array together. This is why it can't be done if the compiler doesn't know about them being non-overlapping. clang doesn't do it even with __restrict, unfortunately. The normal use of __restrict to promise non-overlapping is on function args, not locals or globals, but I didn't try that.

With global arrays instead of global pointers, the compiler would know they didn't overlap (and there wouldn't be a pointer value stored in memory anywhere; the array addresses would be link-time constants.) In your version, the arrays themselves have dynamic storage and it's only the pointers to them that have static storage.

Interleaved full-cache-line stores:

What if myFunc1 stored 64 bytes to one array before moving on to the next? Then your compiler could safely compile it to 4 (SSE), 2 (AVX), or 1 (AVX512) vector stores per array per iteration, covering a full 64 bytes.

If you aligned your pointers by 64 (or if the compiler did some alias analysis and got to the first 64-byte boundary in each output array), then each block of stores would fully write a cache line, and we wouldn't touch it again later.

That would avoid L1d conflict-misses, right? Well maybe, but unless you use NT stores to avoid RFOs, the HW prefetchers need to be pulling lines into L2 and then into L1d before the stores try to commit. So it's not as simple as you might think, but the write-combining buffers that combine stores to cache lines that haven't arrived yet can help.

The L2 streamer prefetcher in Intel CPUs can track 1 forward and 1 backward access per page, so it should be ok (if the arrays don't alias in L2). It's the L1d prefetching that's the big problem.

It would still greatly reduce the amount of cache lines bouncing to/from L2. If you ever have a loop that can't fission easily into multiple loops, at least unroll it so you can write a full cache line before moving on

AVX512 might make a difference; IDK if an aligned vmovdqa64 [mem], zmm0 on Skylake-AVX512 can maybe skip loading the old value when getting the cache line into MESI Modified state, because it knows it's overwriting the entire cache line. (If done without merge-masking).

gcc8.1 doesn't bother to align output pointers even with AVX512; a possibly-overlapping first and last vector would probably be a good strategy for easy cases like this where writing the same memory twice isn't a problem. (Alignment makes more difference for AVX512 than for AVX2 on Skylake hardware.)

4) Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake shows that interleaving dummy writes (to the same location) with a stream of stores can make it worse than 1 contiguous stream, for L1d / L2 bandwidth.

Possibly because of store-merging / coalescing happening in the store buffer before commit to L1d cache. But only for adjacent stores to the same cache line (because x86's strongly-ordered memory model can't allow stores to commit to L1d out of order).

That test doesn't suffer from the cache-conflict problems. But writing a whole cache line contiguously should help some there, too.

This code creates the variables:

    list1 = new int[n]; list2 = new int[n];
    list3 = new int[n]; list4 = new int[n];
    list5 = new int[n]; list6 = new int[n];
    list7 = new int[n]; list8 = new int[n];
    list9 = new int[n]; list10 = new int[n];

but it almost certainly does not create the actual physical page mappings until the memory is actually modified. See Does malloc lazily create the backing pages for an allocation on Linux (and other platforms)? for an example.

So your func1() has to wait for the creation of the actual physical pages of RAM, whereas your func2() doesn't. Change the order, and the mapping time will be attributed to func2() performance.

The easiest solution given your code as posted is to run either func1() or func2() before doing your timed runs.

If you don't ensure that the actual physical memory has been mapped before you do any benchmarking, that mapping will be part of the time you measure when you first modify the memory.

The thing you gain from a single loop, is the fact that you lose the incrememnting of the loop variable. So in a case like this where the content of the loop is so trivial, that assignment (and test) makes a big difference.

What your example fails to take into consideration also; is that contiguous memory access will often be faster than random access.

In a function where the loop takes much longer (try putting a sleep in rather than assignment) you'll find that the difference works it's way into not a lot.

THE way to get performance improvements is to start with the Maths - the right algorithm will always buy the biggest improvements. This is done, ideally, before finger hits keyboard.

For-loop efficiency: merging loops

3) Compile-time alias analysis

Interleaved full-cache-line stores:

Tags:

Performance

C++

Loops

Benchmarking

Related

Recent Posts