Why does using the same cache-line from multiple threads not cause serious slowdown?

What you are seeing is basically the effect of the store buffer combined with store-to-load forwarding allowing each core to work mostly independently, despite sharing a cache line. As we will see below, it is truly a weird case where more contention is bad, up to a point, then even more contention suddenly makes things really fast!

Now with the conventional view of contention your code seems like something that will be high contention and therefore much slower than ideal. What happens, however, is that as soon as each core gets a single pending write in its write buffer, all later reads can be satisfied from the write buffer (store forwarding), and later writes just go into the buffer as well even after the core has lost ownership of the cache line. This turns most of the work into a totally local operation. The cache line is still bouncing around between the cores, but it's decoupled from the core execution path and is only needed to actually commit the stores now and then1.

The std::atomic version can't use this magic at all since it has to use locked operations to maintain atomicity and defeat the store buffer, so you see both the full cost of contention and the cost of the long-latency atomic operations2.

Let's try to actually collect some evidence that this is what's occurring. All of the discussion below deals with the non-atomic version of the benchmark that uses volatile to force reads and writes from buffer.

Let's first check the assembly, to make sure it's what we expect:

0000000000400c00 <fn(unsigned char volatile*)>:
  400c00:   ba 00 65 cd 1d          mov    edx,0x1dcd6500
  400c05:   0f 1f 00                nop    DWORD PTR [rax]
  400c08:   0f b6 07                movzx  eax,BYTE PTR [rdi]
  400c0b:   83 c0 01                add    eax,0x1
  400c0e:   83 ea 01                sub    edx,0x1
  400c11:   88 07                   mov    BYTE PTR [rdi],al
  400c13:   75 f3                   jne    400c08 <fn(unsigned char volatile*)+0x8>
  400c15:   f3 c3                   repz ret 

It's straightforward: a five instruction loop with a byte load, an increment of the loaded byte, a byte store, and finally the loop increment and conditional jump back to the top. Here, gcc has missed an optimization by breaking up the sub and jne, inhibiting macro-fusion, but overall it's OK and the store-forwarding latency is going to limit the loop in any case.

Next, let's take a look at the number of L1D misses. Every time a core needs to write into the line that has been stolen away, it will suffer an L1D miss, which we can measure with perf. First, the single threaded (N=1) case:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

       1070.188749      task-clock (msec)         #    0.998 CPUs utilized          
     2,775,874,257      cycles                    #    2.594 GHz                    
     2,504,256,018      instructions              #    0.90  insn per cycle         
       501,139,187      L1-dcache-loads           #  468.272 M/sec                  
            69,351      L1-dcache-load-misses     #    0.01% of all L1-dcache hits  

       1.072119673 seconds time elapsed

It is about what we expect: essentially zero L1D misses (0.01% of the total, probably mostly from interrupts and other code outside the loop), and just over 500,000,000 hits (matching almost exactly the number of loop iterations). Note also that we can easily calculate the cycles per iteration: about 5.55. This primarily reflects the cost of store-to-load forwarding, plus one cycle for the increment, which is a carried dependency chain as the same location is repeatedly updated (and volatile means it can't be hoisted into a register).

Let's take a look at the N=4 case:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

       5920.758885      task-clock (msec)         #    3.773 CPUs utilized          
    15,356,014,570      cycles                    #    2.594 GHz                    
    10,012,249,418      instructions              #    0.65  insn per cycle         
     2,003,487,964      L1-dcache-loads           #  338.384 M/sec                  
        61,450,818      L1-dcache-load-misses     #    3.07% of all L1-dcache hits  

       1.569040529 seconds time elapsed

As expected the L1 loads jumps from 500 million to 2 billion, since there are 4 threads each doing the 500 million loads. The number of L1D misses also jumped by about a factor of 1,000, to about 60 million. Still, that number is not a lot compared to the 2 billion loads (and 2 billion stores - not shown, but we know they are there). That's ~33 loads and ~33 stores for every miss. It also means 250 cycles between each miss.

That doesn't really fit the model of the cache line bouncing around erratically between the cores, where as soon as a core gets the line, another core demands it. We know that lines bounce around between cores sharing an L2 in perhaps 20-50 cycles, so the ratio of one miss every 250 cycles seems way to low.

Two Hypotheses

A couple ideas spring to mind for the above described behavior:

  • Perhaps the MESI protocol variant used in this chip is "smart" and recognizes that one line is hot among several cores, but only a small amount of work is being done each time a core gets the lock and the line spends more time moving between L1 and L2 than actually satisfying loads and stores for some core. In light of this, some smart component in the coherence protocol decides to enforce some kind of minimum "ownership time" for each line: after a core gets the line, it will keep it for N cycles, even if demanded by another core (the other cores just have to wait).

    This would help balance out the overhead of cache line ping-pong with real work, at the cost of "fairness" and responsiveness of the other cores, kind of like the trade-off between unfair and fair locks, and counteracting the effect described here, where the faster & fairer the coherency protocol is, the worse some (usually synthetic) loops may perform.

    Now I've never heard of anything like that (and the immediately previous link shows that at least in the Sandy-Bridge era things were moving in the opposite direction), but it's certainly possible!

  • The store-buffer effect described is actually occurring, so most operations can complete almost locally.

Some Tests

Let's try to distinguish two cases with some modifications.

Reading and Writing Distinct Bytes

The obvious approach is to change the fn() work function so that the threads still contend on the same cache line, but where store-forwarding can't kick in.

How about we just read from location x and then write to location x + 1? We'll give each thread two consecutive locations (i.e., thr[i] = std::thread(&fn, &buffer[i*2])) so each thread is operating on two private bytes. The modified fn() looks like:

for (int i=0; i<500000000; i++)
    unsigned char temp = p[0];
    p[1] = temp + 1;
}

The core loop is pretty much identical to earlier:

  400d78:   0f b6 07                movzx  eax,BYTE PTR [rdi]
  400d7b:   83 c0 01                add    eax,0x1
  400d7e:   83 ea 01                sub    edx,0x1
  400d81:   88 47 01                mov    BYTE PTR [rdi+0x1],al
  400d84:   75 f2                   jne    400d78

The only thing that's changed is that we write to [rdi+0x1] rather than [rdi].

Now as I mentioned above, the original (same location) loop is actually running fairly slowly at about 5.5 cycles per iteration even in the best-case single-threaded case, because of the loop-carried load->add->store->load... dependency. This new code breaks that chain! The load no longer depends on the store so we can execute everything pretty much in parallel and I expect this loop to run at about 1.25 cycles per iteration (5 instructions / CPU width of 4).

Here's the single threaded case:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

        318.722631      task-clock (msec)         #    0.989 CPUs utilized          
       826,349,333      cycles                    #    2.593 GHz                    
     2,503,706,989      instructions              #    3.03  insn per cycle         
       500,973,018      L1-dcache-loads           # 1571.815 M/sec                  
            63,507      L1-dcache-load-misses     #    0.01% of all L1-dcache hits                 

       0.322146774 seconds time elapsed

So about 1.65 cycles per iteration3, about about three times faster versus incrementing the same location.

How about 4 threads?

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

      22299.699256      task-clock (msec)         #    3.469 CPUs utilized          
    57,834,005,721      cycles                    #    2.593 GHz                    
    10,038,366,836      instructions              #    0.17  insn per cycle         
     2,011,160,602      L1-dcache-loads           #   90.188 M/sec                  
       237,664,926      L1-dcache-load-misses     #   11.82% of all L1-dcache hits  


       6.428730614 seconds time elapsed

So it's about 4 times slower than the same location case. Now rather than being just a bit slower than the single-threaded case it is about 20 times slower. This is the contention you've been looking for! Now also that the number of L1D misses has increased by a factor of 4 as well, nicely explaining the performance degradation and consistent with the idea that when store-to-load forwarding can't hide the contention, misses will increase by a lot.

Increasing the Distance Between Stores

Another approach would be to increase the distance in time/instructions between the store and the subsequent load. We can do this by incrementing SPAN consecutive locations in the fn() method, rather than always the same location. E.g, if SPAN is 4, increment consecutively 4 locations like:

for (long i=0; i<500000000 / 4; i++) {
    p[0]++;
    p[1]++;
    p[2]++;
    p[3]++;
}

Note that we are still incrementing 500 million locations in total, just spreading out the increments among 4 bytes. Intuitively you would expect overall performance to increase since you now have SPAN parallel dependency with length 1/SPAN, so in the case above you might expect performance to improve by a factor of 4, since the 4 parallel chains can proceed at about 4 times the total throughput.

Here's what we actually get for time (measured in cycles) for the 1 thread and 3 thread4, for SPAN values from 1 to 20:

Time vs Increment Distable

Initially you see performance increase substantially in both single and multi-threaded cases; the increase from a SPAN of one to two and three is close to the theoretical expected in the case of perfect parallelism for both cases.

The single-threaded case reaches an asymptote of about 4.25x faster than the single-location write: at this point the store-forwarding latency isn't the bottleneck and other bottlenecks have taken over (max IPC and store port contention, mostly).

The multi-threaded case is very different, however! Once you hit a SPAN of about 7, the performance rapidly gets worse, leveling out at about 2.5 times worse than the SPAN=1 case and almost 10x worse compared to the best performance at SPAN=5. What happens is that store-to-load forwarding stops occurring because the store and subsequent load are far enough apart in time/cycles that the store has retired to L1, so the load actually has to get the line and participate in MESI.

Also plotted is the L1D misses, which as mentioned above is indicative of "cache line transfers" between cores. The single-threaded case has essentially zero, and they are uncorrelated with the performance. The performance of the multi-threaded case, however, pretty much tracks exactly the cache misses. With SPAN values in the 2 to 6 range, where store-forwarding is still working, there are proportionally fewer misses. Evidently the core is able to "buffer up" more stores between each cache line transfer since the core loop is faster.

Another way to think of it is that in the contended case L1D misses are basically constant per unit-time (which makes sense, since they are basically tied to the L1->L2->L1 latency, plus some coherency protocol overhead), so the more work you can do in between the cache line transfers, the better.

Here's the code for the multi-span case:

void fn(Type *p) {
    for (long i=0; i<500000000 / SPAN; i++) {
        for (int j = 0; j < SPAN; j++) {
            p[j]++;
        }
    }
}

The bash script to run perf for all SPAN value from 1 to 20:

PERF_ARGS=${1:--x, -r10}

for span in {1..20}; do
    g++ -std=c++11 -g -O2 -march=native -DSPAN=$span  cache-line-increment.cpp  -lpthread -o cache-line-increment
    perf stat ${PERF_ARGS} -e cycles,L1-dcache-loads,L1-dcache-load-misses,machine_clears.count,machine_clears.memory_ordering ./cache-line-increment
done

Finally, "transpose" the results into proper CSV:

FILE=result1.csv; for metric in cycles L1-dcache-loads L1-dcache-load-misses; do { echo $metric; grep $metric $FILE | cut -f1 -d,; } > ${metric}.tmp; done && paste -d, *.tmp

A Final Test

There's a final test that you can do to show that each core is effectively doing most of its work in private: use the version of the benchmark where the threads work on the same location (which doesn't change the performance characteristics) examine the sum of the final counter values (you'd need int counters rather than char). If everything was atomic, you'd have a sum of 2 billion, and in the non-atomic case how close the total is to that value is a rough measure of how frequently the cores were passing around the lines. If the cores are working almost totally privately, the value would be closer to 500 million than 2 billion, and I guess that's what you'll find (a value fairly close to 500 million).

With some more clever incrementing, you can even have each thread track how often the value they incremented came from their last increment rather than another threads increment (e.g., by using a few bits of the value to stash a thread identifier). With an even more clever test you could practically reconstruct thee way the cache line moved around between the cores (is there a pattern, e.g., does core A prefer to hand off to core B?) and which cores contributed most to the final value, etc.

That's all left as an exercise :).


1 On top of that, if Intel has a coalescing store buffer where later stores that fully overlap earlier ones kill the earlier stores, it would only have to commit one value to L1 (the latest store) every time it gets the line.

2 You can't really separate the two effects here, but we will do it later by defeating store-to-load forwarding.

3 A bit more than I expected, perhaps bad scheduling leading to port pressure. If gcc would just all the sub and jne to fuse, it runs at 1.1 cycles per iteration (still worse than the 1.0 I'd expect). It will do that I use -march=haswell instead of -march=native but I'm not going to go back and change all the numbers.

4 The results hold with 4 threads as well: but I only have 4 cores and I'm running stuff like Firefox in the background, so using 1 less core makes the measurements a lot less noisy. Measuring time in cycles helps a lot too.

5 On this CPU architecture, store forwarding where the load arrives before the store data is ready seem to alternate between 4 and 5 cycles, for an average of 4.5 cycles.


The atomic version has to ensure that some other thread will be able to read the result in a sequentially consistent fashion. So there are fences for each write.

The volatile version does not make any relationships visible to the other cores, so does not try and synchronize the memory so it is visible on other cores. For a multi-threaded system using C++11 or newer, volatile is not a mechanism for communicating between threads.