What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?

Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores.

There has been academic research on this and there is even a patent on "eliminating silent store invalidation propagation in shared memory cache coherency protocols". (Google '"silent store" cache' if you are interested in more.)

For x86, this would interfere with MONITOR/MWAIT; some users might want the monitoring thread to wake on a silent store (one could avoid invalidation and add a "touched" coherence message). (Currently MONITOR/MWAIT is privileged, but that might change in the future.)
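
For illustration, here is a rough sketch of the kind of MONITOR/MWAIT wait loop that would be affected. The names wake_flag and wait_for_flag are made up, _mm_monitor/_mm_mwait are the SSE3 intrinsics, and on current hardware this can only run at ring 0:

#include <pmmintrin.h>   // _mm_monitor / _mm_mwait (SSE3)
#include <atomic>

std::atomic<int> wake_flag{0};      // hypothetical flag written by another core

void wait_for_flag(int expected)    // ring 0 only on current hardware
{
    while (wake_flag.load(std::memory_order_acquire) != expected) {
        _mm_monitor(&wake_flag, 0, 0);   // arm monitoring of this cache line
        if (wake_flag.load(std::memory_order_acquire) == expected)
            break;                       // re-check to close the race window
        _mm_mwait(0, 0);                 // sleep until the monitored line is written
    }
}

If the waking thread's store happened to write the value already present and hardware elided it, the monitored line would never be invalidated and this MWAIT would not wake on that store.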

Similarly, silent store elimination could interfere with some clever uses of transactional memory, e.g., if a memory location is used as a guard to avoid explicitly loading other memory locations or, in an architecture that supports it (as AMD's Advanced Synchronization Facility did), to allow dropping the guarded memory locations from the read set.

(Hardware Lock Elision is a very constrained implementation of silent ABA store elimination. It has the implementation advantage that the check for value consistency is explicitly requested.)
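
As a concrete example of that explicit request, GCC exposes HLE through extra flags on its __atomic builtins (this sketch assumes GCC, -mhle, and hardware where HLE is still enabled):

#include <immintrin.h>   // _mm_pause

static int lock_word = 0;   // stays visibly 0 to other cores while elision succeeds

void hle_lock()
{
    // XACQUIRE-prefixed exchange: the store of 1 is not made globally visible;
    // the lock word is only added to the transaction's read set.
    while (__atomic_exchange_n(&lock_word, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();
}

void hle_unlock()
{
    // XRELEASE-prefixed store of the original value commits the elision,
    // so the 0 -> 1 -> 0 "ABA" pair of stores is never seen by other cores.
    __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}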

There are also implementation issues in terms of performance impact and design complexity. Silent store elimination would prohibit avoiding read-for-ownership (unless it was only active when the cache line was already present in shared state), though read-for-ownership avoidance is also not currently implemented.

Special handling for silent stores would also complicate implementation of a memory consistency model (probably especially x86's relatively strong model). It might also increase the frequency of rollbacks on speculation that failed consistency. If silent stores were only supported for L1-present lines, the time window would be very small and rollbacks extremely rare; stores to cache lines in L3 or memory might increase the frequency to merely very rare, which might make it a noticeable issue.

Silence at cache line granularity is also less common than silence at the access level, so the number of invalidations avoided would be smaller.
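
To illustrate the granularity point, consider a hypothetical struct whose two fields share one 64-byte line:

struct alignas(64) Line {   // both fields live in the same cache line
    int a;
    int b;
};

void update(Line& l, int same_value, int new_value)
{
    l.a = same_value;   // suppose same_value equals what l.a already holds:
                        // silent at the access level
    l.b = new_value;    // a real change, so at cache-line granularity the
                        // line is dirty and the invalidation cannot be avoided
}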

The additional cache bandwidth would also be an issue. Currently Intel uses parity only on L1 caches to avoid the need for read-modify-write on small writes. Requiring every write to be preceded by a read in order to detect silent stores would have obvious performance and power implications. (Such reads could be limited to shared cache lines and be performed opportunistically, exploiting cycles without full cache access utilization, but that would still have a power cost.) It also means that this cost would largely disappear if read-modify-write support was already present for L1 ECC (a feature which would please some users).

I am not well-read on silent store elimination, so there are probably other issues (and workarounds).

With much of the low-hanging fruit for performance improvement having been taken, more difficult, less beneficial, and less general optimizations become more attractive. Since silent store optimization becomes more important with higher inter-core communication, and inter-core communication will increase as more cores are used to work on a single task, the value of this optimization seems likely to increase.


It's possible to implement in hardware, but I don't think anybody does. Doing it for every store would either cost cache-read bandwidth or require an extra read port and make pipelining harder.

You'd build a cache that did a read/compare/write cycle instead of just write, and could conditionally leave the line in Exclusive state instead of Modified (of MESI). Doing it this way (instead of checking while it was still Shared) would still invalidate other copies of the line, but that means there's no interaction with memory-ordering. The (silent) store becomes globally visible while the core has Exclusive ownership of the cache line, same as if it had flipped to Modified and then back to Exclusive by doing a write-back to DRAM.

The read/compare/write has to be done atomically (you can't lose the cache line between the read and the write; if that happened the compare result would be stale). This makes it harder to pipeline data committing to L1D from the store queue.
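
As a very rough software model of that commit step (this is not how any real cache is built, and the coherence handling is heavily simplified):

#include <cstddef>
#include <cstdint>
#include <cstring>

enum class Mesi { Invalid, Shared, Exclusive, Modified };

struct CacheLine {
    Mesi          state;
    std::uint8_t  data[64];
};

// Commit a store from the store queue into a line the core already owns
// (Exclusive or Modified, i.e. after any RFO).  The read/compare/write below
// is what would have to be atomic with respect to losing the line.
void commit_store(CacheLine& line, std::size_t offset, const void* src, std::size_t len)
{
    if (std::memcmp(line.data + offset, src, len) == 0)
        return;                        // silent store: leave the state alone
    std::memcpy(line.data + offset, src, len);
    line.state = Mesi::Modified;       // only a real change dirties the line
}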


In a multi-threaded program, it can be worth doing this as an optimization in software for shared variables only.

Avoiding invalidating everyone else's cache can make it worth converting

shared = x;

into

if(shared != x)
    shared = x;

I'm not sure if there are memory-ordering implications here. Obviously if the shared = x never happens, there's no release-sequence, so you only have acquire semantics instead of release. But if the value you're storing is often what's already there, any use of it for ordering other things will have ABA problems.
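
In C++ terms the transformation looks something like this (publish is an illustrative name; the memory orders just mirror the caveat above about only getting an acquire load on the unchanged path):

#include <atomic>

std::atomic<int> shared{0};   // heavily-contended variable shared between threads

void publish(int x)
{
    // Skip the store (and the invalidation of other cores' copies of the line)
    // when the value would not change.  On that path only the acquire load
    // happens; there is no release store for readers to synchronize with.
    if (shared.load(std::memory_order_acquire) != x)
        shared.store(x, std::memory_order_release);
}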

IIRC, Herb Sutter mentions this potential optimization in part 1 or 2 of his atomic<> Weapons: The C++ Memory Model and Modern Hardware talk. (A couple of hours of video.)

This is of course too expensive to do in software for anything other than shared variables where the cost of writing them is many cycles of delay in other threads (cache misses and memory-order mis-speculation machine clears; see What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?).


Related: See this answer for more about x86 memory bandwidth in general, especially the NT vs. non-NT store stuff, and "latency bound platforms" for why single-threaded memory bandwidth on many-core Xeons is lower than on a quad-core, even though aggregate bandwidth from multiple cores is higher.


I find evidence that some modern x86 CPUs from Intel, including Skylake and Ice Lake client chips, can optimize redundant (silent) stores in at least one specific case:

  • An all zero cache line is overwritten fully or partially with more zeros.

That is, a "zeros over zeros" scenario.

For example, this chart shows the performance (the circles, measured on the left axis) and relevant performance counters for a scenario where a region of varying size is filled with 32-bit values of either zero or one, on Ice Lake:

[Chart: Ice Lake fill performance and performance counters vs. region size]

Once the region no longer fits in the L2 cache, there is a clear advantage for writing zeros: the fill throughput is almost 1.5x higher. In the zeros case, we also see that the evictions from L2 are almost all "silent", indicating that no dirty data needed to be written out, while in the ones case all evictions are non-silent.
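
A minimal sketch of that kind of fill experiment (the buffer size, iteration count, and timing method here are illustrative, not the exact benchmark behind the chart):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

static double fill_gb_per_s(std::vector<std::uint32_t>& buf, std::uint32_t value, int iters)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        for (std::uint32_t& w : buf)
            w = value;                   // plain 32-bit stores, no NT stores
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return double(buf.size()) * sizeof(std::uint32_t) * iters / secs / 1e9;
}

int main()
{
    // 16 MiB: larger than L2, so lines are evicted toward L3 between passes.
    std::vector<std::uint32_t> buf(16 * 1024 * 1024 / sizeof(std::uint32_t));
    std::printf("filling with zeros: %.2f GB/s\n", fill_gb_per_s(buf, 0, 50));
    std::printf("filling with ones : %.2f GB/s\n", fill_gb_per_s(buf, 1, 50));
}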

Some miscellaneous details about this optimization:

  • It optimizes the write-back of the dirty cache line, not the RFO, which still needs to occur (indeed, the read is probably needed to decide that the optimization can be applied).
  • It seems to occur around the L2 or the L2 <-> L3 interface. That is, I don't find evidence of this optimization for working sets that fit in L1 or L2.
  • Because the optimization takes effect at some point outside the innermost layer of the cache hierarchy, it is not necessary to write only zeros to take advantage of it: it is enough that the line contains all zeros by the time it is written back to the L3. So starting with an all-zero line, you can do any amount of non-zero writes, followed by a final zero-write of the entire line1, as long as the line does not escape to the L3 in the meantime (see the sketch after this list).
  • The optimization has varying performance effects: sometimes it appears to be occurring based on the relevant performance counters, yet there is almost no throughput gain; other times the impact can be very large.
  • I don't find evidence of the effect in Skylake server or earlier Intel chips.
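
A sketch of the pattern from the third bullet (a single static line here just for illustration; in practice this would apply to each line of a larger buffer):

#include <cstring>

alignas(64) static unsigned char line[64] = {};   // starts out all zero

void scribble_then_restore()
{
    for (int i = 0; i < 64; ++i)
        line[i] = 0xFF;                  // any number of non-zero writes...
    std::memset(line, 0, sizeof line);   // ...then a full-line zero write, so the
                                         // line is all zeros again by the time it
                                         // is eventually written back toward L3
}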

I wrote this up in more detail here, and there is an addendum for Ice Lake, which exhibits this effect more strongly here.


1 Or, at least overwrite the non-zero parts of the line with zeros.