How do memory_order_seq_cst and memory_order_acq_rel differ?

http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.

The example boils down to this:

bool x= false;
bool y= false;
int z= 0;

a() { x= true; }
b() { y= true; }
c() { while (!x); if (y) z++; }
d() { while (!y); if (x) z++; }

// kick off a, b, c, d, join all threads
assert(z!=0);

Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.


On ISAs like x86 where atomics map to barriers, and the actual machine model includes a store buffer:

  • seq_cst stores require flushing the store buffer so this thread's later reads are delayed until after the store is globally visible.

  • acquire or release do not have to flush the store buffer. Normal x86 loads and stores have essentially acq and rel semantics. (seq_cst plus a store buffer with store forwarding.)

    But x86 atomic RMW operations always get promoted to seq_cst because the x86 asm lock prefix is a full memory barrier. Other ISAs can do relaxed or acq_rel RMWs in asm, with the store side being able to do limited reordering with later stores. (But not in ways that would make the RMW appear non-atomic: For purposes of ordering, is atomic read-modify-write one operation or two?)


https://preshing.com/20120515/memory-reordering-caught-in-the-act is an instructive example of the difference between a seq_cst store and a plain release store. (It's actually mov + mfence vs. plain mov in x86 asm. In practice xchg is a more efficient way to do a seq_cst store on most x86 CPUs, but GCC does use mov+mfence)


Fun fact: AArch64's LDAR acquire-load instruction is actually a sequential-acquire, having a special interaction with STLR. Not until ARMv8.3 LDAPR can arm64 do plain acquire operations that can reorder with earlier release and seq_cst stores (STLR). (seq_cst loads still use LDAR because they need that interaction with STLR to recover sequential consistency; seq_cst and release stores both use STLR).

With STLR / LDAR you get sequential consistency, but only having to drain the store buffer before the next LDAR, not right away after each seq_cst store before other operations. I think real AArch64 HW does implement it this way, rather than simply draining the store buffer before committing an STLR.

Strengthening rel or acq_rel to seq_cst by using LDAR / STLR doesn't need to be expensive, unless you seq_cst store something, and then seq_cst load something else. Then it's just as bad as x86.

Some other ISAs (like PowerPC) have more choices of barriers and can strengthen up to mo_rel or mo_acq_rel more cheaply than mo_seq_cst, but their seq_cst can't be as cheap as AArch64; seq-cst stores need a full barrier.

So AArch64 is an exception to the rule that seq_cst stores drain the store buffer on the spot, either with a special instruction or a barrier instruction after. It's not a coincidence that ARMv8 was designed after C++11 / Java / etc. basically settled on seq_cst being the default for lockless atomic operations, so making them efficient was important. And after CPU architects had a few years to think about alternatives to providing barrier instructions or just acquire/release vs. relaxed load/store instructions.