C++ standard: can relaxed atomic stores be lifted above a mutex lock?

I think I've figured out the particular partial order edges that guarantee the program can't crash. In the answer below I'm referencing version N4659 of the draft standard.

The code involved for the writer thread A and reader thread B is:

A1: mu.lock()
A2: foo = 1
A3: foo_has_been_set.store(relaxed)
A4: mu.unlock()

B1: foo_has_been_set.load(relaxed) <-- (stop if false)
B2: mu.lock()
B3: assert(foo == 1)
B4: mu.unlock()

We seek a proof that if B3 executes, then A2 happens before B3, as defined in [intro.races]/10. By [intro.races]/10.2, it's sufficient to prove that A2 inter-thread happens before B3.

Because lock and unlock operations on a given mutex happen in a single total order ([thread.mutex.requirements.mutex]/5), we must have either A1 or B2 coming first. The two cases:

  1. Assume that A1 happens before B2. Then by [thread.mutex.class]/1 and [thread.mutex.requirements.mutex]/25, we know that A4 will synchronize with B2. Therefore by [intro.races]/9.1, A4 inter-thread happens before B2. Since B2 is sequenced before B3, by [intro.races]/9.3.1 we know that A4 inter-thread happens before B3. Since A2 is sequenced before A4, by [intro.races]/9.3.2, A2 inter-thread happens before B3.

  2. Assume that B2 happens before A1. Then by the same logic as above, we know that B4 synchronizes with A1. So since A1 is sequenced before A3, by [intro.races]/9.3.1, B4 inter-thread happens before A3. Therefore since B1 is sequenced before B4, by [intro.races]/9.3.2, B1 inter-thread happens before A3. Therefore by [intro.races]/10.2, B1 happens before A3. But then according to [intro.races]/16, B1 must take its value from the pre-A3 state. Therefore the load will return false, and B2 will never run in the first place. In other words, this case can't happen.

So if B3 executes at all (case 1), A2 happens before B3 and the assert will pass. ∎


The standard does not directly guarantee that, but you can read it between the lines of [thread.mutex.requirements.mutex]:

For purposes of determining the existence of a data race, these behave as atomic operations ([intro.multithread]).
The lock and unlock operations on a single mutex shall appear to occur in a single total order.

Now the second sentence looks like a hard guarantee, but it really isn't. A single total order is very nice, but it only means that there is a well-defined total order of acquiring and releasing one particular mutex. By itself, that doesn't mean that the effects of any atomic operations, or of related non-atomic operations, must be globally visible at some particular point relative to the mutex. The only thing guaranteed is the order of code execution (specifically, the execution of a single pair of functions, lock and unlock); nothing is said about what may or may not happen with the data.
One can, however, read between the lines that this is nevertheless the intention, from the "behave as atomic operations" part.

From other places, it is also pretty clear that this is the exact idea and that an implementation is expected to work that way, without explicitly saying that it must. For example, [intro.races] reads:

[ Note: For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations.

Note the unlucky little word "Note:". Notes are not normative. So, while it's clear that this is how it's intended to be understood (mutex lock = acquire, mutex unlock = release), this is not actually a guarantee.

I think the best, although non-straightforward guarantee comes from this sentence in [thread.mutex.requirements.general]:

A mutex object facilitates protection against data races and allows safe synchronization of data between execution agents.

So that's what a mutex does (without saying how exactly): it protects against data races. Full stop.

Thus, no matter what subtleties one comes up with, and no matter what else is or isn't explicitly written, using a mutex protects against data races (of any kind, since no specific kind is named). That's what is written. So, in conclusion, as long as you use a mutex, you are good to go, even with relaxed ordering or with no atomic operations at all. Loads and stores (of any kind) cannot be moved out of the protected region, because then you couldn't be sure that no data races occur; which, however, is exactly what a mutex protects against.
Thus, without saying so explicitly, this says that a mutex must be a full barrier.


No memory operation inside a mutex-protected region can 'escape' from that area. That applies to all memory operations, atomic and non-atomic.

In section 1.10.1:

a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations.

Furthermore, in section 1.10.1.6:

All operations on a given mutex occur in a single total order. Each mutex acquisition “reads the value written” by the last mutex release.

And in section 30.4.3.1:

A mutex object facilitates protection against data races and allows safe synchronization of data between execution agents

This means, acquiring (locking) a mutex sets a one-way barrier that prevents operations that are sequenced after the acquire (inside the protected area) from moving up across the mutex lock.

Releasing (unlocking) a mutex sets a one-way barrier that prevents operations that are sequenced before the release (inside the protected area) from moving down across the mutex unlock.

In addition, memory operations that are released by a mutex are synchronized (visible) with another thread that acquires the same mutex.

In your example, foo_has_been_set is checked in CheckFoo. If it reads true, you know that the value 1 has been assigned to foo by SetFoo, but that assignment is not necessarily visible to CheckFoo yet. The mutex lock that follows synchronizes with the unlock in SetFoo; at that point the synchronization is complete and the assert cannot fire.