How can I show that volatile assignment is not atomic?

Some answers/comments suggested sleeping in the writer. This is not useful; hammering away on the cache line changing it as often as possible is what you want. (And what you get with volatile assignments and reads.) An assignment will be torn when a MESI share request for the cache line arrives at the writer core between committing two halves of a store from the store buffer to L1d cache.

If you sleep, you're waiting a long time without creating a window for that to happen. Sleeping between halves would make it even easier to detect, but you can't do that unless you use separate memcpy calls (or something similar) to write the two halves of the 64-bit integer.

Tearing between reads in the reader is also possible even if writes are atomic. This may be less likely, but still happens plenty in practice. Modern x86 CPUs can execute two loads per clock cycle (Intel since Sandybridge, AMD since K8). I tested with atomic 64-bit stores but split 32-bit loads on Skylake and tearing is still frequent enough to spew lines of text in a terminal. So the CPU didn't manage to run everything in lock-step with corresponding pairs of reads always executing in the same clock cycle. So there is a window for the reader getting its cache line invalidated between a pair of loads. (However, all the pending cache-miss loads while the cache line is owned by the writer core probably complete all at once when the cache line arrives. And the total number of load buffers available is an even number in existing microarchitectures.)
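For reference, here's roughly the kind of reader that forces split loads even when the writer's stores are 64-bit atomic (a sketch, not the exact code I tested; it assumes a little-endian target, and the names are mine):

#include <cstdint>

extern volatile uint64_t sharedValue;   // written with atomic 64-bit stores elsewhere

// Read the two 32-bit halves separately so the compiler can't merge them
// back into one 64-bit load. (Type-punning like this is its own UB in ISO
// C++, but that's beside the point for a tearing experiment.)
uint64_t split_load() {
    volatile uint32_t *halves = reinterpret_cast<volatile uint32_t*>(&sharedValue);
    uint32_t low  = halves[0];   // an invalidation can arrive
    uint32_t high = halves[1];   // between these two loads
    return low | (uint64_t(high) << 32);
}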


As you discovered, your test values both had the same upper half of 0, so this made it impossible to observe any tearing; only the 32-bit aligned low half was ever changing, and was changing atomically because your compiler guarantees at least 4-byte alignment for uint64_t, and x86 guarantees that 4-byte aligned loads/stores are atomic.

0 and -1ULL are the obvious choices. I used the same thing in a test-case for this GCC C11 _Atomic bug for a 64-bit struct.

For your case, I'd do this. read() and write() are POSIX system call names so I picked something else.

#include <cstdint>
#include <iostream>   // for std::cout in the reader

volatile uint64_t sharedValue = 0;  // initializer = one of the 2 values!

void writer() {
    for (;;) {
        sharedValue = 0;
        sharedValue = -1ULL;  // unrolling is vastly simpler than an if
    }
}

void reader() {
    for (;;) {
        uint64_t val = sharedValue;
        uint32_t low = val, high = val>>32;
        if (low != high) {
            std::cout << "Tearing! Value: " << std::hex << val << '\n';
        }
    }
}

MSVC 19.24 -O2 compiles the writer to a movlpd 64-bit store for the = 0, but two separate 32-bit stores of -1 for the = -1. (And the reader to two separate 32-bit loads.) GCC uses a total of four mov dword ptr [mem], imm32 stores in the writer, like you'd expect. (Godbolt compiler explorer)

Terminology: it's always a race condition (even with atomicity you don't know which of the two values you're going to get). With std::atomic<> you'd only have that garden-variety race condition, no Undefined Behaviour.

The question is whether you actually see tearing from the data race Undefined Behaviour on the volatile object, on a specific C++ implementation / set of compile options, for a specific platform. Data race UB is a technical term with a more specific meaning than "race condition". I changed the error message to report the one symptom we check for. Note that data-race UB on a non-volatile object can have way weirder effects, like hoisting the load or store out of loops, or even inventing extra reads leading to code that thinks one read was both true and false at the same time. (https://lwn.net/Articles/793253/)

I removed 2 redundant cout flushes: one from std::endl and one from std::flush. cout is line-buffered by default, or fully buffered if writing to a file, which is fine. And '\n' is just as portable as std::endl as far as DOS line endings are concerned; text vs. binary stream mode handles that. std::endl still just writes '\n' (plus the flush).

I simplified your check for tearing by checking that high_half == low_half. Then the compiler just has to emit one cmp/jcc instead of two extended-precision compares to see if the value is either 0 or -1 exactly. We know that there's no plausible way for false-negatives like high = low = 0xff00ff00 to happen on x86 (or any other mainstream ISA with any sane compiler).
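In other words, assuming the original check compared against the two known values (my guess at what it looked like), the difference is roughly this, where val is the 64-bit value just loaded:

// presumed original style: two 64-bit compares against the known values
if (val != 0 && val != -1ULL) { /* report tearing */ }

// simplified: one 32-bit compare between the halves; any mismatch is tearing
if (uint32_t(val) != uint32_t(val >> 32)) { /* report tearing */ }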


So I guess that on x86 there is no need to use std::atomic for 32 bit data types?

Incorrect.

Hand-rolled atomics with volatile int can't give you atomic RMW operations (without inline asm or special functions like Windows InterlockedIncrement or GNU C builtin __atomic_fetch_add), and can't give you any ordering guarantees wrt. other code. (Release / acquire semantics)
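To illustrate the RMW point (a sketch, with names of my own invention): incrementing a volatile int compiles to a separate load and store, so concurrent increments can be lost, while std::atomic gives you a single atomic instruction:

#include <atomic>

volatile int vcount = 0;
std::atomic<int> acount{0};

void inc_volatile() {
    vcount = vcount + 1;   // load, add, store: another thread's increment
                           // can be lost between the load and the store
}

void inc_atomic() {
    acount.fetch_add(1, std::memory_order_relaxed);  // one atomic RMW
                                                     // (lock add on x86)
}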

When to use volatile with multi threading? - pretty much never.

Rolling your own atomics with volatile is still possible and de-facto supported by many mainstream compilers (e.g. the Linux kernel still does that, along with inline asm). Real-world compilers do effectively define the behaviour of data races on volatile objects. But it's generally a bad idea when there's a portable and guaranteed-safe way. Just use std::atomic<T> with std::memory_order_relaxed to get asm that's just as efficient as what you could get with volatile (for the cases where volatile works), but with guarantees of safety and correctness from the ISO C++ standard.
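For instance, a relaxed-atomic version of the writer above (a sketch; sharedAtomic and writer_atomic are my names): on x86-64 the stores compile to plain mov instructions, same as volatile, but with the atomicity actually guaranteed.

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> sharedAtomic{0};

void writer_atomic() {
    for (;;) {
        sharedAtomic.store(0, std::memory_order_relaxed);
        sharedAtomic.store(-1ULL, std::memory_order_relaxed);
    }
}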

atomic<T> also lets you ask the implementation whether a given type can be cheaply atomic or not, with C++17 std::atomic<T>::is_always_lock_free or the older member function. (In practice, C++11 implementations decided not to let some but not all instances of a given atomic type be lock-free depending on alignment; instead they just give atomic<T> the required alignas if there is one. So C++17 added a per-type constant, instead of the per-object member function, as a way to check lock-freedom.)
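For example (exact results depend on the target, of course):

#include <atomic>
#include <cstdint>

static_assert(std::atomic<uint32_t>::is_always_lock_free,
              "expected on any mainstream target");   // C++17 per-type constant

// The older per-object runtime query still exists:
// std::atomic<uint64_t> x;  bool lf = x.is_lock_free();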

std::atomic can also give cheap lock-free atomicity for types wider than a normal register. e.g. on ARM, using ARMv6 strd / ldrd to store/load a pair of registers.

On 32-bit x86, a good compiler can implement std::atomic<uint64_t> by using SSE2 movq to do atomic 64-bit loads and stores, without falling back to the non-lock_free mechanism (a table of locks). In practice, GCC and clang 9 do use movq for atomic<uint64_t> load/store. clang 8.0 and earlier unfortunately use lock cmpxchg8b. MSVC uses lock cmpxchg8b in an even more inefficient way. Change the definition of sharedVariable in the Godbolt link to see it. (Or if you use one each of default seq_cst and memory_order_relaxed stores in the loop, MSVC for some reason calls a ?store@?$_Atomic_storage@_K$07@std@@QAEX_KW4memory_order@2@@Z helper function for one of them. But when both stores have the same ordering, it inlines lock cmpxchg8b in much clunkier loops than clang 8.0.) Note that this inefficient MSVC code-gen is for a case where volatile wasn't atomic; where it is, atomic<T> with mo_relaxed compiles nicely, too.

You generally can't get that wide-atomic code-gen from volatile. Although GCC does actually use movq for your if() bool write function (see the earlier Godbolt compiler explorer link), presumably because it can't see through the alternating pattern. It also depends on what values you use: with 0 and -1 it uses separate 32-bit stores, but with 0 and 0x0f0f0f0f0f0f0f0fULL you get movq, giving a usable pattern. (I used this to verify that you can still get tearing from just the read side, instead of hand-writing some asm.) My simple unrolled version compiles to plain mov dword [mem], imm32 stores with GCC. This is a good example of there being zero guarantee of how volatile really compiles at this level of detail.

atomic<uint64_t> will also guarantee 8-byte alignment for the atomic object, even if plain uint64_t might have only been 4-byte aligned.
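You can see this with alignof. Something like the following (a sketch; the plain-uint64_t result depends on the ABI, e.g. only 4 inside structs on i386 System V):

#include <atomic>
#include <cstdint>
#include <cstdio>

int main() {
    std::printf("alignof(uint64_t) = %zu\n", alignof(uint64_t));
    std::printf("alignof(std::atomic<uint64_t>) = %zu\n",
                alignof(std::atomic<uint64_t>));   // 8 on mainstream implementations
}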


In ISO C++, a data race on a volatile object is still Undefined Behaviour. (Except for volatile sig_atomic_t racing with a signal handler.)

A "data race" is any time two unsynchronized accesses happen and they're not both reads. ISO C++ allows for the possibility of running on machines with hardware race detection or something; in practice no mainstream systems do that so the result is just tearing if the volatile object isn't "naturally atomic".

ISO C++ also in theory allows for running on machines that don't have coherent shared memory and require manual flushes after atomic stores, but that's not really plausible in practice. No real-world implementations are like that, AFAIK. Systems with cores that have non-coherent shared memory (like some ARM SoCs with DSP cores + microcontroller cores) don't start std::thread across those cores.

See also Why is integer assignment on a naturally aligned variable atomic on x86?

It's still UB even if you don't observe tearing in practice, although as I said real compilers do de-facto define the behaviour of volatile.


Skylake experiments to try to detect store-buffer coalescing

I wondered if store coalescing in the store buffer could maybe create an atomic 64-bit commit to L1d cache out of two separate 32-bit stores. (No useful results so far, leaving this here in case anyone's interested or wants to build on it.)

I used a GNU C __atomic builtin for the reader, so if the stores also ended up being atomic we'd see no tearing.

void reader() {
    for (;;) {
        uint64_t val = __atomic_load_n(&sharedValue, __ATOMIC_ACQUIRE);
        uint32_t low = val, high = val>>32;
        if (low != high) {
            std::cout << "Tearing! Value: " << std::hex << val << '\n';
        }
    }
}

This was one attempt to get the microarchitecture to group the stores.

#include <immintrin.h>   // _mm_mfence

void writer() {
    volatile int separator;  // in a different cache line, has to commit separately
    for (;;) {
        sharedValue = 0;

        _mm_mfence();
        separator = 1234;
        _mm_mfence();
        sharedValue = -1ULL;  // unrolling is vastly simpler than an if

        _mm_mfence();
        separator = 1234;
        _mm_mfence();
    }
}

I still see tearing with this. (mfence on Skylake with updated microcode is like lfence, and blocks out-of-order exec as well as draining the store buffer. So later stores shouldn't even enter the store buffer before the earlier ones leave. That might actually be a problem, because we need time for merging, not just committing a 32-bit store as soon as it "graduates" when the store uops retire.)

Probably I should try to measure the rate of tearing and see if it's less frequent with anything, because any tearing at all is enough to spam a terminal window with text on a 4GHz machine.
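A sketch of what a rate-measuring reader could look like (hypothetical, not something I've benchmarked; count mismatches instead of printing each one, and report a ratio periodically):

#include <cstdint>
#include <cstdio>

extern volatile uint64_t sharedValue;

void reader_counting() {
    uint64_t torn = 0;
    for (uint64_t total = 1; ; ++total) {
        uint64_t val = sharedValue;
        if (uint32_t(val) != uint32_t(val >> 32))
            ++torn;
        if (total % 1000000000 == 0)   // report every ~1e9 reads
            std::printf("%llu torn / %llu reads\n",
                        (unsigned long long)torn, (unsigned long long)total);
    }
}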


Grab the disassembly and then check the documentation for your architecture; on some machines you'll find even standard "non-atomic" operations (in terms of C++) are actually atomic when it hits the hardware (in terms of assembly).

With that said, your compiler will know what is and isn't safe and it is therefore a better idea to use the std::atomic template to make your code more portable across architectures. If you're on a platform that doesn't require anything special, it'll typically get optimised down to a primitive type anyway (putting memory ordering aside).

I don't remember the details of x86 operations off hand, but I would guess that you have a data race if the 64-bit integer is written in 32-bit "chunks" (or smaller); it is possible to get a torn read in that case.

There are also tools like ThreadSanitizer that can catch it in the act. I don't believe it's supported on Windows with MSVC, but if you can get GCC or clang working then you might have some luck there. If your code is portable (it looks like it is), then you can run it on a Linux system (or in a VM) using these tools.
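ThreadSanitizer is built into GCC and clang; e.g. (test.cpp is a placeholder filename):

g++ -O1 -g -fsanitize=thread test.cpp -pthread -o test

It instruments loads and stores at compile time, so it can report the race even on runs where the read doesn't happen to tear.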


I changed the code to:

#include <climits>   // ULLONG_MAX

volatile uint64_t sharedValue = 0;
const uint64_t value1 = 0;
const uint64_t value2 = ULLONG_MAX;

and now the code triggers the race condition in less than a second. The problem was that both 13 and 1414 have their 32 most-significant bits all zero:

13=0xd
1414=0x586
0=0x0
ULLONG_MAX=0xffffffffffffffff