IBM example code: non-re-entrant functions don't work on my system

Looking at the Godbolt compiler explorer (after adding the missing #include <unistd.h>), one sees that almost any x86_64 compiler generates code that copies the ones and zeros structs with single QWORD moves:

        mov     rax, QWORD PTR main::ones[rip]   # load the whole 8-byte struct in one instruction
        mov     QWORD PTR data[rip], rax         # store it to data in one instruction

The IBM site says "On most machines, it takes several instructions to store a new value in data, and the value is stored one word at a time." That might have been true for typical CPUs in 2005, but as the code above shows, it is not true now. Changing the struct to have two longs rather than two ints would show the issue.
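Here is a minimal sketch of that suggestion (the struct, variable, and function names are my own, not the exact IBM code): widening the members to long makes the struct 16 bytes on x86-64, so a plain assignment can no longer be done with a single general-purpose 8-byte move, and a signal arriving between the two halves of the copy could observe a mixed value.

    /* Hypothetical variant of the IBM example's struct: two longs
       instead of two ints, making it 16 bytes on x86-64. */
    struct two_long { long a; long b; };

    struct two_long data;

    void writer(void)
    {
        struct two_long ones  = { 1, 1 };
        struct two_long zeros = { 0, 0 };
        for (;;) {
            data = ones;   /* may compile to two 8-byte stores (or one SSE store) */
            data = zeros;  /* a handler interrupting between them can see {1, 0}  */
        }
    }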

I previously wrote that this was "atomic", which was lazy. The program is only running on a single CPU. Each instruction will complete from the point of view of this CPU (assuming there is nothing else altering the memory, such as DMA).

So at the C level it is not defined that the compiler will choose a single instruction to write the struct, and so the corruption mentioned in the IBM article can happen. Modern compilers targeting current CPUs do use a single instruction, and a single instruction is good enough to avoid corruption in a single-threaded program.


That's not really re-entrancy; you're not running a function twice in the same thread (or in different threads). You can get that via recursion, or by passing the address of the current function as a callback function-pointer arg to another function. (And that wouldn't be unsafe, because it would be synchronous.)

This is just plain vanilla data-race UB (Undefined Behaviour) between a signal handler and the main thread: only volatile sig_atomic_t (and lock-free atomic types) is guaranteed safe for this. Other types may happen to work, like in your case where an 8-byte object can be loaded or stored with one instruction on x86-64 and the compiler happens to choose that asm (as @icarus's answer shows).
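For reference, the pattern the C standard actually blesses looks something like this minimal sketch (names are made up): a volatile sig_atomic_t flag, written with a single store from the handler.

    #include <signal.h>

    static volatile sig_atomic_t got_signal = 0;

    static void handler(int sig)
    {
        (void)sig;
        got_signal = 1;   /* single store to a volatile sig_atomic_t: well-defined */
    }

    int main(void)
    {
        signal(SIGINT, handler);
        while (!got_signal) {
            /* volatile forces a fresh load each iteration, so the loop
               actually notices when the handler sets the flag */
        }
        return 0;
    }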

See MCU programming - C++ O2 optimization breaks while loop - an interrupt handler on a single-core microcontroller is basically the same thing as a signal handler in a single-threaded program. In that case, the result of the UB was that a load got hoisted out of a loop.
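A sketch of that failure mode (hypothetical names; this is the broken version, not a fix): with a plain non-volatile, non-_Atomic flag, the compiler is allowed to assume nothing else modifies it inside the loop and hoist the load.

    int flag;   /* written asynchronously by an ISR / signal handler: data-race UB */

    void wait_for_flag(void)
    {
        while (flag == 0) {
            /* with optimization enabled this can effectively become
                   if (flag == 0) for (;;) {}
               because the load of 'flag' is hoisted out of the loop */
        }
    }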

Your test-case of tearing actually happening because of data-race UB was probably developed/tested in 32-bit mode, or with an older, dumber compiler that loaded and stored the struct members separately.

In your case, the compiler can optimize the stores out of the infinite loop because no UB-free program could ever observe them: data is not _Atomic or volatile, and there are no other side-effects in the loop, so there's no way any reader could synchronize with this writer. This in fact happens if you compile with optimization enabled (Godbolt shows an empty loop at the bottom of main).

I also changed the struct to two long long members, and gcc then uses a single movdqa 16-byte store before the loop. (That store is not guaranteed to be atomic, but in practice it is on almost all CPUs, assuming it's aligned, or on Intel if it merely doesn't cross a cache-line boundary. Why is integer assignment on a naturally aligned variable atomic on x86?)

So compiling with optimization enabled would also break your test, and show you the same value every time. C is not a portable assembly language.

volatile struct two_int would also force the compiler not to optimize the stores away, but it would not force it to load/store the whole struct atomically. (It wouldn't stop it from doing so either, though.)

Note that volatile does not avoid data-race UB, but in practice it's sufficient for inter-thread communication, and it was how people built hand-rolled atomics (along with inline asm) before C11 / C++11, on normal CPU architectures. They're cache-coherent, so volatile is in practice mostly similar to _Atomic with memory_order_relaxed for pure loads and pure stores, if used on types narrow enough that the compiler uses a single instruction so you don't get tearing. And of course volatile has no such guarantees from the ISO C standard, vs. writing code that compiles to the same asm using _Atomic and memory_order_relaxed.
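A sketch of both options, assuming the two-int struct from the IBM example (the member names and declarations here are my own): volatile keeps the stores in the loop; _Atomic additionally makes the whole-struct store well-defined, and for an 8-byte struct it's normally lock-free on x86-64.

    #include <stdatomic.h>

    struct two_int { int a; int b; };

    volatile struct two_int data_v;  /* stores can't be optimized away; not guaranteed atomic     */
    _Atomic  struct two_int data_a;  /* no data-race UB; lock-free in practice for 8-byte structs */

    void writer(void)
    {
        struct two_int ones = { 1, 1 };
        for (;;) {
            data_v = ones;                                               /* volatile store(s)    */
            atomic_store_explicit(&data_a, ones, memory_order_relaxed);  /* relaxed atomic store */
        }
    }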


If you had a function that did global_var++ on an int or long long, and you ran it from main and asynchronously from a signal handler, that would be a way to use re-entrancy to create data-race UB.
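Something like this (a hypothetical sketch, not code from the question): the same function is entered both from main and, asynchronously, from the handler, and the unsynchronized read-modify-write of global_var is the data race.

    #include <signal.h>
    #include <unistd.h>

    long long global_var = 0;

    void bump(void)           /* not async-signal-safe: plain RMW on a shared global */
    {
        global_var++;         /* may compile to one add [mem],1 or to load/inc/store */
    }

    void handler(int sig)
    {
        (void)sig;
        bump();               /* re-enters bump() asynchronously */
    }

    int main(void)
    {
        signal(SIGALRM, handler);
        alarm(1);
        for (;;)
            bump();           /* increments can be lost if the handler fires mid-RMW */
    }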

Depending on how it compiled (to a memory-destination inc or add, or to separate load/inc/store instructions), it would or wouldn't be atomic with respect to signal handlers in the same thread. See Can num++ be atomic for 'int num'? for more about atomicity on x86 and in C++. (C11's stdatomic.h and the _Atomic qualifier provide equivalent functionality to C++11's std::atomic<T> template.)

An interrupt or other exception can't happen in the middle of an instruction, so a memory-destination add is atomic wrt. context switches on a single-core CPU. Only a (cache-coherent) DMA writer could "step on" an increment done with an add [mem], 1 without a lock prefix on a single-core CPU; there aren't any other cores that another thread could be running on.

So it's similar to the case of signals: a signal handler runs instead of the normal execution of the thread handling the signal, so the signal can't be handled in the middle of one instruction.