Copying a bool from a parameter to a global - comparing compiler output

TL:DR: gcc's version is the most robust across all x86 uarches, avoiding false dependencies or extra uops. None of them are optimal; loading both bytes with one load should be even better.

The 2 key points here are:

  • The mainstream compilers only care about out-of-order x86 uarches for their default tuning for instruction selection and scheduling. All x86 uarches that are currently sold do out-of-order execution with register renaming (for full registers like RAX at least).

    No in-order uarches are still relevant for tune=generic. (The older Xeon Phi, Knight's Corner, used modified Pentium P54C-based in-order cores, and in-order Atom systems might still be around, but those are obsolete now, too. For in-order cores it would be important to do the stores after both loads, to allow memory-level parallelism in the loads.)

  • 8 and 16-bit partial registers are problematic and can lead to false dependencies. Why doesn't GCC use partial registers? explains the different behaviours for a variety of x86 uarches.


  1. Partial-register renaming to avoid false dependencies

Intel before IvyBridge renames AL separately from RAX (P6 family and SnB itself, but not later SnB-family). On all other uarches (including Haswell/Skylake, all AMD, and Silvermont / KNL), writing AL merges into RAX. For more about modern Intel (HSW and later) vs. P6-family and first-gen Sandybridge, see this Q&A: How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent.

On Haswell/Skylake, mov al, [rdi] decodes to a micro-fused ALU + load uop that merges the load result into RAX. (This is nice for bitfield merging, instead of having extra cost for the front-end to insert a later merging uop when reading the full register).

It performs exactly like add al, [rdi] or add rax, [rdi] would. (It's only an 8-bit load, but it has a dependency on the full width of the old value in RAX. Write-only instructions to low-8/low-16 regs like al or ax are not write-only as far as the microarchitecture is concerned.)

On P6 family (PPro to Nehalem) and Sandybridge (first generation of Sandybridge-family), clang's code is perfectly fine. Register-renaming makes the load/store pairs totally independent from each other, as if they'd used different architectural registers.

On all other uarches, Clang's code is potentially dangerous. If RAX was the target of some earlier cache-miss load in the caller, or some other long dependency chain, this asm would make the stores dependent on that other dep-chain, coupling them together and removing the opportunity for the CPU to find ILP.
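For example, a hypothetical caller sequence showing the coupling (the [rdx] load is just a stand-in for any long dependency chain ending in RAX; it's not from the question's code):

    mov   rax, [rdx]          # long-latency (e.g. cache-miss) load writes RAX
                              # ... clang-style copy follows:
    mov   al, [rdi]           # merges into RAX on most uarches, so it now waits for that miss
    mov   [rip+out1], al      # store-data can't execute until the merge completes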

The loads are still independent, because the loads are separate from the merging and can happen as soon as the load address rdi is known in the out-of-order core. The store-address is also known, so the store-address uops can execute (so later loads/stores can check for overlap), but the store-data uops are stuck waiting for the merge uops. (Stores on Intel are always 2 separate uops, but they can micro-fuse together in the front-end.)

Clang doesn't seem to understand partial registers very well, and creates false deps and partial-reg penalties for no reason sometimes, even when it doesn't save any code-size by using a narrow or al,dl instead of or eax,edx, for example.

In this case it saves a byte of code size per load (movzx has a 2-byte opcode).
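Concretely, the encodings for the [rdi] addressing mode (shown disassembly-style):

    8A 07         mov   al, byte ptr [rdi]        # 2 bytes
    0F B6 07      movzx eax, byte ptr [rdi]       # 3 bytes: 2-byte 0F B6 opcode + modrm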

  2. Why does gcc use movzx eax, byte ptr [mem]?

Writing EAX zero-extends to the full RAX, so it's always write-only with no false dependency on the old value of RAX on any CPU. See Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?
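So gcc's version (reconstructed from the description in this answer, not copied verbatim from the question) looks something like:

foo(In &):
    movzx  eax, byte ptr [rdi]        # write-only to RAX: no dependency on its old value
    mov    [rip+out1], al
    movzx  eax, byte ptr [rdi+1]
    mov    [rip+out2], al
    ret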

movzx eax, m8/m16 is handled purely in the load ports, not as a load + ALU-zero-extend, on Intel, and on AMD since Zen. The only extra cost is 1 byte of code-size. (AMD before Zen has 1 cycle of extra latency for movzx loads, and apparently they have to run on an ALU as well as a load port. Doing sign/zero-extension or broadcast as part of a load with no extra latency is the modern way, though.)

gcc is pretty fanatical about breaking false dependencies, e.g. pxor xmm0,xmm0 before cvtsi2ss/sd xmm0, eax, because Intel's poorly-designed instruction set merges into the low qword of the destination XMM register. (Short-sighted design for PIII which stores 128-bit registers as 2 64-bit halves, so int->FP conversion instructions would have taken an extra uop on PIII to also zero the high half if Intel had designed it with future CPUs in mind.)
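That dependency-breaking idiom looks like this:

    pxor      xmm0, xmm0        # break the false dependency by writing the whole register
    cvtsi2ss  xmm0, eax         # otherwise this would only merge into the low dword of xmm0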

The problem usually isn't within a single function; it's when these false dependencies create a loop-carried dependency chain across call/ret in different functions that you can unexpectedly get a big slowdown.

For example, store-data throughput is only 1 per clock (on all current x86 uarches), so 2 loads + 2 stores already takes at least 2 clocks.

If the struct is split across a cache line boundary, though, and the first load misses but the 2nd hits, avoiding a false dep would let the 2nd store write data to the store buffer before the first cache miss is finished. This would let loads on this core read from out2 via store-forwarding. (x86's strong memory ordering rules prevent the later store from becoming globally visible by committing to the store buffer ahead of the store to out1, but store-forwarding within a core/thread still works.)


  3. cmp/setcc: MSVC / ICC are just being dumb

The one advantage here is that putting the value into ZF avoids any partial-register shenanigans, but movzx is a better way to avoid it.

I'm pretty sure MS's x64 ABI agrees with the x86-64 System V ABI that a bool in memory is guaranteed to be 0 or 1, not 0 / non-zero.

In the C++ abstract machine, x == true has to be the same as x for a bool x, so (unless an implementation used different object-representation rules in structs vs. extern bool), it can always just copy the object representation (i.e. the byte).

If an implementation was going to use a one-byte 0 / non-0 (instead of 0 / 1) object representation for bool, it would need to cmp byte ptr [rcx], 0 to implement the booleanization in (int)(x == true), but here you're assigning to another bool so it could just copy. And we know it's not booleanizing 0 / non-zero because it compared against 1. I don't think it's intentionally being defensive against invalid bool values, otherwise why wouldn't it do that for out2 = in.in2?
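For reference, a sketch of the MSVC-style sequence being discussed next to a plain byte copy (not MSVC's exact output; Windows x64 passes the struct pointer in RCX):

    # re-derive the bool through FLAGS:
    cmp    byte ptr [rcx], 1          # ZF = (in.in1 == true)
    sete   byte ptr [rip+out1]        # cmp + sete: extra ALU work vs. just copying the byte

    # versus just copying the object representation:
    movzx  eax, byte ptr [rcx+1]
    mov    [rip+out2], al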

This just looks like a missed optimization; compilers are generally not awesome with bool. See Boolean values as 8 bit in compilers. Are operations on them inefficient? Some are better than others.

MSVC's setcc directly to memory is not bad, but cmp + setcc is 2 extra unnecessary ALU uops that didn't need to happen. For Ryzen, setcc m8 is 1 uop, 1/clock throughput (https://uops.info/). (Agner Fog reports one per 2 clocks for it, https://agner.org/optimize/. That's probably a typo, or maybe different measurement methodology, because automated testing/reporting by https://uops.info/ finds setcc [mem] is 1/clock throughput. On Steamroller, it's 1 uop / 1 per clock, and Zen didn't make much if anything worse than Bulldozer-family.)

On Intel, setcc m8 is 2 fused-domain uops (ALU + micro-fused store, for 3 back-end uops) and 1 per clock throughput, like you'd expect. (Or better than 1/clock on Ice Lake with its extra store port, but still worse than register.)

Note that multi-uop instructions like these can only decode in the "complex" decoder on Intel, and setbe / seta are (still, unfortunately) 2 uops even with a register destination.

  4. ICC's xor-zeroing before setz

I'm not sure if there's an implicit conversion to int anywhere in here in ISO C++'s abstract machine, or if == is defined for bool operands.

But anyway, if you are going to setcc into a register, it's not a bad idea to xor-zero it first for the same reason movzx eax,mem is better than mov al,mem. Even if you don't need the result zero-extended to 32-bit.

That's probably ICC's canned sequence for creating a boolean integer from a compare result.

It makes little sense to use xor-zero / cmp / setcc for the compare, but mov al, [m8] for the non-compare. The xor-zero is the direct equivalent of using a movzx load to break the false dependency here.
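A sketch of that inconsistency (reconstructed, not ICC's exact output):

    # dependency-breaking for the compare ...
    xor    eax, eax                   # zero the full register first
    cmp    byte ptr [rdi], 1
    sete   al
    mov    [rip+out1], al
    # ... but not for the plain copy
    mov    al, byte ptr [rdi+1]       # partial-register write: merges on most uarches
    mov    [rip+out2], al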

ICC is great at auto-vectorizing (e.g. it can auto-vectorize a search-loop like while(*ptr++ != 0){} while gcc/clang can only auto-vec loops with a trip count that's known ahead of the first iteration). But ICC is not great at little micro-optimizations like this; it often has asm output that looks more like the source (to its detriment) than gcc or clang.

  5. all reads "started" before doing anything with the results - so this kind of interleaving still actually matters?

It's not a bad thing. Memory disambiguation usually allows loads after stores to run early anyway. Modern x86 CPUs even dynamically predict when a load won't overlap with earlier unknown-address stores.

If the load and store addresses are exactly 4k apart, they alias on Intel CPUs, and the load is falsely detected as dependent on the store (4k aliasing).

Moving loads ahead of stores definitely makes things easier for the CPU; do this when possible.

Also, the front-end issues uops in-order into the out-of-order part of the core, so putting the loads first can let the 2nd one start maybe a cycle earlier. There's no benefit to having the first store done right away; it will have to wait for the load result before it can execute.
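i.e. something like this ordering, using a second register so both loads are in flight before either store-data uop needs a result (a sketch, not any compiler's actual output):

foo(In &):
    movzx  eax, byte ptr [rdi]        # both loads issue back-to-back
    movzx  ecx, byte ptr [rdi+1]
    mov    [rip+out1], al
    mov    [rip+out2], cl
    ret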

Reusing the same register does reduce register pressure, and GCC aims for low register pressure all the time, even when there isn't any to worry about, like in this not-inlined stand-alone version of the function. In my experience, gcc leans towards ways of generating code that create less register pressure in the first place, rather than only reining in its register use when there is actual pressure after inlining.

So instead of having 2 ways of doing things, gcc sometimes just only has the less-register-pressure way which it uses even when not inlining. For example, GCC used to almost always use setcc al / movzx eax,al to booleanize, but recent changes have let it use xor eax,eax / set-flags / setcc al to take the zero-extension off the critical path when there's a free register that can be zeroed ahead of whatever sets flags. (xor-zeroing also writes flags).
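For example, booleanizing an integer in EDI could use either idiom (a sketch, not any specific gcc version's output):

    # older idiom: zero-extension after setcc, on the critical path
    test   edi, edi
    setne  al
    movzx  eax, al

    # newer idiom when a spare register is available: zero ahead of time
    xor    eax, eax            # also writes flags, so it has to come before the test
    test   edi, edi
    setne  al                  # result is already zero-extended in EAX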


“passing through al as there's no memory to memory mov.”

There are memory-to-memory moves (the movs / movsb string instructions), but none worth using for single-byte copies, anyway. One possible (but sub-optimal) implementation is:

foo(In &):
    mov   rsi, rdi            # movsb reads from [rsi]
    lea   rdi, [rip+out1]
    movsb                     # copy in1 to out1; advances rsi to &in.in2
    lea   rdi, [rip+out2]
    movsb                     # copy in2 to out2 (DF=0 per the ABI, so rsi/rdi increment)
    ret

An implementation that's probably better than anything the compilers spotted is:

foo(In &):
    movzx  eax, word ptr [rdi]      # AH:AL = in2:in1
    mov    [rip+out1], al
    mov    [rip+out2], ah
    ret

Reading AH may have an extra cycle of latency, but this is great for throughput and code-size. If you care about latency, avoid the store/reload in the first place and use registers (by inlining this function).

If the two struct members were written separately and very recently, this 2-byte load will incur a store-forwarding stall. (Just extra latency for this store-forwarding, not actually stalling the pipeline until the store buffer drains.)

The other microarchitectural danger with this is a cache-line split on the load (if in.in2 is the first byte of a new cache line). That could take an extra 10 cycles. Or on pre-Skylake, if it's also split across a 4k boundary the penalty can be 100 cycles extra latency. But other than that, x86 has efficient unaligned loads, and it's normally a win to combine narrow loads / stores to save uops. (gcc7 and later typically do this when initializing multiple struct members even in cases where it can't know that it won't cross a cache-line boundary.)
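For example, initializing two adjacent bool members could be coalesced like this (a sketch of the gcc7+ behaviour; the member values and layout are assumptions):

    # two separate byte stores:
    mov    byte ptr [rdi], 1          # first member = true
    mov    byte ptr [rdi+1], 0        # second member = false
    # coalesced into one 16-bit store (little-endian, so 0x0001 = in2:in1 = 0:1):
    mov    word ptr [rdi], 0x0001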

The compiler should be able to prove that In &in can't alias extern bool out1, out2, because they have static storage and different types.

If you just had 2 pointers to bool, you wouldn't know (without bool *__restrict out1) that they don't point to members of the In object, so it wouldn't be safe to read in2 before writing out1 unless you checked for overlap first. But static bool out2 can't alias members of a static In object.


I've run all of the code sequences in a loop on Haswell. The following graph shows the execution time of each for 1 billion iterations in three cases:

  • There is a mov rax, qword [rdi+64] at the beginning of every iteration. This potentially creates a false register dependency (called dep in the graph).
  • There is an add eax, eax at the beginning of every iteration (called fulldep in the graph). This creates a loop-carried dependency and a false dependency. See also the image below for an illustration of all the true and false dependencies of add eax, eax, which also explains why it serializes execution in both directions.
  • Only the partial-register dependency is present (called nodep in the graph, which stands for no false dependency). This case has one fewer instruction per iteration compared to the previous one.

In all cases, the same memory locations are accessed in every iteration. For example, the Clang-like code that I tested looks like this:

mov     al, byte [rdi]
mov     byte [rsi + 4], al
mov     al, byte [rdi + 1]
mov     byte [rsi + 8], al

This is placed in a loop where rdi and rsi never change, and there is no memory aliasing. The results clearly show that the partial-register dependencies inflict a 7.5% slowdown on Clang's code. Peter's version, MSVC, and gcc are all clear winners in terms of absolute performance. Also note that in the second case, Peter's code does slightly better (2.02c per iteration for gcc and msvc, 2.04c for icc, but only 2.00c for Peter). Another possible metric of comparison is code size.
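For concreteness, the dep variant wraps that body in a loop something like this (a sketch; the counter register and harness details are my assumptions, not the exact test code):

    mov     ebp, 1000000000              # 1 billion iterations
.loop:
    mov     rax, qword [rdi + 64]        # the extra load that creates the false dependency
    # ... the four-instruction copy sequence under test goes here ...
    dec     ebp
    jnz     .loop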

[Graph: cycles per iteration for each compiler's version in the dep, fulldep, and nodep cases]

[Illustration: the true and false dependencies of add eax, eax]