Why do some C compilers set the return value of a function in weird places?

Since eax isn't used, compilers can zero the register whenever they want, and it works as expected.

An interesting thing that you didn't notice is the icc -O2 version:

xor       eax, eax
or        DWORD PTR [rsp], 32832
ldmxcsr   DWORD PTR [rsp]
movdqu    XMMWORD PTR array[rip], xmm0
movdqu    XMMWORD PTR 16+array[rip], xmm0
mov       DWORD PTR 32+array[rip], eax   ; set to 0 using the value of eax
mov       DWORD PTR 36+array[rip], eax

notice that eax is zeroed for the return value, but also used to zero 2 memory regions (last 2 instructions), probably because the instruction using eax is shorter than the instruction with the immediate zero operand.

So two birds with one stone.


Different instructions have different latencies. Sometimes changing the order of instructions can speed up the code for several reasons. For example: If a certain instruction takes several cycles to complete, if it is at the end of the function the program just waits until it is done. If it is earlier in the function other things can happen while that instruction finishes. That is unlikely the actual reason here, though, on second thought, as xor of registers is I believe a low-latency instruction. Latencies are processor dependent though.

However, placing the XOR there may have to do with separating the mov instructions between which it is placed.

There are also optimizations that take advantage of the optimization capabilities of modern processors such as pipelining, branch prediction (not the case here as far as I can see....), etc. You need a pretty deep understanding of these capabilities to understand what an optimizer may do to take advantage of them.

You might find this informative. It pointed me to Agner Fog's site, a resource I have not seen before but has a lot of the information you wanted (or didn't want :-) ) to know but were afraid to ask :-)


Those memory accesses are expected to burn at least several clock cycles. You can move the xor without changing the functionality of the code. By pulling it back with one/some memory accesses after it it becomes free, doesnt cost you any execution time it is parallel with the external access (the processor finishes the xor and waits on the external activity rather than just waits on the external activity). If you put it in a clump of instructions without memory accesses it costs a clock at least. And as you probably know using the xor vs mov immediate reduces the size of the instruction, probably not costing clocks but saving space in the binary. A ghee whiz kinda cool optimization that dates back to the original 8086, and is still used today even if it doesnt save you much in the end.