Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?

The most likely cause of the speed improvement is that:

  • inserting a MOV shifts the subsequent instructions to different memory addresses
  • one of those moved instructions was an important conditional branch
  • that branch was being incorrectly predicted due to aliasing in the branch prediction table
  • moving the branch eliminated the alias and allowed the branch to be predicted correctly

Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.
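
To see why the dilution matters, here is a toy simulation (my own deliberately simplified model in C, not Core2's actual predictor): a table of 2-bit saturating counters indexed by a shared global history register predicts a perfectly alternating branch nearly 100% of the time on its own, but noticeably worse once an uncorrelated branch feeds the same history:

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy global-history predictor: 2-bit saturating counters indexed by a
     * shared history register of recent outcomes.  Deliberately simplified
     * for illustration; not any real CPU's predictor. */
    #define HIST_BITS 4
    #define TABLE_SIZE (1 << HIST_BITS)

    static int counters[TABLE_SIZE];   /* 0..3; >= 2 means "predict taken" */
    static unsigned history;           /* shared global history register   */

    static int predict_and_update(int outcome)
    {
        unsigned idx = history & (TABLE_SIZE - 1);
        int correct = (counters[idx] >= 2) == outcome;
        if (outcome  && counters[idx] < 3) counters[idx]++;
        if (!outcome && counters[idx] > 0) counters[idx]--;
        history = (history << 1) | (unsigned)outcome;
        return correct;
    }

    static double accuracy(int with_noise)
    {
        int correct = 0;
        for (int i = 0; i < TABLE_SIZE; i++) counters[i] = 1;
        history = 0;
        srand(42);
        for (int i = 0; i < 100000; i++) {
            correct += predict_and_update(i & 1);   /* branch A: alternates */
            if (with_noise)
                predict_and_update(rand() & 1);     /* branch B: random noise
                                                       polluting the history */
        }
        return 100.0 * correct / 100000;
    }

    int main(void)
    {
        printf("branch A alone:            %.1f%%\n", accuracy(0));
        printf("branch A + noisy branch B: %.1f%%\n", accuracy(1));
        return 0;
    }

The exact percentages will vary with the table size and the noise source; only the direction of the effect matters here.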

This little branch prediction tutorial shows how branch prediction buffers work. The prediction buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important, uncorrelated branches share the same lower bits. In that case you end up with aliasing, which causes many mispredicted branches (stalling the instruction pipeline and slowing down your program).
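
Here is a sketch of that failure mode (again a toy model in C, with invented addresses): two branches that are each trivially predictable on their own thrash a shared 2-bit counter when their addresses collide in the low bits, and stop doing so once one of them is shifted by the size of an inserted MOV:

    #include <stdio.h>
    #include <stdint.h>

    /* Toy address-indexed branch history table: one 2-bit counter per slot,
     * the slot chosen by the low bits of the branch address.  The addresses
     * below are invented for illustration. */
    #define INDEX_BITS 10
    #define SLOTS (1 << INDEX_BITS)

    static int bht[SLOTS];

    static int mispredicted(uint64_t addr, int outcome)
    {
        unsigned idx = (unsigned)(addr & (SLOTS - 1));
        int wrong = (bht[idx] >= 2) != outcome;
        if (outcome  && bht[idx] < 3) bht[idx]++;
        if (!outcome && bht[idx] > 0) bht[idx]--;
        return wrong;
    }

    static int run(uint64_t a, uint64_t b)
    {
        int misses = 0;
        for (int i = 0; i < SLOTS; i++) bht[i] = 1;
        for (int i = 0; i < 100000; i++) {
            misses += mispredicted(a, 1);   /* branch a: always taken     */
            misses += mispredicted(b, 0);   /* branch b: always not taken */
        }
        return misses;
    }

    int main(void)
    {
        uint64_t a = 0x401237, b = 0x40b237;  /* low 10 bits collide */
        printf("aliased:     %d mispredictions\n", run(a, b));
        /* b + 2 models a 2-byte MOV inserted ahead of branch b, which
         * shifts it to a different slot and removes the collision.    */
        printf("not aliased: %d mispredictions\n", run(a, b + 2));
        return 0;
    }

In the aliased case every single prediction is wrong, because the two branches keep flipping the shared counter against each other; moving one branch by two bytes makes both essentially perfect.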

If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643
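
If you'd rather feel the cost directly, here is a standard microbenchmark in the spirit of that linked answer (a sketch; on some compilers the branch gets compiled into a conditional move, which hides the effect, so keep optimization modest, e.g. gcc -O1):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)
    #define REPS 100

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    static long sum_big(const int *data)
    {
        long sum = 0;
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)      /* hard to predict on random data */
                sum += data[i];
        return sum;
    }

    static double time_reps(const int *data, long *sum)
    {
        clock_t t0 = clock();
        for (int r = 0; r < REPS; r++) *sum += sum_big(data);
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        static int data[N];
        long sum = 0;
        for (int i = 0; i < N; i++) data[i] = rand() % 256;

        double shuffled = time_reps(data, &sum);
        qsort(data, N, sizeof data[0], cmp_int);  /* branch now predictable */
        double sorted = time_reps(data, &sum);

        printf("random order: %.2fs, sorted: %.2fs (sum=%ld)\n",
               shuffled, sorted, sum);
        return 0;
    }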

Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.
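
For example, Cachegrind can simulate branch prediction if you pass it --branch-sim=yes (i.e. run valgrind --tool=cachegrind --branch-sim=yes ./yourprog), and cg_annotate will then map the conditional-branch misprediction counts back to individual source lines. Bear in mind that it simulates a generic predictor rather than your exact CPU's, so treat the counts as a guide rather than ground truth.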


You may want to read http://research.google.com/pubs/pub37077.html

TL;DR: randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this. It's usually a combination of branch predictor and cache behaviour, but it can just as well be e.g. a reservation station stall (even in case there are no dependency chains that are broken or obvious resource over-subscriptions whatsoever).
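
If you want to poke at this yourself, here is one rough way to reproduce the sensitivity (my own sketch for gcc or clang on x86-64, not the methodology of the paper): emit a configurable run of one-byte NOPs ahead of a hot loop so that everything after the padding lands at different addresses, then re-time at each padding size:

    #include <stdio.h>
    #include <time.h>

    /* PAD one-byte NOPs are emitted just before the hot loop, shifting all
     * of the following code in memory.  Rebuild and re-time with e.g.
     *   gcc -O2 -DPAD=0 ...   gcc -O2 -DPAD=1 ...   and so on. */
    #ifndef PAD
    #define PAD 0
    #endif
    #define STR_(x) #x
    #define STR(x) STR_(x)

    int main(void)
    {
        volatile long sum = 0;
        clock_t t0 = clock();
        __asm__ volatile (".rept " STR(PAD) "\n\tnop\n\t.endr");
        for (long i = 0; i < 200000000L; i++)
            sum += (i & 3) ? i : -i;           /* some branchy work */
        printf("PAD=%d: %.2fs\n", PAD,
               (double)(clock() - t0) / CLOCKS_PER_SEC);
        return 0;
    }

GCC's -falign-functions and -falign-loops flags exist precisely to control this kind of code placement deliberately, which hints at how sensitive the underlying machinery can be.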


I believe that in modern CPUs, assembly instructions, while being the last layer visible to the programmer, are actually several layers removed from what the CPU really executes.

Modern CPUs are RISC/CISC hybrids that translate CISC x86 instructions into internal instructions that are more RISC-like in behavior. Additionally, there are out-of-order execution engines, branch predictors, and Intel's micro-op fusion, which tries to group instructions into larger batches of simultaneous work (kind of like the VLIW/Itanium titanic). There are even cache boundaries that could make code run faster for god-knows-why if it's bigger (maybe the cache controller slots it more intelligently, or keeps it around longer).

CISC has always had an assembly-to-microcode translation layer, but the point is that with modern CPUs things are much, much more complicated. With all the extra transistor real estate afforded by modern fabrication processes, CPUs can probably apply several optimization approaches in parallel and then select the one at the end that provides the best speedup. The extra instructions may be biasing the CPU to use one optimization path that is better than the others.

The effect of the extra instructions probably depends on the CPU model / generation / manufacturer, and isn't likely to be predictable. Optimizing assembly language this way would require testing against many CPU generations, perhaps with CPU-specific code paths, and would only be worthwhile for really important code sections, although if you're writing assembly, you probably already know that.