Enforcing statement order in C++

I'd like to try to provide a somewhat more comprehensive answer after this was discussed with the C++ standards committee. In addition to being a member of the C++ committee, I'm also a developer on the LLVM and Clang compilers.

Fundamentally, there is no way to use a barrier or some operation in the sequence to achieve these transformations. The fundamental problem is that the operational semantics of something like an integer addition are totally known to the implementation. It can simulate them, it knows they cannot be observed by correct programs, and is always free to move them around.

We could try to prevent this, but it would have extremely negative results and would ultimately fail.

First, the only way to prevent this in the compiler is to tell it that all of these basic operations are observable. The problem is that this then would preclude the overwhelming majority of compiler optimizations. Inside the compiler, we have essentially no good mechanisms to model that the timing is observable but nothing else. We don't even have a good model of what operations take time. As an example, does converting a 32-bit unsigned integer to a 64-bit unsigned integer take time? It takes zero time on x86-64, but on other architectures it takes non-zero time. There is no generically correct answer here.

But even if we succeed through some heroics at preventing the compiler from reordering these operations, there is no guarantee this will be enough. Consider a valid and conforming way to execute your C++ program on an x86 machine: DynamoRIO. This is a system that dynamically evaluates the machine code of the program. One thing it can do is online optimizations, and it is even capable of speculatively executing the entire range of basic arithmetic instructions outside of the timing. And this behavior isn't unique to dynamic evaluators, the actual x86 CPU will also speculate (a much smaller number of) instructions and reorder them dynamically.

The essential realization is that the fact that arithmetic isn't observable (even at the timing level) is something that permeates the layers of the computer. It is true for the compiler, the runtime, and often even the hardware. Forcing it to be observable would both dramatically constrain the compiler, but it would also dramatically constrain the hardware.

But all of this should not cause you to lose hope. When you want to time the execution of basic mathematical operations, we have well studied techniques that work reliably. Typically these are used when doing micro-benchmarking. I gave a talk about this at CppCon2015: https://youtu.be/nXaxk27zwlk

The techniques shown there are also provided by various micro-benchmark libraries such as Google's: https://github.com/google/benchmark#preventing-optimization

The key to these techniques is to focus on the data. You make the input to the computation opaque to the optimizer and the result of the computation opaque to the optimizer. Once you've done that, you can time it reliably. Let's look at a realistic version of the example in the original question, but with the definition of foo fully visible to the implementation. I've also extracted a (non-portable) version of DoNotOptimize from the Google Benchmark library which you can find here: https://github.com/google/benchmark/blob/v1.0.0/include/benchmark/benchmark_api.h#L208

#include <chrono>

template <class T>
__attribute__((always_inline)) inline void DoNotOptimize(const T &value) {
  asm volatile("" : "+m"(const_cast<T &>(value)));
}

// The compiler has full knowledge of the implementation.
static int foo(int x) { return x * 2; }

auto time_foo() {
  using Clock = std::chrono::high_resolution_clock;

  auto input = 42;

  auto t1 = Clock::now();         // Statement 1
  DoNotOptimize(input);
  auto output = foo(input);       // Statement 2
  DoNotOptimize(output);
  auto t2 = Clock::now();         // Statement 3

  return t2 - t1;
}

Here we ensure that the input data and the output data are marked as un-optimizable around the computation foo, and only around those markers are the timings computed. Because you are using data to pincer the computation, it is guaranteed to stay between the two timings and yet the computation itself is allowed to be optimized. The resulting x86-64 assembly generated by a recent build of Clang/LLVM is:

% ./bin/clang++ -std=c++14 -c -S -o - so.cpp -O3
        .text
        .file   "so.cpp"
        .globl  _Z8time_foov
        .p2align        4, 0x90
        .type   _Z8time_foov,@function
_Z8time_foov:                           # @_Z8time_foov
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rbx
.Ltmp0:
        .cfi_def_cfa_offset 16
        subq    $16, %rsp
.Ltmp1:
        .cfi_def_cfa_offset 32
.Ltmp2:
        .cfi_offset %rbx, -16
        movl    $42, 8(%rsp)
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, %rbx
        #APP
        #NO_APP
        movl    8(%rsp), %eax
        addl    %eax, %eax              # This is "foo"!
        movl    %eax, 12(%rsp)
        #APP
        #NO_APP
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        subq    %rbx, %rax
        addq    $16, %rsp
        popq    %rbx
        retq
.Lfunc_end0:
        .size   _Z8time_foov, .Lfunc_end0-_Z8time_foov
        .cfi_endproc


        .ident  "clang version 3.9.0 (trunk 273389) (llvm/trunk 273380)"
        .section        ".note.GNU-stack","",@progbits

Here you can see the compiler optimizing the call to foo(input) down to a single instruction, addl %eax, %eax, but without moving it outside of the timing or eliminating it entirely despite the constant input.

Hope this helps, and the C++ standards committee is looking at the possibility of standardizing APIs similar to DoNotOptimize here.


Summary:

There seems to be no guaranteed way to prevent reordering, but as long as link-time/full-program optimisation is not enabled, locating the called function in a separate compilation unit seems a fairly good bet. (At least with GCC, although logic would suggest that this is likely with other compilers too.) This comes at the cost of the function call - inlined code is by definition in the same compilation unit and open to reordering.

Original answer:

GCC reorders the calls under -O2 optimisation:

#include <chrono>
static int foo(int x)    // 'static' or not here doesn't affect ordering.
{
    return x*2;
}
int fred(int x)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    int y = foo(x);
    auto t2 = std::chrono::high_resolution_clock::now();
    return y;
}

GCC 5.3.0:

g++ -S --std=c++11 -O0 fred.cpp :

_ZL3fooi:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %ecx, 16(%rbp)
        movl    16(%rbp), %eax
        addl    %eax, %eax
        popq    %rbp
        ret
_Z4fredi:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $64, %rsp
        movl    %ecx, 16(%rbp)
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, -16(%rbp)
        movl    16(%rbp), %ecx
        call    _ZL3fooi
        movl    %eax, -4(%rbp)
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, -32(%rbp)
        movl    -4(%rbp), %eax
        addq    $64, %rsp
        popq    %rbp
        ret

But:

g++ -S --std=c++11 -O2 fred.cpp :

_Z4fredi:
        pushq   %rbx
        subq    $32, %rsp
        movl    %ecx, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        call    _ZNSt6chrono3_V212system_clock3nowEv
        leal    (%rbx,%rbx), %eax
        addq    $32, %rsp
        popq    %rbx
        ret

Now, with foo() as an extern function:

#include <chrono>
int foo(int x);
int fred(int x)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    int y = foo(x);
    auto t2 = std::chrono::high_resolution_clock::now();
    return y;
}

g++ -S --std=c++11 -O2 fred.cpp :

_Z4fredi:
        pushq   %rbx
        subq    $32, %rsp
        movl    %ecx, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movl    %ebx, %ecx
        call    _Z3fooi
        movl    %eax, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movl    %ebx, %eax
        addq    $32, %rsp
        popq    %rbx
        ret

BUT, if this is linked with -flto (link-time optimisation):

0000000100401710 <main>:
   100401710:   53                      push   %rbx
   100401711:   48 83 ec 20             sub    $0x20,%rsp
   100401715:   89 cb                   mov    %ecx,%ebx
   100401717:   e8 e4 ff ff ff          callq  100401700 <__main>
   10040171c:   e8 bf f9 ff ff          callq  1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
   100401721:   e8 ba f9 ff ff          callq  1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
   100401726:   8d 04 1b                lea    (%rbx,%rbx,1),%eax
   100401729:   48 83 c4 20             add    $0x20,%rsp
   10040172d:   5b                      pop    %rbx
   10040172e:   c3                      retq