Why is it not cost effective to inline functions with loops or switch statements?

The purpose of a coding style guide is to tell you that if you are reading it you are unlikely to have added an optimisation to a real compiler, even less likely to have added a useful optimisation (measured by other people on realistic programs over a range of CPUs), therefore quite unlikely to be able to out-guess the guys who did. At least, do not mislead them, for example, by putting the volatile keyword in front of all your variables.

Inlining decisions in a compiler have very little to do with 'Making a Simple Branch Predictor Happy'. Or less confused.

First off, the target CPU may not even have branch prediction.

Second, a concrete example:

Imagine a compiler which has no other optimisation (turned on) except inlining. Then the only positive effect of inlining a function is that bookkeeping related to function calls (saving registers, setting up locals, saving the return address, and jumping to and back) are eliminated. The cost is duplicating code at every single location where the function is called.

In a real compiler dozens of other simple optimisations are done and the hope of inlining decisions is that those optimisations will interact (or cascade) nicely. Here is a very simple example:

int f(int s)
{
 ...;
 switch (s) {
   case 1: ...; break;
   case 2: ...; break;
   case 42: ...; return ...;
 }
 return ...;
}

void g(...)
{
  int x=f(42);
  ...
}

When the compiler decides to inline f, it replaces the RHS of the assignment with the body of f. It substitutes the actual parameter 42 for the formal parameter s and suddenly it finds that the switch is on a constant value...so it drops all the other branches and hopefully the known value will allow further simplifications (ie they cascade).

If you are really lucky all calls to the function will be inlined (and unless f is visible outside) the original f will completely disappear from your code. So your compiler eliminated all the bookkeeping and made your code smaller at compile time. And made the code more local at runtime.

If you are unlucky, the code size grows, locality at runtime decreases and your code runs slower.

It is trickier to give a nice example when it is beneficial to inline loops because one has to assume other optimisations and the interactions between them.

The point is that it is hellishly difficult to predict what happens to a chunk of code even if you know all the ways the compiler is allowed to change it. I can't remember who said it but one should not be able to recognise the executable code produced by an optimising compiler.

I think it might be worth to extend the example provided by @user1666959. I'll answer to provide cleaner example code. Let's consider such scenario.

/// Counts odd numbers in range [0;number]
size_t countOdd(size_t number)
{
    size_t result = 0;
    for (size_t i = 0; i <= number; ++i)
    {
        result += (i % 2);
    }
    return result;
}

int main()
{
    return countOdd(5);
}

If the function is not inlined and uses external linking, it will execute whole loop. Imagine what happens when you inline it.

int main()
{
    size_t result = 0;
    for (size_t i = 0; i <= 5; ++i)
    {
        result += (i % 2);
    }
    return result;
}

Now let's enable loop unfolding optimization. Here we know that it iterates from 0 to 5, so it can be easily unfolded removing unwanted conditions in the code.

int main()
{
    size_t result = 0;
    // iteration 0
    size_t i = 0
    result += (i % 2);
    // iteration 1
    ++i
    result += (i % 2);
    // iteration 2
    ++i
    result += (i % 2);
    // iteration 3
    ++i
    result += (i % 2);
    // iteration 4
    ++i
    result += (i % 2);
    // iteration 5
    ++i
    result += (i % 2);
    return result;
}

No conditions, it is faster already but that's not all. We know the value of i, so why not passing it directly?

int main()
{
    size_t result = 0;
    // iteration 0
    result += (0 % 2);
    // iteration 1
    result += (1 % 2);
    // iteration 2
    result += (2 % 2);
    // iteration 3
    result += (3 % 2);
    // iteration 4
    result += (4 % 2);
    // iteration 5
    result += (5 % 2);
    return result;
}

Even simpler but whait, those operations are constexpr, we can calculate them during compilation.

int main()
{
    size_t result = 0;
    // iteration 0
    result += 0;
    // iteration 1
    result += 1;
    // iteration 2
    result += 0;
    // iteration 3
    result += 1;
    // iteration 4
    result += 0;
    // iteration 5
    result += 1;
    return result;
}

So now the compiler sees that some of those operations don't have any effects leaving only those, which change the value. After that it removes unnecessary temporary variables and performs as much calculations, as it can during compilation, your code ends up with:

int main()
{
    return 3;
}

While in your case, the performance degradation seems to be caused by branch mispredictions, I don't think that's the reason why the Google style guide advocates against inline functions containing loops or switch statements. There are use cases where the branch predictor can benefit from inlining.

A loop is often executed hundreds of times, so the execution time of the loop is much larger than the time saved by inlining. So the performance benefit is negligible (see Amdahl's law). OTOH, inlining functions results in increase of code size which has negative effects on the instruction cache.

In the case of switch statements, I can only guess. The rationale might be that jump tables can be rather large, wasting much more memory in the code segment than is obvious.

I think the keyword here is cost effective. Functions that cost a lot of cycles or memory are typically not worth inlining.

Inlining functions with conditional branches makes it more difficult for the CPU to accurately predict the branch statements, since each instance of the branch is independent.

If there are several branch statements, successful branch prediction saves a lot more cycles than the cost of calling the function.

Similar logic applies to unrolling loops with switch statements.

The Google guide referenced doesn't mention anything about functions returning values, so I'm assuming that reference is elsewhere, and requires a different question with an explicit citation.

Why is it not cost effective to inline functions with loops or switch statements?

Tags:

C++

C

Compiler Optimization

Inline

Related

Recent Posts