Optimize nested if statements within a loop in C/C++ with GCC

Consider templates. The challenge is in mapping runtime values to compile-time template parameters. The boilerplate below is one dispatch function per parameter, and the compiler will create the tree of combinations for you. Not exactly elegant, but scales much better than open-coding a multi-parameter switchyard.

You can also use the template parameters (or functions of them) directly in your computations, and those will be optimized out as well, for example choosing a constant based on a template parameter, or multiplying a 0 into an expression term that you don't want to contribute.

template <bool B0, bool B1, bool B2>
void doStuffStage3()
{
    // Once you get here, you can use B0, B1, and B2 in
    // any expressions you want, in the inner loop, and the compiler
    // will optimize everything out since they're known compile-time.  Basically,
    // the compiler will create separate versions of this function
    // for all required combinations of the input
    do {
        if(B0) {

        } else {

        }
    } while(testCondition());
}

template <bool B0, bool B1>
void doStuffStage2(bool b2)
{
    if(b2) doStuffStage3<B0,B1,true>();
    else   doStuffStage3<B0,B1,false>();
}

template <bool B0>
void doStuffStage1(bool b1, bool b2)
{
    if(b1) doStuffStage2<B0,true> (b2);
    else   doStuffStage2<B0,false>(b2);
}

void doStuff(bool b0, bool b1, bool b2)
{
    if(b0) doStuffStage1<true> (b1, b2);
    else   doStuffStage1<false>(b1, b2);
}

int main()
{
    doStuff(getA(), getB(), getC());
}

The Theory:

Trying to optimize your code through some wacky rewriting might make it difficult for the compiler to make its usual optimizations. The compiler and also the processor can optimize the code using 2 techniques:

  1. Branch prediction: The compiler can do that by using profile guided optimizations, mainly by estimating the probability of each branch. The CPU has also branch target buffers that try to detect the branching pattern, in addition to calculating statistics for each target.
  2. Branch predication: The compiler or CPU will make the code execute both branches in parallel (because nowadays processors are superscalar) and based on the condition result, it will just disregard the results of the incorrect path (e.g. CMOV instruction). You can try to disable branch predication using: -fno-if-conversion and -fno-if-conversion2. This might help if there is much computation on each branch and executing all paths will lead to a waste of instruction decoders and execution ports.

As a simple developer, using gcc, you can also help branch prediction or code generation using the "likely" and "unlikely" compilation hints. Check here for more details. This might work if you know for example that one condition is more likely to happen than another.

To see the branch prediction efficiency, use perf stat ./binary and check out the branch miss ratio, and the number of branch misses for each optimization you do.

In your code case:

If conditionA, conditionB and conditionC are computed before the loop, and do not change, then it is easy for the branch predictor to detect the pattern. The CPU's predictor does that by keeping track of the last branches taken/not taken and it will use the recorded history to predict the following branches. So I actually expect very little performance penalty due to branches in your code, which you can verify as above.