Performance of "conditional call" on amd64

As @fuz pointed out in the comments, the performance issue is almost certainly due to the Return Address Stack (RAS), which is a specialized branch predictor for function returns.

As an advantage of having separate call and ret instructions from jmp and manual stack modification, CPUs are clued in to the intent of the running code. In particular, when we call a function it is probably going to ret and when it does we are going to jump back to the rip pushed before the call. In other words, calls are usually paired with a ret. The CPU leverages this by keeping a fixed-length stack of just return addresses called the return address stack (RAS). call instructions in addition to pushing the return address to the actual in-memory stack will additionally push it to the RAS. This way, when a ret is encountered the CPU can pop off of the RAS (which is much faster than the memory access for the actual stack) and speculatively execute the return. If it turns out that the address popped from the RAS was the one popped from the stack, the CPU continues with no penalty. However, if the RAS predicted the wrong return address, a pipeline flush occurs, which is costly.

My original intuition was that the conditional instructions would be better because they would give time for the result of the comparison to arrive before the jump. However, whatever benefit that may have provided, having an unbalanced jmp/ret (my conditional call replaced call with jmp, but the called function still used a ret) caused the RAS to likely always predict the wrong return address (and thus my approach, despite originally trying to avoid this, causes more pipeline stalls). The speedup from the RAS is more significant than my "optimization" so the branching approach outperformed the conditional call approach.

According to some empirical results mismatching call and ret (in particular using a jmp + ret) take 5-6 times more cycles than properly pairing call and ret. Some napkin math would suggest that a penalty of +21 cycles at 3.1GHz for 1,048,576 calls add about 7.1ms to the total runtime. The slowdown observed was less than that. This is likely a combination of the conditional instructions delaying the jump until the condition was ready and the fact that the jumps were oscillating between fixed locations in memory (which the other branch predictors likely became good at predicting).

You can exactly determine why the conditional_call approach is slower than branch_over_call. You've done your experiments on two KBL processors, but the blog post you were referred to doesn't discuss how the RAS works on KBL. So the first step of the analysis is to determine whether the ret in the negate function is being mispredicted or not (as what would happen on earlier microarchitectures). The second step is to determine what is the cost of mispredicting that ret instruction on total execution time. The closest thing that I have to KBL is CFL and my numbers turned out to be close to yours. The only relevant difference between the two is that LSD is enabled in CFL but disabled in KBL. However, the LSD is irrelevant in this case because of the call instruction in the loop which prevents the LSD from detecting any loop. You can also easily repeat the same analysis on KBL.

There are several ways to analyze the behavior of branch instructions. But in this particular case, the code is simple enough for the event counting method to reveal all the information that we need about every static branch instruction.

The BR_INST_RETIRED_* performance events can be used count the total number of dynamic branch instructions retired and the total number of specific types of retired branch instructions including conditional, calls, and returns. The BR_MISP_RETIRED_* events can be used to count total mispredictions, total conditional mispredictions, and total call mispredictions.

The complete control-glow graph of conditional_call looks like this:

           total   misp
call         1      0
    jl       1     0.5
       ret  0.5     1
    ret      1      0
jne          1      0

The first call instruction calls the conditional_call function, which contains jl and ret. The jl instruction conditionally jumps to the negate function, which contains ret. The jne instruction is used to for the loop. The numbers shown in the first and second column are normalized by the total number of iterations and total number of dynamic instructions, respectively. We know from the static structure of the program that call, jl, conditional_call's ret, and jne are each executed once in every iteration. The inner most ret is only executed when the jl branch is taken. Using the performance events, we can count the total number of executed return instructions and subtract from it the total number of iterations to get the number of times the inner most ret is executed. Because the input is randomized according to the uniform distribution, it shouldn't be surprising that the inner most ret is executed half of the time.

The call instruction is never mispredicted. The jne instruction is also never mispredicted except for the last execution of the instructions (where it exits the loop). Therefore, we can attribute the total number of conditional mispredictions to the jl instruction. That can be subtracted from the total number of mispredictions to get the number of return mispredictions which can be attributed to either or both of the return instructions. The second ret may mispredict when the misprediction of the first ret clobbers or misaligns the RAS. One way to determine whether the second ret is ever mispredicted is by using precise sampling of BR_MISP_RETIRED.ALL_BRANCHES. Another way is by using the method described in the blog post you cited. Indeed, only the inner most ret is mispredicted. The fact that jl is mispredicted half of the time suggests that the instruction is either being predicted always taken or always not taken.

The complete control-glow graph of branch_over_call looks like this:

           total   misp
call         1      0
    jg       1     0.5
    call    0.5     0
        ret 0.5     0
    ret      1      0
jne          1      0

The only instruction that is mispredicted is jg, which is mispredicted half of the time.

To measure the average cost of a single ret misprediction in the conditional_call approach, the ret instruction can be replaced with a lea/jmp sequence so that BTB rather than the RAS is used for making predictions. With this change, the only instruction that is mispredicted is jl. The difference in execution time can be considered as an estimate for the total cost of ret mispredictions. On my CFL processor, this is about 11.3 cycles per ret misprediction. In addition, conditional_call has become about 3% faster than branch_over_call. Your numbers on KBL indicate that the average cost of a ret misprediction is about 13 cycles. I'm not sure what the reason for this difference is. It may not be microarchitectural. I've used gcc 7.3 but you used gcc 8, so perhaps there are some differences in the code or the alignments of different pieces of code that are causing the discrepancy between our results.