why is c++ std::max_element so slow?

Before voting on this answer, please test (and verify) this on your machine and comment/add the results. Note that I used a vector size of 1000*1000*1000 for my tests. Currently, this answer has 19 upvotes but only one posted results, and these results did not show the effect described below (though obtained with a different test code, see comments).


There seems to be an optimizer bug/artifact. Compare the times of:

template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_orig(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
  if (__first == __last) return __first;
  _ForwardIterator __result = __first;

  while(++__first != __last)
    if (__comp(__result, __first))
      __result = __first;

  return __result;
}

template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_changed(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
  if (__first == __last) return __first;
  _ForwardIterator __result = __first;
  ++__first;

  for(; __first != __last; ++__first)
    if (__comp(__result, __first))
      __result = __first;

  return __result;
}

The first is the original libstdc++ implementation, the second one should be a transformation without any changes in behaviour or requirements. Clang++ produces very similar run times for those two functions, whereas g++4.8.2 is four times faster with the second version.


Following Maxim's proposal, changing the vector from int to int64_t, the changed version is not 4, but only 1.7 times faster than the original version (g++4.8.2).


The difference is in predictive commoning of *result, that is, storing the value of the current max element so that it does not have to be reloaded from memory each time. This gives a far cleaner cache access pattern:

w/o commoning     with commoning
*                 *
**                 *
 **                 *
  **                 *
  * *                 *
  *  *                 *
  *   *                 *

Here's the asm for comparison (rdi/rsi contain the first/last iterators respectively):

With the while loop (2.88743 ms; gist):

    movq    %rdi, %rax
    jmp .L49
.L51:
    movl    (%rdi), %edx
    cmpl    %edx, (%rax)
    cmovl   %rdi, %rax
.L49:
    addq    $4, %rdi
    cmpq    %rsi, %rdi
    jne .L51

With the for loop (1235.55 μs):

    leaq    4(%rdi), %rdx
    movq    %rdi, %rax
    cmpq    %rsi, %rdx
    je  .L53
    movl    (%rdi), %ecx
.L54:
    movl    (%rdx), %r8d
    cmpl    %r8d, %ecx
    cmovl   %rdx, %rax
    cmovl   %r8d, %ecx
    addq    $4, %rdx
    cmpq    %rdx, %rsi
    jne .L54
.L53:

If I force commoning by explicitly storing *result into a variable prev at the start and whenever result is updated, and using prev instead of *result in the comparison, I get an even faster loop (377.601 μs):

    movl    (%rdi), %ecx
    movq    %rdi, %rax
.L57:
    addq    $4, %rdi
    cmpq    %rsi, %rdi
    je  .L60
.L59:
    movl    (%rdi), %edx
    cmpl    %edx, %ecx
    jge .L57
    movq    %rdi, %rax
    addq    $4, %rdi
    movl    %edx, %ecx
    cmpq    %rsi, %rdi
    jne .L59
.L60:

The reason this is faster than the for loop is that the conditional moves (cmovl) in the above are a pessimisation as they are executed so rarely (Linus says that cmov is only a good idea if the branch is unpredictable). Note that for randomly distributed data the branch is expected to be taken Hn times, which is a negligible proportion (Hn grows logarithmically, so Hn/n rapidly approaches 0). The conditional-move code will only be better on pathological data e.g. [1, 0, 3, 2, 5, 4, ...].


You are probably running your test in 64-bit mode, where sizeof(int) == 4, but sizeof(std::vector<>::iterator) == 8, so that assignment in the loop to int (what my_max_element does) is faster than to std::vector<>::iterator (this is what std::max_element does).

If you change std::vector<int> to std::vector<long> results change in favour to std::max_element:

MaxIter = 1000000012
MaxArray = 1000000012
Total CPU time iterator = 0.00429082
Total CPU time array = 0.00572205
iter/array ratio: = 0.749875

One important note: when benchmarking disable the CPU frequency scaling, so that the CPU does not switch gears in the middle of the benchmark.


But I think something else is at play here, since just changing the loop variable from int to long does not change the results...


It's a simple issue of cache. To wit, the first time you load memory, in this case the contents of the vector, it's always considerably slower than if it's been recently accessed. I copied and pasted your code with GCC 4.9.

When the functions are reversed, the ratio is 1. When they're in the original order, the ratio is 1.6.

This still seems like a fundamental misoptimization by GCC in the case of max_element to me. However, your function times are so low, they will be dominated by CPU noise like the above cache effects, instead of any meaningful comparison.

Reversed, Original