Why is grouped summation slower with sorted groups than unsorted groups?

Setup / make it slow

First of all, the program runs in about the same time regardless of input order:

sumspeed$ time ./sum_groups < groups_shuffled 
11558358

real    0m0.705s
user    0m0.692s
sys 0m0.013s

sumspeed$ time ./sum_groups < groups_sorted
24986825

real    0m0.722s
user    0m0.711s
sys 0m0.012s

Most of the time is spent in the input loop. But since we're interested in grouped_sum(), let's ignore that.

After changing the benchmark loop from 10 to 1000 iterations, grouped_sum() starts to dominate the run time:

sumspeed$ time ./sum_groups < groups_shuffled 
1131838420

real    0m1.828s
user    0m1.811s
sys 0m0.016s

sumspeed$ time ./sum_groups < groups_sorted
2494032110

real    0m3.189s
user    0m3.169s
sys 0m0.016s

perf diff

Now we can use perf to find the hottest spots in our program.

sumspeed$ perf record ./sum_groups < groups_shuffled
1166805982
[ perf record: Woken up 1 times to write data ]
[kernel.kallsyms] with build id 3a2171019937a2070663f3b6419330223bd64e96 not found, continuing without symbols
Warning:
Processed 4636 samples and lost 6.95% samples!

[ perf record: Captured and wrote 0.176 MB perf.data (4314 samples) ]

sumspeed$ perf record ./sum_groups < groups_sorted
2571547832
[ perf record: Woken up 2 times to write data ]
[kernel.kallsyms] with build id 3a2171019937a2070663f3b6419330223bd64e96 not found, continuing without symbols
[ perf record: Captured and wrote 0.420 MB perf.data (10775 samples) ]

And the difference between them:

sumspeed$ perf diff
[...]
# Event 'cycles:uppp'
#
# Baseline  Delta Abs  Shared Object        Symbol                                                                  
# ........  .........  ...................  ........................................................................
#
    57.99%    +26.33%  sum_groups           [.] main
    12.10%     -7.41%  libc-2.23.so         [.] _IO_getc
     9.82%     -6.40%  libstdc++.so.6.0.21  [.] std::num_get<char, std::istreambuf_iterator<char, std::char_traits<c
     6.45%     -4.00%  libc-2.23.so         [.] _IO_ungetc
     2.40%     -1.32%  libc-2.23.so         [.] _IO_sputbackc
     1.65%     -1.21%  libstdc++.so.6.0.21  [.] 0x00000000000dc4a4
     1.57%     -1.20%  libc-2.23.so         [.] _IO_fflush
     1.71%     -1.07%  libstdc++.so.6.0.21  [.] std::istream::sentry::sentry
     1.22%     -0.77%  libstdc++.so.6.0.21  [.] std::istream::operator>>
     0.79%     -0.47%  libstdc++.so.6.0.21  [.] __gnu_cxx::stdio_sync_filebuf<char, std::char_traits<char> >::uflow
[...]

More time in main(), which probably has grouped_sum() inlined. Great, thanks a lot, perf.

perf annotate

Is there a difference in where the time is spent inside main()?

Shuffled:

sumspeed$ perf annotate -i perf.data.old
[...]
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
       │180:   xor    %eax,%eax
       │       test   %rdi,%rdi
       │     ↓ je     1a4
       │       nop
       │         p_out[p_g[i]] += p_x[i];
  6,88 │190:   movslq (%r9,%rax,4),%rdx
 58,54 │       mov    (%r8,%rax,4),%esi
       │     #include <chrono>
       │     #include <vector>
       │
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
  3,86 │       add    $0x1,%rax
       │         p_out[p_g[i]] += p_x[i];
 29,61 │       add    %esi,(%rcx,%rdx,4)
[...]

Sorted:

sumspeed$ perf annotate -i perf.data
[...]
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
       │180:   xor    %eax,%eax
       │       test   %rdi,%rdi
       │     ↓ je     1a4
       │       nop
       │         p_out[p_g[i]] += p_x[i];
  1,00 │190:   movslq (%r9,%rax,4),%rdx
 55,12 │       mov    (%r8,%rax,4),%esi
       │     #include <chrono>
       │     #include <vector>
       │
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
  0,07 │       add    $0x1,%rax
       │         p_out[p_g[i]] += p_x[i];
 43,28 │       add    %esi,(%rcx,%rdx,4)
[...]

No, it's the same two instructions dominating. So they take a long time in both cases, but are even worse when the data is sorted.

perf stat

Okay. But we should be running them the same number of times, so each instruction must be getting slower for some reason. Let's see what perf stat says.

sumspeed$ perf stat ./sum_groups < groups_shuffled 
1138880176

 Performance counter stats for './sum_groups':

       1826,232278      task-clock (msec)         #    0,999 CPUs utilized          
                72      context-switches          #    0,039 K/sec                  
                 1      cpu-migrations            #    0,001 K/sec                  
             4 076      page-faults               #    0,002 M/sec                  
     5 403 949 695      cycles                    #    2,959 GHz                    
       930 473 671      stalled-cycles-frontend   #   17,22% frontend cycles idle   
     9 827 685 690      instructions              #    1,82  insn per cycle         
                                                  #    0,09  stalled cycles per insn
     2 086 725 079      branches                  # 1142,639 M/sec                  
         2 069 655      branch-misses             #    0,10% of all branches        

       1,828334373 seconds time elapsed

sumspeed$ perf stat ./sum_groups < groups_sorted
2496546045

 Performance counter stats for './sum_groups':

       3186,100661      task-clock (msec)         #    1,000 CPUs utilized          
                 5      context-switches          #    0,002 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             4 079      page-faults               #    0,001 M/sec                  
     9 424 565 623      cycles                    #    2,958 GHz                    
     4 955 937 177      stalled-cycles-frontend   #   52,59% frontend cycles idle   
     9 829 009 511      instructions              #    1,04  insn per cycle         
                                                  #    0,50  stalled cycles per insn
     2 086 942 109      branches                  #  655,014 M/sec                  
         2 078 204      branch-misses             #    0,10% of all branches        

       3,186768174 seconds time elapsed

Only one thing stands out: stalled-cycles-frontend.

Okay, the instruction pipeline is stalling. In the frontend. Exactly what that means probably varies between microarchitectures.

I have a guess, though. If you're generous, you might even call it a hypothesis.

Hypothesis

By sorting the input, you're increasing locality of the writes. In fact, they will be very local; almost all additions you do will write to the same location as the previous one.

That's great for the cache, but not great for the pipeline. You're introducing data dependencies, preventing the next addition instruction from proceeding until the previous addition has completed (or has otherwise made the result available to succeeding instructions).

That's your problem.

I think.
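
If you want to poke at just this effect, here's a minimal toy sketch, separate from sum_groups (everything in it -- names, sizes, slot counts -- is made up for illustration): the summing loop is identical in both runs, only the pattern of target slots differs. I'd expect the "same slot" run to be noticeably slower.

#include <chrono>
#include <iostream>
#include <vector>

// Same read-modify-write as grouped_sum(), with the slot index loaded
// from memory so the compiler can't just keep the sum in a register.
void add_all(const std::vector<int>& x, const std::vector<int>& idx,
             std::vector<int>& sums) {
  for (size_t i = 0; i < x.size(); ++i) {
    sums[idx[i]] += x[i];
  }
}

int main() {
  const int n = 10 * 1000 * 1000;
  std::vector<int> x(n, 1);
  std::vector<int> same(n, 0);   // every add hits slot 0, like sorted input
  std::vector<int> spread(n);    // adds rotate over 16 slots, mostly independent
  for (int i = 0; i < n; ++i) spread[i] = i & 15;
  std::vector<int> sums(16, 0);

  std::chrono::system_clock::time_point t0 = std::chrono::system_clock::now();
  add_all(x, same, sums);
  std::chrono::system_clock::time_point t1 = std::chrono::system_clock::now();
  add_all(x, spread, sums);
  std::chrono::system_clock::time_point t2 = std::chrono::system_clock::now();

  std::cout << "same slot:      " << (t1 - t0).count() << std::endl;
  std::cout << "rotating slots: " << (t2 - t1).count() << std::endl;
  std::cout << "checksum: " << sums[0] << std::endl; // keep sums from being optimized out
  return 0;
}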

Fixing it

Multiple sum vectors

Actually, let's try something. What if we used multiple sum vectors, switching between them for each addition, and then summed those up at the end? It costs us a little locality, but should remove the data dependencies.

(the code is not pretty; don't judge me, internet!!)

#include <iostream>
#include <chrono>
#include <vector>

#ifndef NSUMS
#define NSUMS (4) // must be power of 2 (for masking to work)
#endif

// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
  for (size_t i = 0; i < n; ++i) {
    p_out[i & (NSUMS-1)][p_g[i]] += p_x[i];
  }
}

int main() {
  std::vector<int> values;
  std::vector<int> groups;
  std::vector<int> sums[NSUMS];

  int n_groups = 0;

  // Read in the values and calculate the max number of groups
  while(std::cin) {
    int value, group;
    std::cin >> value >> group;
    values.push_back(value);
    groups.push_back(group);
    if (group >= n_groups) {
      n_groups = group+1;
    }
  }
  for (int i=0; i<NSUMS; ++i) {
    sums[i].resize(n_groups);
  }

  // Time grouped sums
  std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
  int* sumdata[NSUMS];
  for (int i = 0; i < NSUMS; ++i) {
    sumdata[i] = sums[i].data();
  }
  for (int i = 0; i < 1000; ++i) {
    grouped_sum(values.data(), groups.data(), values.size(), sumdata);
  }
  for (int i = 1; i < NSUMS; ++i) {
    for (int j = 0; j < n_groups; ++j) {
      sumdata[0][j] += sumdata[i][j];
    }
  }
  std::chrono::system_clock::time_point end = std::chrono::system_clock::now();

  std::cout << (end - start).count() << " with NSUMS=" << NSUMS << std::endl;

  return 0;
}

(oh, and I also fixed the n_groups calculation; it was off by one.)

Results

After configuring my Makefile to pass a -DNSUMS=... argument to the compiler, I could do this:

sumspeed$ for n in 1 2 4 8 128; do make -s clean && make -s NSUMS=$n && (perf stat ./sum_groups < groups_shuffled && perf stat ./sum_groups < groups_sorted)  2>&1 | egrep '^[0-9]|frontend'; done
1134557008 with NSUMS=1
       924 611 882      stalled-cycles-frontend   #   17,13% frontend cycles idle   
2513696351 with NSUMS=1
     4 998 203 130      stalled-cycles-frontend   #   52,79% frontend cycles idle   
1116188582 with NSUMS=2
       899 339 154      stalled-cycles-frontend   #   16,83% frontend cycles idle   
1365673326 with NSUMS=2
     1 845 914 269      stalled-cycles-frontend   #   29,97% frontend cycles idle   
1127172852 with NSUMS=4
       902 964 410      stalled-cycles-frontend   #   16,79% frontend cycles idle   
1171849032 with NSUMS=4
     1 007 807 580      stalled-cycles-frontend   #   18,29% frontend cycles idle   
1118732934 with NSUMS=8
       881 371 176      stalled-cycles-frontend   #   16,46% frontend cycles idle   
1129842892 with NSUMS=8
       905 473 182      stalled-cycles-frontend   #   16,80% frontend cycles idle   
1497803734 with NSUMS=128
     1 982 652 954      stalled-cycles-frontend   #   30,63% frontend cycles idle   
1180742299 with NSUMS=128
     1 075 507 514      stalled-cycles-frontend   #   19,39% frontend cycles idle   

The optimal number of sum vectors will probably depend on your CPU's pipeline depth. My 7-year-old ultrabook CPU can probably max out the pipeline with fewer vectors than a new fancy desktop CPU would need.

Clearly, more is not necessarily better; when I went crazy with 128 sum vectors, we started suffering more from cache misses -- as evidenced by the shuffled input becoming slower than sorted, like you had originally expected. We've come full circle! :)

Per-group sum in register

(this was added in an edit)

Agh, nerd sniped! If you know your input will be sorted and are looking for even more performance, the following rewrite of the function (with no extra sum arrays) is even faster, at least on my computer.

// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
  int i = n-1;
  while (i >= 0) {
    int g = p_g[i];
    int gsum = 0;
    do {
      gsum += p_x[i--];
    } while (i >= 0 && p_g[i] == g);
    p_out[g] += gsum;
  }
}

The trick in this one is that it allows the compiler to keep the gsum variable, the sum of the group, in a register. I'm guessing (but may be very wrong) that this is faster because the feedback loop in the pipeline can be shorter here, and/or because there are fewer memory accesses. A good branch predictor will make the extra check for group equality cheap.
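
To make it concrete, here's a tiny driver (not part of my benchmarks; the data is made up) showing what the function computes on a small, already-sorted input:

#include <iostream>
#include <vector>

// Same function as above, repeated so this example compiles on its own.
void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
  int i = n-1;
  while (i >= 0) {
    int g = p_g[i];
    int gsum = 0;
    do {
      gsum += p_x[i--];
    } while (i >= 0 && p_g[i] == g);
    p_out[g] += gsum;
  }
}

int main() {
  std::vector<int> values = {1, 2, 3, 4, 5, 6};
  std::vector<int> groups = {0, 0, 1, 1, 1, 2};  // already sorted by group
  std::vector<int> sums(3, 0);
  grouped_sum(values.data(), groups.data(), values.size(), sums.data());
  for (int s : sums) std::cout << s << " ";  // prints: 3 12 6
  std::cout << std::endl;
  return 0;
}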

Results

It's terrible for shuffled input...

sumspeed$ time ./sum_groups < groups_shuffled
2236354315

real    0m2.932s
user    0m2.923s
sys 0m0.009s

...but is around 40% faster than my "many sums" solution for sorted input.

sumspeed$ time ./sum_groups < groups_sorted
809694018

real    0m1.501s
user    0m1.496s
sys 0m0.005s

Lots of small groups will be slower than a few big ones, so whether or not this is the faster implementation will really depend on your data here. And, as always, on your CPU model.

Multiple sum vectors, with offset instead of bit masking

Sopel suggested four unrolled additions as an alternative to my bit masking approach. I've implemented a generalized version of their suggestion, which can handle different NSUMS. I'm counting on the compiler unrolling the inner loop for us (which it did, at least for NSUMS=4).

#include <iostream>
#include <chrono>
#include <vector>

#ifndef NSUMS
#define NSUMS (4) // must be power of 2 (for masking to work)
#endif

#ifndef INNER
#define INNER (0)
#endif
#if INNER
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
  size_t i = 0;
  int quadend = n & ~(NSUMS-1);
  for (; i < quadend; i += NSUMS) {
    for (int k=0; k<NSUMS; ++k) {
      p_out[k][p_g[i+k]] += p_x[i+k];
    }
  }
  for (; i < n; ++i) {
    p_out[0][p_g[i]] += p_x[i];
  }
}
#else
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
  for (size_t i = 0; i < n; ++i) {
    p_out[i & (NSUMS-1)][p_g[i]] += p_x[i];
  }
}
#endif


int main() {
  std::vector<int> values;
  std::vector<int> groups;
  std::vector<int> sums[NSUMS];

  int n_groups = 0;

  // Read in the values and calculate the max number of groups
  while(std::cin) {
    int value, group;
    std::cin >> value >> group;
    values.push_back(value);
    groups.push_back(group);
    if (group >= n_groups) {
      n_groups = group+1;
    }
  }
  for (int i=0; i<NSUMS; ++i) {
    sums[i].resize(n_groups);
  }

  // Time grouped sums
  std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
  int* sumdata[NSUMS];
  for (int i = 0; i < NSUMS; ++i) {
    sumdata[i] = sums[i].data();
  }
  for (int i = 0; i < 1000; ++i) {
    grouped_sum(values.data(), groups.data(), values.size(), sumdata);
  }
  for (int i = 1; i < NSUMS; ++i) {
    for (int j = 0; j < n_groups; ++j) {
      sumdata[0][j] += sumdata[i][j];
    }
  }
  std::chrono::system_clock::time_point end = std::chrono::system_clock::now();

  std::cout << (end - start).count() << " with NSUMS=" << NSUMS << ", INNER=" << INNER << std::endl;

  return 0;
}

Results

Time to measure. Note that since I was working in /tmp yesterday, I don't have the exact same input data. Hence, these results aren't directly comparable to the previous ones (but probably close enough).

sumspeed$ for n in 2 4 8 16; do for inner in 0 1; do make -s clean && make -s NSUMS=$n INNER=$inner && (perf stat ./sum_groups < groups_shuffled && perf stat ./sum_groups < groups_sorted)  2>&1 | egrep '^[0-9]|frontend'; done; done
1130558787 with NSUMS=2, INNER=0
       915 158 411      stalled-cycles-frontend   #   16,96% frontend cycles idle   
1351420957 with NSUMS=2, INNER=0
     1 589 408 901      stalled-cycles-frontend   #   26,21% frontend cycles idle   
840071512 with NSUMS=2, INNER=1
     1 053 982 259      stalled-cycles-frontend   #   23,26% frontend cycles idle   
1391591981 with NSUMS=2, INNER=1
     2 830 348 854      stalled-cycles-frontend   #   45,35% frontend cycles idle   
1110302654 with NSUMS=4, INNER=0
       890 869 892      stalled-cycles-frontend   #   16,68% frontend cycles idle   
1145175062 with NSUMS=4, INNER=0
       948 879 882      stalled-cycles-frontend   #   17,40% frontend cycles idle   
822954895 with NSUMS=4, INNER=1
     1 253 110 503      stalled-cycles-frontend   #   28,01% frontend cycles idle   
929548505 with NSUMS=4, INNER=1
     1 422 753 793      stalled-cycles-frontend   #   30,32% frontend cycles idle   
1128735412 with NSUMS=8, INNER=0
       921 158 397      stalled-cycles-frontend   #   17,13% frontend cycles idle   
1120606464 with NSUMS=8, INNER=0
       891 960 711      stalled-cycles-frontend   #   16,59% frontend cycles idle   
800789776 with NSUMS=8, INNER=1
     1 204 516 303      stalled-cycles-frontend   #   27,25% frontend cycles idle   
805223528 with NSUMS=8, INNER=1
     1 222 383 317      stalled-cycles-frontend   #   27,52% frontend cycles idle   
1121644613 with NSUMS=16, INNER=0
       886 781 824      stalled-cycles-frontend   #   16,54% frontend cycles idle   
1108977946 with NSUMS=16, INNER=0
       860 600 975      stalled-cycles-frontend   #   16,13% frontend cycles idle   
911365998 with NSUMS=16, INNER=1
     1 494 671 476      stalled-cycles-frontend   #   31,54% frontend cycles idle   
898729229 with NSUMS=16, INNER=1
     1 474 745 548      stalled-cycles-frontend   #   31,24% frontend cycles idle   

Yup, the inner loop with NSUMS=8 is the fastest on my computer. Compared to my "local gsum" approach, it also has the added benefit of not becoming terrible for the shuffled input.

Interesting to note: NSUMS=16 becomes worse than NSUMS=8. This could be because we're starting to see more cache misses, or because we don't have enough registers to unroll the inner loop properly.


Here is why sorted groups are slower than unsorted groups:

First, here is the assembly code for the summing loop:

008512C3  mov         ecx,dword ptr [eax+ebx]
008512C6  lea         eax,[eax+4]
008512C9  lea         edx,[esi+ecx*4] // &sums[groups[i]]
008512CC  mov         ecx,dword ptr [eax-4] // values[i]
008512CF  add         dword ptr [edx],ecx // sums[groups[i]]+=values[i]
008512D1  sub         edi,1
008512D4  jne         main+163h (08512C3h)

Let's look at the add instruction, which is the main cause of this issue:

008512CF  add         dword ptr [edx],ecx // sums[groups[i]]+=values[i]

When the processor executes this instruction, it first issues a memory read (load) request for the address in edx, then adds the value of ecx, and then issues a write (store) request for the same address.
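
Written out in C++ (just an illustration, not compiler output), that single instruction is effectively doing three steps:

// What `add dword ptr [edx],ecx` does, split into its three steps.
void add_to_sum(int* sum_ptr /* edx: &sums[groups[i]] */,
                int value    /* ecx: values[i] */) {
  int tmp = *sum_ptr;  // load from the address in edx
  tmp += value;        // add ecx
  *sum_ptr = tmp;      // store back to the same address
}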

There is a feature in the processor called memory reordering:

To allow performance optimization of instruction execution, the IA-32 architecture allows departures from the strong-ordering model called processor ordering in Pentium 4, Intel Xeon, and P6 family processors. These processor-ordering variations (called here the memory-ordering model) allow performance enhancing operations such as allowing reads to go ahead of buffered writes. The goal of any of these variations is to increase instruction execution speeds, while maintaining memory coherency, even in multiple-processor systems.

And there is a rule:

Reads may be reordered with older writes to different locations but not with older writes to the same location.

So if the next iteration reaches the add instruction before the write request has completed, and the address in edx is different from the previous one, it will not wait: the read request is issued, reordered ahead of the older write, and the add instruction continues. But if the address is the same, the add instruction has to wait until the older write is done.

Note that the loop is short, and the processor can execute it faster than the memory controller can complete the write request.

So with sorted groups you read and write the same address many times in a row, which forfeits the performance enhancement from memory reordering. Meanwhile, with shuffled groups each iteration will probably use a different address, so the read does not have to wait for the older write and can be reordered ahead of it; the add instruction does not have to wait for the previous one to finish.
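
As a small illustration (the data here is made up for the example), you can print the address that each iteration's add targets. With sorted groups the same address repeats back-to-back; with shuffled groups it almost always changes, so the read can be reordered ahead of the older write:

#include <cstdio>

int main() {
  int sums[4] = {0, 0, 0, 0};
  int sorted_groups[8]   = {0, 0, 0, 1, 1, 2, 2, 3};
  int shuffled_groups[8] = {2, 0, 3, 1, 0, 2, 1, 3};

  for (int i = 0; i < 8; ++i)   // consecutive iterations hit the same address
    std::printf("sorted   i=%d writes %p\n", i, (void*)&sums[sorted_groups[i]]);
  for (int i = 0; i < 8; ++i)   // addresses keep changing
    std::printf("shuffled i=%d writes %p\n", i, (void*)&sums[shuffled_groups[i]]);
  return 0;
}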