C++ performance std::array vs std::vector

GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors

Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.

Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:

[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
        mov     rax, QWORD PTR glob[rip]
        mov     rcx, QWORD PTR v2[rip]
        mov     rdx, QWORD PTR v1[rip]
        movsd   xmm1, QWORD PTR [rax+8]
        movsd   xmm0, QWORD PTR [rax]
        lea     rsi, [rax+784]
.L23:
        movsd   xmm2, QWORD PTR [rcx]
        addsd   xmm2, QWORD PTR [rdx]
        add     rax, 8
        addsd   xmm0, xmm2
        movsd   QWORD PTR [rax-8], xmm0
        movsd   xmm0, QWORD PTR [rcx+8]
        addsd   xmm0, QWORD PTR [rdx+8]
        addsd   xmm0, xmm1
        movsd   QWORD PTR [rax], xmm0
        movsd   xmm1, QWORD PTR [rcx+16]
        addsd   xmm1, QWORD PTR [rdx+16]
        addsd   xmm1, QWORD PTR [rax+8]
        movsd   QWORD PTR [rax+8], xmm1
        cmp     rax, rsi
        jne     .L23
        ret

//=============
//Array Version
//=============
assemble_arr():
        mov     rax, QWORD PTR glob[rip]
        movsd   xmm2, QWORD PTR .LC1[rip]
        movsd   xmm3, QWORD PTR .LC2[rip]
        movsd   xmm1, QWORD PTR [rax+8]
        movsd   xmm0, QWORD PTR [rax]
        lea     rdx, [rax+784]
.L26:
        addsd   xmm1, xmm3
        addsd   xmm0, xmm2
        add     rax, 8
        movsd   QWORD PTR [rax-8], xmm0
        movapd  xmm0, xmm1
        movsd   QWORD PTR [rax], xmm1
        movsd   xmm1, QWORD PTR [rax+8]
        addsd   xmm1, xmm2
        movsd   QWORD PTR [rax+8], xmm1
        cmp     rax, rdx
        jne     .L26
        ret
[-snip-]

There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively, where for the vector version, the numbers are being added together through less efficient opcodes, as compared to the array version, which is using (more) SSE instructions. The vector version also involves more memory lookups compared to the array version. These factors in combination with each other is going to result in code that executes faster for the std::array version of the code than it will for the std::vector version.


C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.

const on a std::vector means the "control block" pointers can be assumed to not be modified after it's constructed, but the memory is still dynamically allocated an all the compiler knows is that it effectively has a const double * in static storage.

Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.

C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.


But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In @Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)


But couldn't the compiler still do the sums once outside the loop at runtime?

No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.

(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)

The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v1.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.

This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.

ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it load the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but it maybe used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)

Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.

Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.


If glob[] had been a static array, you'd still have had a problem. Because the compiler can't know that v1/v2.data() aren't pointing into that static array.

I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].

In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)

For testing, I put this on Godbolt

//__attribute__((noinline))
void assemble_vec()
{
     double *__restrict g = &glob[0];   // Helps MSVC, but not gcc/clang/ICC
    // std::vector<double> &g = glob;   // actually hurts ICC it seems?
    // #define g  glob                  // so use this as the alternative to __restrict
    for (size_t i=0; i<N-2; ++i)
    {
        g[i] += comb(v1[0],v2[0]);
        g[i+1] += comb(v1[1],v2[1]);
        g[i+2] += comb(v1[2],v2[2]);
    }  
}

We get this from MSVC outside the loop

    movsd   xmm2, QWORD PTR [rcx]       # v2[0]
    movsd   xmm3, QWORD PTR [rcx+8]
    movsd   xmm4, QWORD PTR [rcx+16]
    addsd   xmm2, QWORD PTR [rax]       # += v1[0]
    addsd   xmm3, QWORD PTR [rax+8]
    addsd   xmm4, QWORD PTR [rax+16]
    mov     eax, 98                             ; 00000062H

Then we get an efficient-looking loop.

So this is a missed-optimization for gcc/clang/ICC.