Aligned and unaligned memory access with AVX/AVX2 intrinsics

There is no way to explicitly control folding of loads with intrinsics. I consider this a weakness of intrinsics. If you want to explicitly control the folding then you have to use assembly.

In previous versions of GCC I was able to control the folding to some degree by choosing an aligned or unaligned load. However, that no longer appears to be the case (GCC 4.9.2). For example, in the function AddDot4x4_vec_block_8wide here, the loads are folded:

vmulps  ymm9, ymm0, YMMWORD PTR [rax-256]
vaddps  ymm8, ymm9, ymm8

However, in a previous version of GCC the loads were not folded:

vmovups ymm9, YMMWORD PTR [rax-256]
vmulps  ymm9, ymm0, ymm9
vaddps  ymm8, ymm8, ymm9

The correct solution is, obviously, to use aligned loads only when you know the data is aligned, and to use assembly if you really want to control the folding explicitly.
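
To make the distinction concrete, here is a minimal sketch of the same multiply-add written once with unaligned and once with aligned load intrinsics. The function names muladd8_unaligned and muladd8_aligned are hypothetical and not taken from the code discussed above, and the target("avx") attribute is a GCC/Clang extension used here only so the snippet compiles without -mavx. Which instructions the compiler emits for either version, and whether a load is folded into vmulps, remains the compiler's choice.

```c
#include <immintrin.h>

/* acc[0..7] += a[0..7] * b[0..7]; no alignment requirement on any pointer.
   The compiler may emit a separate vmovups for each load or fold one of
   them into an arithmetic instruction as a memory operand. */
__attribute__((target("avx")))
void muladd8_unaligned(const float *a, const float *b, float *acc)
{
    __m256 va   = _mm256_loadu_ps(a);
    __m256 vb   = _mm256_loadu_ps(b);
    __m256 vacc = _mm256_loadu_ps(acc);
    vacc = _mm256_add_ps(vacc, _mm256_mul_ps(va, vb));
    _mm256_storeu_ps(acc, vacc);
}

/* Same computation, but the caller promises that a, b and acc are all
   32-byte aligned. If that promise is broken and the compiler emits
   vmovaps, the access faults at run time. */
__attribute__((target("avx")))
void muladd8_aligned(const float *a, const float *b, float *acc)
{
    __m256 va   = _mm256_load_ps(a);
    __m256 vb   = _mm256_load_ps(b);
    __m256 vacc = _mm256_load_ps(acc);
    vacc = _mm256_add_ps(vacc, _mm256_mul_ps(va, vb));
    _mm256_store_ps(acc, vacc);
}
```

Compiling both functions with optimization and comparing the generated assembly (for example with gcc -O2 -S) shows what a particular compiler version actually does with each load.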


In addition to Z boson's answer, I'd add that the problem may arise because the compiler assumes the memory region is aligned (the array is marked with __attribute__ ((aligned(32)))). At run time that attribute may not be honored for values on the stack, because the stack is only 16-byte aligned (see this bug, which is still open at the time of this writing, though some fixes have made it into GCC 4.6). The compiler is within its rights to choose which instructions implement the intrinsics, so it may or may not fold the memory load into the computational instruction. It is also within its rights to use vmovaps when the folding does not occur (because, as noted above, the memory region is supposed to be aligned).
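
One way to check whether a buffer really has the alignment the compiler was promised is to test the address at run time before issuing an aligned load. The sketch below is illustrative only: is_aligned and sum8_if_aligned are hypothetical helpers, and target("avx") is again a GCC/Clang extension used so the snippet compiles without -mavx. The check should go through a pointer parameter, as here, because a compiler may constant-fold the test for an object it already believes to be aligned.

```c
#include <immintrin.h>
#include <stdint.h>

/* Returns 1 if p is a multiple of `alignment` bytes, else 0.
   `alignment` must be a power of two. */
int is_aligned(const void *p, uintptr_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Sums the 8 floats at p using an aligned 256-bit load.
   Returns 1 and writes the sum to *out if p is 32-byte aligned;
   returns 0 and leaves *out untouched otherwise. */
__attribute__((target("avx")))
int sum8_if_aligned(const float *p, float *out)
{
    float tmp[8];
    float sum = 0.0f;

    if (!is_aligned(p, 32))
        return 0;               /* an aligned load would not be safe here */

    _mm256_storeu_ps(tmp, _mm256_load_ps(p));
    for (int i = 0; i < 8; i++)
        sum += tmp[i];
    *out = sum;
    return 1;
}
```

In a debug build the same test could guard every _mm256_load_ps call on a stack array, which would reveal whether the aligned(32) attribute was actually honored.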

You can try forcing the compiler to realign the stack to 32 bytes upon entry to main by specifying -mstackrealign and -mpreferred-stack-boundary=5 (see here), but this will incur a performance overhead.
