Does aligning memory on particular address boundaries in C/C++ still improve x86 performance?

The penalties are usually small, but crossing a 4k page boundary on Intel CPUs before Skylake has a large penalty (~150 cycles). How can I accurately benchmark unaligned access speed on x86_64 has some details on the actual effects of crossing a cache-line boundary or a 4k page boundary. (This applies even if the load / store is inside one 2M or 1G hugepage, because the hardware can't know that until after it has started the process of checking the TLB twice.) For example, in an array of double that was only 4-byte aligned, at each page boundary there'd be one double split evenly across two 4k pages, and the same at every cache-line boundary.
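As a toy illustration of where those splits land (the buffer setup here is just for the demo, and nothing is ever dereferenced through a misaligned pointer), this counts how many elements of a 4-byte-aligned double array straddle a cache line or a page:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    enum { N = 4096 };
    char *buf = malloc(N * sizeof(double) + 16);
    if (!buf) return 1;

    /* Make a base address that is 4-byte aligned but NOT 8-byte aligned. */
    uintptr_t base = ((uintptr_t)buf + 7) & ~(uintptr_t)7;  /* round up to 8 */
    base += 4;                                              /* now only 4-byte aligned */

    unsigned line_splits = 0, page_splits = 0;
    for (size_t i = 0; i < N; i++) {
        uintptr_t first = base + i * sizeof(double);
        uintptr_t last  = first + sizeof(double) - 1;
        if (first / 64   != last / 64)   line_splits++;   /* crosses a cache line */
        if (first / 4096 != last / 4096) page_splits++;   /* crosses a 4k page    */
    }
    printf("%u cache-line splits, %u page splits in %u doubles\n",
           line_splits, page_splits, (unsigned)N);
    free(buf);
    return 0;
}
```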

Regular cache-line splits that don't cross a 4k page boundary cost ~6 extra cycles of latency on Intel (total of 11c on Skylake, vs. 4 or 5c for a normal L1d hit), and also cost extra throughput (which can matter in code that normally sustains close to 2 loads per clock).

Misalignment without crossing a 64-byte cache-line boundary has zero penalty on Intel. On AMD, cache lines are still 64 bytes, but there are relevant boundaries within cache lines at 32 bytes and maybe 16 on some CPUs.

Should I align every stack variable?

No, the compiler already does that for you. The x86-64 calling conventions maintain 16-byte stack alignment, so the compiler can get any alignment up to that for free, including 8-byte alignment for int64_t and double arrays.

Also remember that most local variables are kept in registers for most of the time they're getting heavy use. Unless a variable is volatile, or you compile without optimization, the value doesn't have to be stored / reloaded between accesses.

The normal ABIs also require natural alignment (alignment equal to its size) for all the primitive types, so even inside structs and so on you will get alignment, and a single primitive type will never span a cache-line boundary. (Exception: i386 System V only requires 4-byte alignment for int64_t and double. Outside of structs, the compiler will choose to give them more alignment, but inside structs it can't change the layout rules. So declare your structs in an order that puts the 8-byte members first, or at least lay them out so they get 8-byte alignment. Maybe use alignas(8) on such struct members if you care about 32-bit code, if there aren't already any members that require that much alignment.)
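For example, something along these lines (the struct and member names are just illustrative):

```c
#include <stdalign.h>
#include <stdint.h>

/* 8-byte members first, and alignas(8) so the double stays 8-byte
 * aligned even under i386 System V struct-layout rules (where double
 * and int64_t members otherwise only need 4-byte alignment). */
struct record {
    alignas(8) double timestamp;   /* no-op on x86-64, matters for 32-bit */
    int64_t id;
    int32_t flags;                 /* narrower members last */
};
```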

The x86-64 System V ABI (all non-Windows platforms) requires aligning arrays by 16 if they have automatic or static storage outside of a struct. alignof(max_align_t) is 16 on x86-64 SysV, so malloc / new return 16-byte-aligned memory for dynamic allocations. gcc targeting Windows also aligns stack arrays if it auto-vectorizes over them in that function.
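A quick way to see those guarantees on an x86-64 SysV system (illustrative only):

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double local_array[4];   /* 32 bytes of automatic storage, outside a struct */
    printf("alignof(max_align_t) = %zu\n", (size_t)alignof(max_align_t)); /* 16 */
    printf("local_array address mod 16 = %zu\n",
           (size_t)((uintptr_t)local_array % 16));                        /* 0  */
    return 0;
}
```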


(If you cause undefined behaviour by violating the ABI's alignment requirements, it often doesn't make any performance difference. It usually doesn't cause correctness problems on x86 either, but it can lead to faults for SIMD types, and with auto-vectorization of scalar types. e.g. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?. So if you intentionally misalign data, make sure you don't access it with any pointer wider than char*. e.g. use memcpy(&tmp, buf, 8) with uint64_t tmp to do an unaligned load; gcc can auto-vectorize through that, IIRC.)
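That memcpy idiom might look like this (a minimal sketch; load_u64_unaligned is just an illustrative name):

```c
#include <string.h>
#include <stdint.h>

/* Unaligned 8-byte load without ever dereferencing a uint64_t* that
 * doesn't meet uint64_t's alignment requirement. */
static inline uint64_t load_u64_unaligned(const char *buf)
{
    uint64_t tmp;
    memcpy(&tmp, buf, sizeof(tmp));   /* typically compiles to a single load on x86-64 */
    return tmp;
}
```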


You might sometimes want to alignas(32) or 64 for large arrays, if you compile with AVX or AVX512 enabled. For a SIMD loop over a big array (that doesn't fit in L2 or L1d cache), with AVX/AVX2 (32-byte vectors) there's usually near-zero effect from making sure it's aligned by 32 on Intel Haswell/Skylake. Memory bottlenecks in data coming from L3 or DRAM will give the core's load/store units and L1d cache time to do multiple accesses under the hood, even if every other load/store crosses a cache-line boundary.
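For example (names are illustrative): alignas works for static or automatic arrays, while over-aligned dynamic allocations need aligned_alloc or posix_memalign, because plain malloc only promises alignof(max_align_t).

```c
#include <stdalign.h>
#include <stdlib.h>

/* 64-byte alignment: one cache line, and one AVX-512 vector. */
alignas(64) static float big_table[1 << 20];       /* static storage */

float *make_aligned_buffer(size_t n_floats)
{
    /* C11 aligned_alloc wants the size to be a multiple of the alignment. */
    size_t bytes = (n_floats * sizeof(float) + 63) & ~(size_t)63;
    return aligned_alloc(64, bytes);               /* free() it as usual */
}
```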

But with AVX512 on Skylake-server, there is a significant effect in practice for 64-byte alignment of arrays, even with arrays that are coming from L3 cache or maybe DRAM. I forget the details, it's been a while since I looked at an example, but maybe 10 to 15% even for a memory-bound loop? Every 64-byte vector load and store will cross a 64-byte cache line boundary if they aren't aligned.

Depending on the loop, you can handle under-aligned inputs by doing a first maybe-unaligned vector, then looping over aligned vectors until the last aligned vector. Another possibly-overlapping vector that goes to the end of the array can handle the last few bytes. This works great for a copy-and-process loop where it's ok to re-copy and re-process the same elements in the overlap, but there are other techniques you can use for other cases, e.g. a scalar loop up to an alignment boundary, narrower vectors, or masking. If your compiler is auto-vectorizing, it's up to the compiler to choose. If you're manually vectorizing with intrinsics, you get to / have to choose. If arrays are normally aligned, it's a good idea to just use unaligned loads (which have no penalty if the pointers are aligned at runtime), and let the hardware handle the rare cases of unaligned inputs so you don't have any software overhead on aligned inputs.
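A minimal sketch of that first-vector / aligned-body / overlapping-final-vector pattern with AVX intrinsics (scale_floats is a made-up example; it assumes n >= 8 and that dst and src don't overlap, so re-loading and re-storing the overlap regions is harmless):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* dst[j] = src[j] * factor, for n >= 8 floats, with aligned stores
 * in the main loop even if dst isn't 32-byte aligned. */
void scale_floats(float *dst, const float *src, size_t n, float factor)
{
    const __m256 vf = _mm256_set1_ps(factor);

    /* 1. One possibly-unaligned vector covering the first 8 elements. */
    _mm256_storeu_ps(dst, _mm256_mul_ps(_mm256_loadu_ps(src), vf));

    /* 2. Main loop: start at the first 32-byte boundary in dst, so the
     *    stores are aligned.  Loads from src stay unaligned loads; they
     *    only have a penalty if src is actually misaligned at runtime. */
    size_t i = 8 - (((uintptr_t)dst & 31) / sizeof(float));
    for (; i + 8 <= n; i += 8)
        _mm256_store_ps(dst + i, _mm256_mul_ps(_mm256_loadu_ps(src + i), vf));

    /* 3. Final possibly-overlapping vector ending exactly at dst + n. */
    _mm256_storeu_ps(dst + n - 8,
                     _mm256_mul_ps(_mm256_loadu_ps(src + n - 8), vf));
}
```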