FMA3 in GCC: how to enable

Only answering a very small part of the question here. If you write _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0), gcc-4.9 handles it almost like inline asm and does not optimize it much. If you replace it with areg0*breg0+tmp0, a syntax that is supported by both gcc and clang, then gcc starts optimizing and may use FMA if available. I improved that for gcc-5, _mm256_add_ps for instance is now implemented as an inline function that simply uses +, so the code with intrinsics can be optimized as well.


The following compiler options are sufficient to contract _mm256_add_ps(_mm256_mul_ps(a, b), c) to a single fma instruction now (e.g vfmadd213ps):

GCC 5.3:   -O2 -mavx2 -mfma
Clang 3.7: -O1 -mavx2 -mfma -ffp-contract=fast
ICC 13:    -O1 -march=core-avx2

I tried /O2 /arch:AVX2 /fp:fast with MSVC but it still does not contract (surprise surprise). MSVC will contract scalar operations though.

GCC started doing this since at least GCC 5.1.


Although -O1 is sufficient for this optimization with some compilers, always use at least -O2 for overall performance, preferably -O3 -march=native -flto and also profile-guided optimization.

And if it's ok for your code, -ffast-math.

Tags:

C++

Gcc

Intel

Avx

Fma