Getting started with Intel x86 SSE SIMD instructions

First, I don't recommend using the built-in functions: they are not portable, even across compilers targeting the same architecture.

Use intrinsics instead. GCC does a wonderful job of optimizing SSE intrinsics into even better code, and you can always peek at the generated assembly to see how to use SSE to its full potential.

Intrinsics are easy - just like normal function calls:

#include <immintrin.h>  // portable to all x86 compilers

int main()
{
    // _mm_set_ps takes the highest element first, the opposite of C array order.
    // Use _mm_setr_ps if you want array ("little endian") element order in the source.
    __m128 vector1 = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 vector2 = _mm_set_ps(7.0f, 8.0f, 9.0f, 0.0f);

    __m128 sum = _mm_add_ps(vector1, vector2); // sum = vector1 + vector2

    vector1 = _mm_shuffle_ps(vector1, vector1, _MM_SHUFFLE(0,1,2,3));
    // vector1 is now (1, 2, 3, 4) in _mm_set_ps order (the shuffle above reversed it)
    return 0;
}
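
If the element order is confusing, storing a vector to a plain array makes it visible. This is only a small sketch: the helper names store4 and reverse4 are made up for illustration, while the intrinsics themselves are the standard SSE ones.

```c
#include <immintrin.h>

// Hypothetical helper: copy the four floats of v into out[0..3].
// out[0] receives the lowest element of the vector.
void store4(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);  // unaligned store, works for any float pointer
}

// Hypothetical helper: reverse the element order of v,
// using the same shuffle as in the example above.
__m128 reverse4(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}
```

Storing _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f) with store4 should give the array {1, 2, 3, 4}, the same as storing _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f). Passing it through reverse4 first should give {4, 3, 2, 1}.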

Use _mm_load_ps or _mm_loadu_ps to load data from arrays.
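
As a rough sketch of what that looks like in practice, here is an element-wise add over two arrays. The function name add_arrays and its signature are my own invention for the example. It uses _mm_loadu_ps and _mm_storeu_ps because the pointers are not assumed to be aligned; _mm_load_ps requires a 16-byte-aligned address.

```c
#include <immintrin.h>
#include <stddef.h>

// Hypothetical helper: dst[i] = a[i] + b[i] for every i in [0, n).
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    // Main loop: four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // unaligned load of a[i..i+3]
        __m128 vb = _mm_loadu_ps(b + i);   // unaligned load of b[i..i+3]
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }

    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

If you control the allocation and can guarantee 16-byte alignment, the aligned variants _mm_load_ps and _mm_store_ps can be used in the main loop instead.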

Of course, there are many more operations available. SSE is really powerful and, in my opinion, relatively easy to learn.

See also https://stackoverflow.com/tags/sse/info for some links to guides.


Since you asked for resources:

A practical guide to using SSE with C++: Good conceptual overview on how to use SSE effectively, with examples.

MSDN Listing of Compiler Intrinsics: Comprehensive reference for all your intrinsic needs. It's MSDN, but pretty much all the intrinsics listed here are supported by GCC and ICC as well.

Christopher Wright's SSE Page: Quick reference on the meanings of the SSE opcodes. I guess the Intel Manuals can serve the same function, but this is faster.

It's probably best to write most of your code in intrinsics, but do check the disassembly of your compiler's output (for example with objdump) to make sure that it's producing efficient code. SIMD code generation is still a fairly new technology, and it's quite possible that the compiler will get it wrong in some cases.
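
A small function like the following is a convenient thing to practice on. The name scale_in_place is hypothetical, and the file name below is just a placeholder. Compile it with something like gcc -O2 -S scale.c (or run objdump -d on the object file) and look at the loop body: I would expect a packed multiply there, with the broadcast of k done once before the loop, but that is exactly the kind of thing to verify rather than assume.

```c
#include <immintrin.h>
#include <stddef.h>

// Hypothetical helper: multiply every element of p[0..n) by k, in place.
void scale_in_place(float *p, size_t n, float k)
{
    __m128 vk = _mm_set1_ps(k);  // broadcast k to all four elements
    size_t i = 0;

    // Main loop: four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(p + i);
        _mm_storeu_ps(p + i, _mm_mul_ps(v, vk));
    }

    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i)
        p[i] *= k;
}
```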
