How to divide a 16-bit integer by 255 using SSE?

Out of curiosity (and if performance is an issue), here is the accuracy of using (val + offset) >> 8 as a substitute for (val / 255) for all 16-bit values up to 255*255 (for example when blending two 8-bit values using an 8-bit blend factor):

(avg signed error / avg abs error / max abs error)
offset    0:  0.49805 / 0.49805 / 1  (just shifting, no offset)
offset 0x7F:  0.00197 / 0.24806 / 1
offset 0x80: -0.00194 / 0.24806 / 1

All other offsets produce both a larger signed error and a larger average absolute error. So if you can live with an average error of less than 0.25, use offset + shifting for a small speed increase.

// approximate division by 255 for packs of 8 times 16bit values in vals_packed
__m128i offset = _mm_set1_epi16(0x80); // constant
__m128i vals_packed_offset = _mm_add_epi16( vals_packed, offset );
__m128i result_packed = _mm_srli_epi16( vals_packed_offset, 8 );
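
If you want to double-check those numbers yourself, a small scalar brute-force sketch like this (my own illustration, not part of the original measurement) reproduces the table above; change offset to try the other rows:

// Hypothetical brute-force check of the error figures quoted above.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int offset = 0x80;            // try 0, 0x7F, 0x80
    double signed_sum = 0, abs_sum = 0;
    int max_abs = 0;
    const int n = 255 * 255 + 1;
    for (int v = 0; v <= 255 * 255; v++) {
        int exact  = v / 255;
        int approx = (v + offset) >> 8;
        int err    = exact - approx;    // positive means the approximation is too small
        signed_sum += err;
        abs_sum    += abs(err);
        if (abs(err) > max_abs) max_abs = abs(err);
    }
    printf("avg signed %.5f  avg abs %.5f  max abs %d\n",
           signed_sum / n, abs_sum / n, max_abs);
    return 0;
}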

If you want an exactly correct result for all cases, follow the advice from Marc Glisse's comment on the question Anton linked: SSE integer division?

Use GNU C native vector syntax to express division of a vector by your given scalar, and see what it does on the Godbolt compiler explorer:

Unsigned division is cheap:

typedef unsigned short vec_u16 __attribute__((vector_size(16)));
vec_u16 divu255(vec_u16 x){ return x/255; }  // unsigned division

# gcc 5.5 -O3 -march=haswell
divu255:
    vpmulhuw        xmm0, xmm0, XMMWORD PTR .LC3[rip]  # _mm_set1_epi16(0x8081)
    vpsrlw          xmm0, xmm0, 7
    ret

Intrinsics version:

 // UNSIGNED division with intrinsics
__m128i div255_epu16(__m128i x) {
    __m128i mulhi = _mm_mulhi_epu16(x, _mm_set1_epi16(0x8081));
    return _mm_srli_epi16(mulhi, 7);
}
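
Why 0x8081 works: 0x8081 = 32897 = ceil(2^23 / 255), so pmulhuw gives (x * 0x8081) >> 16 and the extra shift by 7 makes the total (x * 0x8081) >> 23, which equals x/255 exactly for every 16-bit x. If you want to convince yourself, here is a brute-force scalar check (my addition, not part of the original answer):

// Hypothetical exhaustive check: (x * 0x8081) >> 23 == x / 255 for all 16-bit x.
#include <stdint.h>
#include <assert.h>

int main(void)
{
    for (uint32_t x = 0; x <= 0xFFFF; x++) {
        uint32_t mulhi = (x * 0x8081u) >> 16;   // what _mm_mulhi_epu16 computes per lane
        assert((mulhi >> 7) == x / 255);        // the extra vpsrlw by 7
    }
    return 0;
}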

At only 2 uops, this has better throughput (but worse latency) than @ermlg's answer, if you're bottlenecked on front-end throughput, or port 0 throughput on Intel CPUs. (As always, it depends on the surrounding code when you use this as part of a larger function.) http://agner.org/optimize/

Vector shift only runs on port 0 on Intel chips, so @ermlg's 2 shifts + 1 add bottlenecks on port 0. (Again depending on surrounding code). And it's 3 uops vs. 2 for this.

On Skylake, pmulhuw / pmulhw runs on ports 0 or 1, so it can run in parallel with a shift. (But on Broadwell and earlier, they run only on port 0, conflicting with shifts. So the only advantage on Intel pre-Skylake is fewer total uops for the front-end and for out-of-order execution to keep track of.) pmulhuw has 5 cycle latency on Intel, vs. 1 for shifts, but OoO exec can typically hide a few cycles more latency when you can save uops for more throughput.

Ryzen also only runs pmulhuw on its P0, but shifts on P2, so it's excellent for this.

But signed integer division's rounding semantics don't match shifts

typedef short vec_s16 __attribute__((vector_size(16)));

vec_s16 div255(vec_s16 x){ return x/255; }  // signed division

    ; function arg x starts in xmm0
    vpmulhw xmm1, xmm0, XMMWORD PTR .LC3[rip]  ; a vector of set1(0x8081)
    vpaddw  xmm1, xmm1, xmm0
    vpsraw  xmm0, xmm0, 15       ; 0 or -1 according to the sign bit of x
    vpsraw  xmm1, xmm1, 7        ; shift the mulhi-and-add result
    vpsubw  xmm0, xmm1, xmm0     ; result += (x<0)

.LC3:
        .value  -32639
        .value  -32639
        ; repeated

At the risk of bloating the answer, here it is again with intrinsics:

// SIGNED division
__m128i div255_epi16(__m128i x) {
    __m128i tmp = _mm_mulhi_epi16(x, _mm_set1_epi16(0x8081));
    tmp = _mm_add_epi16(tmp, x);  // There's no integer FMA that's usable here
    x   = _mm_srai_epi16(x, 15);  // broadcast the sign bit
    tmp = _mm_srai_epi16(tmp, 7);
    return _mm_sub_epi16(tmp, x);
}
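
As a sanity check that the rounding really matches C's truncate-toward-zero semantics, an exhaustive test is cheap. This harness is my own sketch, assuming div255_epi16 above is in scope:

// Hypothetical test: compare div255_epi16 against C's truncating signed division
// for every 16-bit input value.
#include <emmintrin.h>
#include <stdint.h>
#include <assert.h>

int main(void)
{
    for (int v = -32768; v <= 32767; v++) {
        __m128i r = div255_epi16(_mm_set1_epi16((short)v));
        int16_t lane = (int16_t)_mm_cvtsi128_si32(r);   // low 16 bits = lane 0
        assert(lane == v / 255);                        // C division truncates toward zero
    }
    return 0;
}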

In the Godbolt output, note that gcc is smart enough to use the same 16B constant in memory for the set1 and for the one it generated itself for div255. AFAIK, this works like string-constant merging.


There is an integer approximation of division by 255:

inline int DivideBy255(int value)
{
    return (value + 1 + (value >> 8)) >> 8;
}

With SSE2 it looks like this:

inline __m128i DivideI16By255(__m128i value)
{
    return _mm_srli_epi16(_mm_add_epi16(
        _mm_add_epi16(value, _mm_set1_epi16(1)), _mm_srli_epi16(value, 8)), 8);
}

For AVX2:

inline __m256i DivideI16By255(__m256i value)
{
    return _mm256_srli_epi16(_mm256_add_epi16(
        _mm256_add_epi16(value, _mm256_set1_epi16(1)), _mm256_srli_epi16(value, 8)), 8);
}

For Altivec (Power):

typedef __vector int16_t v128_s16;
const v128_s16 K16_0001 = {1, 1, 1, 1, 1, 1, 1, 1};
const v128_s16 K16_0008 = {8, 8, 8, 8, 8, 8, 8, 8};

inline v128_s16 DivideBy255(v128_s16 value)
{
    return vec_sr(vec_add(vec_add(value, K16_0001), vec_sr(value, K16_0008)), K16_0008);
}

For NEON (ARM):

inline int16x8_t DivideI16By255(int16x8_t value)
{
    return vshrq_n_s16(vaddq_s16(
        vaddq_s16(value, vdupq_n_s16(1)), vshrq_n_s16(value, 8)), 8);
}

GCC optimizes x/255, where x is an unsigned short, to DWORD(x * 0x8081) >> 0x17, which can be further simplified to HWORD(x * 0x8081) >> 7 and finally to HWORD((x << 15) + (x << 7) + x) >> 7.

SIMD macros can look like this:

#define MMX_DIV255_U16(x) _mm_srli_pi16(_mm_mulhi_pu16(x, _mm_set1_pi16((short)0x8081)), 7)
#define SSE2_DIV255_U16(x) _mm_srli_epi16(_mm_mulhi_epu16(x, _mm_set1_epi16((short)0x8081)), 7)
#define AVX2_DIV255_U16(x) _mm256_srli_epi16(_mm256_mulhi_epu16(x, _mm256_set1_epi16((short)0x8081)), 7)
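
For example, a quick (hypothetical) use of the SSE2 macro:

// Each u16 lane is divided by 255; casts just reinterpret the large values as 16-bit lanes.
#include <emmintrin.h>

__m128i example(void)
{
    __m128i v = _mm_setr_epi16(0, 255, 510, 1000,
                               (short)32768, (short)65025, (short)65280, (short)65535);
    return SSE2_DIV255_U16(v);   // lanes become 0, 1, 2, 3, 128, 255, 256, 257
}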