What do you do without fast gather and scatter in AVX2 instructions?

IDK why you use different parts of the same cur[8] array for indices and values; it made the source harder to understand. It took a while to figure out that there was only one real array, and the other half was just used to bounce vector values out to scalar registers.

It looks like you're only ever going vector -> scalar, not inserting scalars back into a vector. And also that nothing inside the loop depends on any data in sieveX[]; I'm not familiar with your sieving algorithm but I guess the point of this is to create data in memory for later use.


AVX2 has gathers (not scatters), but they're only fast on Skylake and newer. They're ok on Broadwell, slowish on Haswell, and slow on AMD. (Like one per 12 clocks for Ryzen's vpgatherqq). See http://agner.org/optimize/ and other performance links in the x86 tag wiki.

Intel's optimization manual has a small section on manual gather / scatter (using insert/extract or movhps) vs. hardware instructions, possibly worth reading. In this case where the indices are runtime variables (not a constant stride or something), I think Skylake can benefit from AVX2 gather instructions here.

See Intel's intrinsics guide to look up the intrinsic for asm instructions like movhps. I'm just talking about what you want to get your compiler to emit, because that's what's important and the asm mnemonics are shorter to type and don't need casting. You have to know the asm mnemonic to look them up in Agner Fog's instruction tables, or to read compiler output from auto-vectorization, so I usually think in asm and then translate that to intrinsics.


With AVX, you have 3 main options:

  • do everything scalar. Register pressure may be a problem, but generating indices as needed (instead of doing all 4 adds or subs to generate cur[4..7] at once) might help. Unless those mask vectors have different values in different elements.

(Using memory sources for scalar constants might not be bad, though, if they don't fit in 32-bit immediates and if you don't bottleneck on 2 memory ops per clock. The memory-destination or instructions would use indexed addressing modes, so the dedicated store-AGU on port 7 on Haswell and later couldn't be used. Thus AGU throughput could be a bottleneck.)

Extracting all 4 elements of a vector as scalar is more expensive than 4x scalar add or shift instructions, but you're doing more work than that. Still, with BMI2 for 1-uop variable-count shifts (instead of 3 uops on Intel without it), it might not be terrible. I think we can do better with SIMD, though, especially with careful tuning.

  • extract indices and values to scalar like you're doing now, so the OR into sieveX[] is pure scalar. Works even when two or more indices are the same.

    This costs you about 7 uops per ymm vector -> 4x scalar registers using extract ALU instructions, or 5 uops using store/reload (worth considering, maybe for one or two of the 4 vector extracts, because this code probably doesn't manage to bottleneck on load / store port throughput). If the compiler turns store/reload in the C source into shuffle/extract instructions, though, you can't easily override its strategy except maybe by using volatile. And BTW, you'd want alignas(32) uint64_t cur[8] to make sure actual vector stores don't cross a cache-line boundary.

or [rdi + rax*8], rdx (with an indexed addressing mode preventing full micro-fusion) is 3 uops on modern Intel CPUs (Haswell and later). We could avoid an indexed addressing mode (making it 2 uops for the front-end) by scaling + adding to the array base address using SIMD: e.g. srli by 3 instead of 6, mask off the low 3 bits (vpand), and vpaddq with set1_epi64(sieveX). So this costs 2 extra SIMD instructions to save 4 uops on SnB-family, per vector of indices. (You'd be extracting uint64_t* pointer elements instead of uint64_t indices. Or if sieveX can be a 32-bit absolute address (footnote 1), you could skip the vpaddq and extract already-scaled indices for the same gain.)

It would also enable the store-address uops to run on port 7 (Haswell and later); the simple AGU on port7 can only handle non-indexed addressing modes. (This makes extracting values to scalar with store+reload more attractive. You want lower latency for extracting indices, because the values aren't needed until after the load part of a memory-dst or completes.) It does mean more unfused-domain uops for the scheduler / execution units, but could well be worth the tradeoff.

This isn't a win on other AVX2 CPUs (Excavator / Ryzen or Xeon Phi); only SnB-family has a front-end cost and execution-port restrictions for indexed addressing modes.
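To make that address-calculation trick concrete, here is a minimal intrinsics sketch of the bit-index -> pointer conversion (the function and variable names are invented for this example; the scalar-OR loop that consumes the extracted pointers isn't shown):

```cpp
#include <immintrin.h>
#include <stdint.h>

// Sketch of the SIMD address calculation: srli by 3 (bit -> byte index),
// vpand to clear the low 3 bits (align down to the containing qword, i.e.
// (bit>>6)*8), then vpaddq with the base address.  The extracted elements
// are then ready-to-use uint64_t* values for non-indexed scalar ORs.
__attribute__((target("avx2")))
void bitidx_to_qword_ptrs(uint64_t out[4], const uint64_t bit_idx[4],
                          uint64_t *sieveX)
{
    __m256i v        = _mm256_loadu_si256((const __m256i *)bit_idx);
    __m256i byte_off = _mm256_and_si256(_mm256_srli_epi64(v, 3),
                                        _mm256_set1_epi64x(~7LL));
    __m256i ptrs     = _mm256_add_epi64(byte_off,
                                        _mm256_set1_epi64x((int64_t)(uintptr_t)sieveX));
    _mm256_storeu_si256((__m256i *)out, ptrs);
}
```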

  • extract indices, manually gather into a vector with vmovq / vmovhps for a SIMD vpor, then scatter back with vmovq / vmovhps.

    Just like a HW gather/scatter, correctness requires that all indices are unique, so you'll want to use one of the above options until you get to that point in your algo. (vector conflict detection + fallback would not be worth the cost vs. just always extracting to scalar: Fallback implementation for conflict detection in AVX2).

    See selectively xor-ing elements of a list with AVX2 instructions for an intrinsics version. (I knew I'd recently written an answer with a manual gather / scatter, but took me a while to find it!) In that case I only used 128-bit vectors because there wasn't any extra SIMD work to justify the extra vinserti128 / vextracti128.

Actually I think here you'd want to extract the high half of the _mm256_sllv_epi64 result so you have (the data that would be) cur[4..5] and cur[6..7] in two separate __m128i variables. You'd have vextracti128 / 2x vpor xmm instead of vinserti128 / vpor ymm / vextracti128.

The former has less port5 pressure, and has better instruction-level parallelism: The two 128-bit halves are separate dependency chains that don't get coupled to each other, so store/reload bottlenecks (and cache misses) impact fewer dependent uops, allowing out-of-order execution to keep working on more stuff while waiting.

Doing address calculation in a 256b vector and extracting pointers instead of indices would make vmovhps loads cheaper on Intel (indexed loads can't stay micro-fused to vmovhps (footnote 2)). See the previous bullet point. But vmovq loads/stores are always a single uop, and vmovhps indexed stores can stay micro-fused on Haswell and later, so it's break-even for front-end throughput and worse on AMD or KNL. It also means more unfused-domain uops for the scheduler / execution units, which looks like more of a potential bottleneck than port2/3 AGU pressure. The only advantage is that the store-address uops can run on port 7, relieving some pressure.
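For reference, a minimal self-contained sketch of the manual gather / vpor / manual scatter for one 128-bit half (two qword elements), along the lines of the linked answer. The function name is made up, and it assumes the two destination pointers are distinct:

```cpp
#include <emmintrin.h>   // SSE2: movq / movhps / por
#include <stdint.h>

// One 128-bit half of the manual gather -> vpor -> manual scatter:
// movq-load the qword at p0 into the low half, movhps-load *p1 into the
// high half, OR in the two set-bit masks, then movq/movhps-store back.
// Correctness requires p0 != p1, just like a HW gather/scatter would.
static inline void gather_or_scatter_pair(uint64_t *p0, uint64_t *p1,
                                          __m128i set_bits)
{
    __m128i v = _mm_loadl_epi64((const __m128i *)p0);            // movq load
    v = _mm_castpd_si128(_mm_loadh_pd(_mm_castsi128_pd(v),
                                      (const double *)p1));      // movhps load
    v = _mm_or_si128(v, set_bits);                               // por
    _mm_storel_epi64((__m128i *)p0, v);                          // movq store
    _mm_storeh_pd((double *)p1, _mm_castsi128_pd(v));            // movhps store
}
```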

AVX2 gives us one new option:

  • AVX2 vpgatherqq for the gather (_mm256_i64gather_epi64(sieveX, srli_result, 8)), then extract indices and manually scatter. So it's exactly like the manual gather / manual scatter, except you replace the manual gather with an AVX2 hardware gather. (Two 128-bit gathers cost more than one 256-bit gather, so you would want to take the instruction-level parallelism hit and gather into a single 256-bit register).

Possibly a win on Skylake (where vpgatherqq ymm is 4 uops / 4c throughput, plus 1 uop of setup), but not on Broadwell (9 uops, one per 6c throughput) and definitely not on Haswell (22 uops / 9c throughput). You do need the indices in scalar registers anyway, so you're only saving the manual-gather part of the work. That's pretty cheap.
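A sketch of what the gather + SIMD OR step could look like with intrinsics (the scatter back to memory is still the manual extract/store sequence described earlier; function and variable names here are invented):

```cpp
#include <immintrin.h>
#include <stdint.h>

// AVX2 hardware gather of 4 qwords (all-ones mask = gather every element),
// then SIMD OR.  The result still has to be scattered back with the manual
// extract/store sequence, and the indices must be unique for that to be safe.
__attribute__((target("avx2")))
void gather_and_or(uint64_t out[4], const uint64_t *sieveX,
                   const uint64_t idx[4], const uint64_t set_bits[4])
{
    __m256i vidx  = _mm256_loadu_si256((const __m256i *)idx);
    __m256i vbits = _mm256_loadu_si256((const __m256i *)set_bits);
    __m256i old   = _mm256_i64gather_epi64((const long long *)sieveX, vidx, 8);
    _mm256_storeu_si256((__m256i *)out, _mm256_or_si256(old, vbits));
}
```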


Total cost for each strategy on Skylake

It looks like this won't bottleneck badly on any one port. GP reg->xmm needs port 5, but xmm->int needs port 0 on SnB-family CPUs, so it's less likely to bottleneck on port 5 when mixed with the shuffles needed for extracting. (e.g. vpextrq rax, xmm0, 1 is a 2 uop instruction, one port 5 shuffle uop to grab the high qword, and a port 0 uop to send that data from SIMD to the integer domain.)

So your SIMD calculation where you need to frequently extract a vector to scalar is less bad than if you needed to frequently insert scalar calculation results into vectors. See also Loading an xmm from GP regs, but that's talking about data that starts in GP regs, not memory.

  • extract both / scalar OR: Total = 24 uops = 6 cycles of front-end throughput.

  • vpaddq + vpand address calc (2 uops for port 0/1/5 on Skylake)

  • 2x vextracti128 (2 uops for port 5)

  • 4x vmovq (4 p0)

  • 4x vpextrq (8: 4p0 4p5)

  • 4x or [r], r (4x2 = 8 front-end uops. backend: 4p0156 4p23 (load) 4p237 (store-address) 4p4 (store-data)). Non-indexed addressing mode.

Total = 6 uops for p5, just barely fits. Store/reload for a data extract looks sensible, if you could get your compiler to do that. (But compilers don't typically model the pipeline in enough detail to use a mix of strategies in the same loop to balance port pressure.)

  • manual gather/scatter: 20 uops, 5 cycles of front-end throughput (Haswell / BDW / Skylake). Also good on Ryzen.

  • (optional, probably not worth it): vpaddq + vpand address calc (2 uops for port 0/1/5 on Skylake) Skip these if you could use non-VEX movhps for a 1-uop micro-fused indexed load. (But then p237 stores become p23).

  • vextracti128 pointers (1 uop for port 5)

  • 2x vmovq extract (2p0)

  • 2x vpextrq (4 = 2p0 2p5)

  • 2x vmovq load (2p23)

  • 2x vmovhps xmm, xmm, [r] non-indexed load (2 front-end uops micro-fused: 2p23 + 2p5)

  • vextracti128 split the data (p5)

  • 2x vpor xmm (2p015)

  • 2x vmovq store (2x 1 micro-fused uop, 2p237 + 2p4)

  • 2x vmovhps store (2x 1 micro-fused uop, 2p237 + 2p4)

Port bottlenecks: 4 p0 and 4 p5 fits comfortably in 5 cycles, especially when you mix this with your loop which can run several of its uops on port 1. On Haswell paddq is only p15 (not p015), and shifts are only p0 (not p01). AVX2 _mm256_sllv_epi64 is 1 uop (p01) on Skylake, but on Haswell it's 3 uops = 2p0 + p5. So Haswell might be closer to a p0 or p5 bottleneck for this loop, in which case you might want to look at a store/reload extract strategy for one vector of indices.

Skipping the SIMD address calc is probably good, because AGU pressure doesn't look like a problem unless you use a store/reload extract. And it means fewer instructions / smaller code-size and fewer uops in the uop cache. (un-lamination doesn't happen until after the decoders / uop cache, so you still benefit from micro-fusion in the early parts of the front-end, just not at the issue bottleneck.)

  • Skylake AVX2 gather / manual scatter: Total = 18 uops, 4.5 cycles of front-end throughput. (Worse on any earlier uarch or on AMD).

  • vextracti128 indices (1 uop for port 5)

  • 2x vmovq extract (2p0)

  • 2x vpextrq (4 = 2p0 2p5)

  • vpcmpeqd ymm0,ymm0,ymm0 create an all-ones mask for vpgatherqq (p015)

  • vpgatherqq ymm1, [rdi + ymm2*8], ymm0 4 uops for some ports.

  • vpor ymm (p015)

  • vextracti128 on the OR result (p5)

  • 2x vmovq store (2x 1 micro-fused uop, 2p23 + 2p4). Note no port7, we're using indexed stores.

  • 2x vmovhps store (2x 1 micro-fused uop, 2p23 + 2p4).

So even with the best throughput choice, we're still only managing 4 loads / 4 stores per 4.5 cycles, and that's without considering the SIMD work in the loop which costs some front-end throughput. So we're not close to bottlenecking on AGU throughput and having to worry about using port 7.

We could maybe think about store/reload for one of the extracts (if we were the compiler), replacing the 7-uop, 5-instruction vextracti128 / 2x vmovq / 2x vpextrq sequence with a 5-uop store + 4x reload.


Overall: One loop until we're done with conflicts, then a SIMD gather loop

You say that after a certain point, you don't have conflicts (overlap) between the indices like cur[0] == cur[2].

You definitely want a separate loop that doesn't check for conflicts at all to take advantage of this. Even if you had AVX512, Skylake's vpconflictq is micro-coded and not fast. (KNL has single-uop vpconflictq but it's still faster to avoid it entirely).

I'll leave it up to you (or a separate question) how to figure out for sure when you're done with conflicts and can leave the loop that accounts for that possibility.

You probably want the extract indices + data strategy while there can be conflicts. SIMD conflict checking is possible, but it's not cheap, 11 uops for 32-bit elements: Fallback implementation for conflict detection in AVX2. A qword version is obviously much cheaper than dword (fewer shuffles and compares to get all against all), but you probably still only want to do it every 10 iterations or so of your extract loop.
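As a stand-in for that SIMD check, even a naive scalar all-pairs test is cheap enough if it only runs every ~10 iterations. This checks qword-index collisions among 8 pending bit-indices (the helper name is invented; the real SIMD version in the linked answer uses shuffles + compares instead):

```cpp
#include <stdint.h>

// Stand-in for the SIMD conflict check: a naive all-pairs test over the 8
// pending bit-indices.  Two indices conflict if they touch the same qword
// of sieveX.  Running this only every ~10 iterations keeps its cost small.
static bool any_qword_conflict(const uint64_t bitidx[8])
{
    for (int i = 0; i < 8; i++)
        for (int j = i + 1; j < 8; j++)
            if ((bitidx[i] >> 6) == (bitidx[j] >> 6))
                return true;
    return false;
}
```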

There's not a huge speedup from the best scalar-OR version to the best gather version (6 cycles vs. 4.5, and that's not accounting for the other work in the loop, so the ratio is even smaller than that). So getting out of the slightly slower version ASAP is not worth making it a lot slower with frequent conflict checks.

So if you can reliably detect when you're done with conflicts, use something like

int conflictcheck = 10;

do {
    if (--conflictcheck == 0) {
        // vector stuff to check for conflicts
        if (no conflicts now or in the future)
            break;

        conflictcheck = 10;  // reset the down-counter
    }

    // main loop body: extract -> scalar OR strategy

} while(blah);


// then fall into the gather/scatter loop.
do {
    // main loop body: gather + manual scatter strategy
} while();

That should compile to a dec / je which only costs 1 uop in the not-taken case.

Doing 9 extra iterations total of the slightly-slower loop is much better than doing thousands of extra expensive conflict checks.


Footnote 1:

If sieveX is static and you're building non-PIC code on Linux (not MacOS) then its address will fit in a disp32 as part of a [reg+disp32] addressing mode. In that case you can leave out the vpaddq. But getting a compiler to treat a uint64_t as an already-scaled array index (with its low bits cleared) would be ugly. Probably have to cast sieveX to uintptr_t and add, then cast back.
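The cast dance might look something like this (hypothetical helper; scaled_off is assumed to already be the qword-aligned byte offset, i.e. (bit >> 3) & ~7):

```cpp
#include <stdint.h>

// Hypothetical helper for footnote 1's cast dance: add an already-scaled,
// qword-aligned byte offset to sieveX without the compiler re-scaling it
// as an array index.
static inline uint64_t *qword_addr(uint64_t *sieveX, uint64_t scaled_off)
{
    return (uint64_t *)((uintptr_t)sieveX + scaled_off);
}
```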

This isn't possible in a PIE executable or shared library (where 32-bit absolute addresses aren't allowed), or on OS X at all (where static addresses are always above 2^32). I'm not sure what Windows allows. Note that [disp32 + reg*8] only has 1 register, but is still an indexed addressing mode so all the SnB-family penalties apply. But if you don't need scaling, reg + disp32 is just base + disp32.

Footnote 2: Fun fact: non-VEX movhps loads can stay micro-fused on Haswell. It won't cause an SSE/AVX stall on Skylake, but you won't get a compiler to emit the non-VEX version in the middle of an AVX2 function.

IACA (Intel's static analysis tool) gets this wrong, though. :( What is IACA and how do I use it?.

This is basically a missed-optimization for -mtune=skylake, but it would stall on Haswell: Why is this SSE code 6 times slower without VZEROUPPER on Skylake?.

The "penalty A" (execute SSE with dirty upper) on Skylake is merely a false dependency on that one register. (And a merging uop for instructions that would otherwise be write-only, but movhps is already a read-modify-write of its destination.) I tested this on Skylake with Linux perf to count uops, with this loop:

    mov     r15d, 100000000

.loop:
    vpaddq  ymm0, ymm1, ymm2      ; dirty the upper part
    vpaddq  ymm3, ymm1, ymm2      ; dirty another register for good measure

    vmovq  xmm0, [rdi+rbx*8]       ; zero the full register, breaking dependencies
    movhps xmm0, [rdi+rbx*8+8]     ; RMW the low 128 bits
                          ; fast on Skylake, will stall on Haswell

    dec r15d
    jnz .loop

The loop runs at ~1.25 cycles per iteration on Skylake (i7-6700k), maxing out the front-end throughput of 4 uops per clock. 5 total fused-domain uops (uops_issued.any), 6 unfused-domain uops (uops_executed.thread). So micro-fusion was definitely happening for movhps without any SSE/AVX problems.

Changing it to vmovhps xmm0, xmm0, [rdi+rbx*8+8] slowed it down to 1.50 cycles per iteration, now 6 fused-domain, but still the same 6 unfused-domain uops.

There's no extra uop if the upper half of ymm0 is dirty when movhps xmm0, [mem] runs. I tested by commenting out the vmovq. But changing vmovq to movq does result in an extra uop: movq becomes a micro-fused load+merge that replaces the low 64 bits (and still zeros the upper 64 bits of xmm0 so it's not quite movlps).


Also note that pinsrq xmm0, [mem], 1 can't micro-fuse even without VEX. But with VEX, you should prefer vmovhps for code-size reasons.

Your compiler may want to "optimize" the intrinsic for movhps on integer data into vpinsrq, though; I didn't check.


I just looked at exactly what you're doing here: For the mod1 = mod3 = _mm256_set1_epi64x(1); case, you're just setting single bits in a bitmap with elements of ans as the index.

And it's unrolled by two, with ans and ans2 running in parallel, using mod1 << ans and mod3 << ans2. Comment your code and explain what's going on in the big picture using English text! This is just a very complicated implementation of the bit-setting loop of a normal Sieve of Eratosthenes. (So it would have been nice if the question had said so in the first place.)
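For reference, the plain scalar version of what one stride of that vectorized bit-setting loop computes (with mod = set1_epi64x(1)) is just:

```cpp
#include <stdint.h>

// Scalar reference for one stride of the bit-setting loop: set bit cur,
// cur+factor, cur+2*factor, ... in a qword bitmap, i.e. the inner loop of
// a normal Sieve of Eratosthenes.
static void set_bits(uint64_t *sieve, uint64_t cur, uint64_t factor,
                     uint64_t total)
{
    for (; cur < total; cur += factor)
        sieve[cur >> 6] |= 1ULL << (cur & 63);
}
```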

Unrolling with multiple start/strides in parallel is a very good optimization, so that you set multiple bits in a cache line while it's still hot in L1d. Cache-blocking for fewer different factors at once has similar advantages. Iterate over the same 8kiB or 16kiB chunk of memory repeatedly for multiple factors (strides) before moving on to the next. Unrolling with 4 offsets for each of 2 different strides could be a good way to create more ILP.

The more strides you run in parallel, the slower you go through new cache lines the first time you touch them, though. (Giving cache / TLB prefetch room to avoid even an initial stall). So cache blocking doesn't remove all the benefit of multiple strides.


Possible special case for strides < 256

A single 256-bit vector load/VPOR/store can set multiple bits. The trick is creating a vector constant, or set of vector constants, with bits in the right position. The repeating pattern is something like LCM(256, bit_stride) bits long, though. For example stride=3 would repeat in a pattern that's 3 vectors long. This very quickly gets unusable for odd / prime strides unless there's something more clever :(
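A portable scalar sketch of the repeating-pattern idea, using 64-bit qwords instead of 256-bit vectors (the pattern repeats every stride / gcd(stride, 64) qwords; it starts at bit 0 for simplicity, and the names are invented):

```cpp
#include <stdint.h>
#include <stddef.h>

static uint64_t gcd64(uint64_t a, uint64_t b)
{
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

// For stride < 256, precompute the repeating qword pattern once (it repeats
// every stride / gcd(stride, 64) qwords, because 64 * period is a multiple
// of stride), then just OR it over the sieve with no per-bit work.
static void set_bits_pattern(uint64_t *sieve, size_t nqwords, uint64_t stride)
{
    uint64_t period = stride / gcd64(stride, 64);   // pattern length in qwords
    uint64_t pat[256] = {0};                        // period < 256 for stride < 256
    for (uint64_t bit = 0; bit < period * 64; bit += stride)
        pat[bit >> 6] |= 1ULL << (bit & 63);
    for (size_t q = 0; q < nqwords; q++)
        sieve[q] |= pat[q % period];
}
```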

64-bit scalar is interesting because bitwise rotate is available to create a sequence of patterns, but variable-count rotate on SnB-family CPUs costs 2 uops.

There might be more you can do with this; maybe unaligned loads could help somehow.

A repeating pattern of bitmasks could be useful even for the large-stride case, e.g. rotating by stride % 8 every time. But that would be more useful if you were JITing a loop that hard-coded the pattern into or byte [mem], imm8, with an unroll factor chosen to be congruent with repeat length.


Conflict reduction with narrower loads/stores

You don't have to load/modify/store 64-bit chunks when you're only setting a single bit. The narrower your RMW operations, the closer your bit-indices can be without conflicting.

(But you don't have a long loop-carried dep chain on the same location; you will move on before OoO exec stalls waiting for a reload at the end of a long chain. So if conflicts aren't a correctness problem, they're unlikely to make a big perf difference here. Unlike a bitmap histogram or something, where a long chain of repeated hits on nearby bits could easily happen.)

32-bit elements would be an obvious choice. x86 can efficiently load/store dwords to/from SIMD registers as well as scalar. (scalar byte ops are efficient, too, but byte stores from SIMD regs always require multiple uops with pextrb.)

If you're not gathering into SIMD registers, the SIMD element width for ans / ans2 doesn't have to match the RMW width. 32-bit RMW has advantages over 8-bit if you want to split a bit-index into address / bit-offset in scalar, using shifts or bts that implicitly mask the shift count to 32 bits (or 64 bits for 64-bit shifts). But 8-bit versions of shlx or bts don't exist.

The main advantage of using 64-bit SIMD elements is if you're calculating a pointer instead of just an index. If you could restrict your sieveX to 32 bits you'd still be able to do this. e.g. allocate with mmap(..., MAP_32BIT|MAP_ANONYMOUS, ...) on Linux. That's assuming you don't need more than 2^32 bits (512MiB) sieve space, so your bit indices always fit in 32-bit elements. If that's not the case, you could still use 32-bit element vectors up to that point, then use your current loop for the high part.
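A minimal Linux-only sketch of that allocation (assumes x86-64 Linux; MAP_32BIT is a hint flag that doesn't exist on other systems, hence the NULL-on-failure hedge):

```cpp
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

// Allocate the sieve below 4GiB so absolute addresses fit in 32-bit SIMD
// elements.  MAP_32BIT is x86-64 Linux only; returns NULL on failure.
static uint64_t *alloc_sieve_low32(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return (p == MAP_FAILED) ? NULL : (uint64_t *)p;
}
```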

If you use 32-bit SIMD elements without restricting sieveX to be a 32-bit pointer, you'd have to give up on using SIMD pointer calculations and just extract a bit-index, or still split in SIMD into idx/bit and extract both.

(With 32-bit elements, a SIMD -> scalar strategy based on store/reload looks even more attractive, but in C that's mostly up to your compiler.)

If you were manually gathering into 32-bit elements, you couldn't use movhps anymore. You'd have to use pinsrd / pextrd for the high 3 elements, and those never micro-fuse / always need a port5 uop on SnB-family. (Unlike movhps which is a pure store). But that means vpinsrd is still 2 uops with an indexed addressing mode. You could still use vmovhps for element 2 (then overwrite the top dword of the vector with vpinsrd); unaligned loads are cheap and it's ok to overlap the next element. But you can't do movhps stores, and that's where it was really good.


There are two big performance problems with your current strategy:

Apparently you're sometimes using this with some elements of mod1 or mod3 being 0, resulting in completely useless wasted work, doing [mem] |= 0 for those strides.

I think once an element in ans or ans2 reaches total, you're going to fall out of the inner loop and do ans -= sum 1 every time through the inner loop. You don't necessarily want to reset it back ans = sum (for that element) to redo the ORing (setting bits that were already set), because that memory will be cold in cache. What we really want is to pack the remaining still-in-use elements into known locations and enter other versions of the loop that only do 7, then 6, then 5 total elements. Then we're down to only 1 vector.

That seems really clunky. A better strategy for one element hitting the end might be to finish the remaining three in that vector with scalar, one at a time, then run the remaining single __m256i vector. If the strides are all nearby, you probably get good cache locality.


Cheaper scalar, or maybe still SIMD but extract only a bit-index

Splitting the bit-index into a qword index and a bitmask with SIMD and then extracting both separately costs a lot of uops for the scalar-OR case: so many that you're not bottlenecking on 1-per-clock store throughput, even with all the optimizations in my scatter/gather answer. (Cache misses may slow this down sometimes, but fewer front-end uops means a larger out-of-order window to find parallelism and keep more memory ops in flight.)

If we can get the compiler to make good scalar code to split a bit-index, we could consider pure scalar. Or at least extracting only bit-indices and skipping the SIMD shift/mask stuff.

It's too bad scalar memory-destination bts is not fast. bts [rdi], rax would set that bit in the bit-string, even if that's outside the dword selected by [rdi]. (That kind of crazy-CISC behaviour is why it's not fast, though! like 10 uops on Skylake.)

Pure scalar may not be ideal, though. I was playing around with this on Godbolt:

#include <immintrin.h>
#include <stdint.h>
#include <algorithm>

// Sieve the bits in array sieveX for later use
void sieveFactors(uint64_t *sieveX64, unsigned cur1, unsigned cur2, unsigned factor1, unsigned factor2)
{
    const uint64_t totalX = 5000000;
#ifdef USE_AVX2
//...
#else
     //uint64_t cur = 58;
     //uint64_t cur2 = 142;
     //uint64_t factor = 67;
     uint32_t *sieveX = (uint32_t*)sieveX64;

    if (cur1 > cur2) {
        // TODO: if factors can be different, properly check which will end first
        std::swap(cur1, cur2);
        std::swap(factor1, factor2);
    }
    // factor1 = factor2;  // is this always true?

    while (cur2 < totalX) {
         sieveX[cur1 >> 5] |= (1U << (cur1 & 0x1f));
         sieveX[cur2 >> 5] |= (1U << (cur2 & 0x1f));
         cur1 += factor1;
         cur2 += factor2;
    }
    while (cur1 < totalX) {
         sieveX[cur1 >> 5] |= (1U << (cur1 & 0x1f));
         cur1 += factor1;
    }
#endif
}

Note how I replaced your outer if() to choose between loops with sorting cur1, cur2.

GCC and clang put a 1 in a register outside the loop, and use shlx r9d, ecx, esi inside the loop to do 1U << (cur1 & 0x1f) in a single uop without destroying the 1. (MSVC uses load / BTS / store, but it's clunky with a lot of mov instructions. I don't know how to tell MSVC it's allowed to use BMI2.)

If an indexed addressing mode for or [mem], reg didn't cost an extra uop, this would be great.

The problem is that you need a shr reg, 5 in there somewhere, and that's destructive. Putting 5 in a register and using it for a copy-and-shift of the bit-index would be an ideal setup for load / BTS / store, but compilers don't seem to know that optimization.

Optimal(?) scalar split and use of a bit-index

   mov   ecx, 5    ; outside the loop

.loop:
    ; ESI is the bit-index.
    ; Could be pure scalar, or could come from an extract of ans directly

    shrx  edx, esi, ecx           ; EDX = ESI>>5 = dword index
    mov   eax, [rdi + rdx*4]
    bts   eax, esi                ; set esi % 32 in EAX
    mov   [rdi + rdx*4], eax      ; store the updated dword


    more unrolled iterations

    ; add   esi, r10d               ; ans += factor if we're doing scalar

    ...
    cmp/jb .loop

So given a bit-index in a GP register, that's 4 uops to set the bit in memory. Notice that the load and store are both with mov, so indexed addressing modes have no penalty on Haswell and later.

But the best I could get compilers to make was 5, I think, using shlx / shr / or [mem], reg. (With an indexed addressing mode the or is 3 uops instead of 2.)

I think if you're willing to use hand-written asm, you can go faster with this scalar and ditch SIMD entirely. Conflicts are never a correctness problem for this.

Maybe you can even get a compiler to emit something comparable, but even a single extra uop per unrolled RMW is a big deal.