Find nth SET bit in an int

v-1 has a zero where v has its least significant "one" bit, while all more significant bits are the same. This leads to the following function:

int ffsn(unsigned int v, int n) {
   for (int i=0; i<n-1; i++) {
      v &= v-1; // remove the least significant bit
   }
   return v & ~(v-1); // extract the least significant bit
}

Nowadays this is very easy with PDEP from the BMI2 instruction set. Here is a 64-bit version with some examples:

#include <cassert>
#include <cstdint>
#include <x86intrin.h>

inline uint64_t nthset(uint64_t x, unsigned n) {
    return _pdep_u64(1ULL << n, x);
}

int main() {
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 0) ==
                  0b0000'0000'0000'0000'0000'0000'0010'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 1) ==
                  0b0000'0000'0000'0000'0000'0000'1000'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 3) ==
                  0b0000'0000'0000'0000'0100'0000'0000'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 9) ==
                  0b0000'1000'0000'0000'0000'0000'0000'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 10) ==
                  0b0000'0000'0000'0000'0000'0000'0000'0000);
}

If you just want the (zero-based) index of the nth set bit, add a trailing zero count.

inline unsigned nthset(uint64_t x, unsigned n) {
    return _tzcnt_u64(_pdep_u64(1ULL << n, x));
}

It turns out that it is indeed possible to do this with no loops. It is fastest to precompute the (at least) 8 bit version of this problem. Of course, these tables use up cache space, but there should still be a net speedup in virtually all modern pc scenarios. In this code, n=0 returns the least set bit, n=1 is second-to-least, etc.

Solution with __popcnt

There is a solution using the __popcnt intrinsic (you need __popcnt to be extremely fast or any perf gains over a simple loop solution will be moot. Fortunately most SSE4+ era processors support it).

// lookup table for sub-problem: 8-bit v
byte PRECOMP[256][8] = { .... } // PRECOMP[v][n] for v < 256 and n < 8

ulong nthSetBit(ulong v, ulong n) {
    ulong p = __popcnt(v & 0xFFFF);
    ulong shift = 0;
    if (p <= n) {
        v >>= 16;
        shift += 16;
        n -= p;
    }
    p = __popcnt(v & 0xFF);
    if (p <= n) {
        shift += 8;
        v >>= 8;
        n -= p;
    }

    if (n >= 8) return 0; // optional safety, in case n > # of set bits
    return PRECOMP[v & 0xFF][n] << shift;
}

This illustrates how the divide and conquer approach works.

General Solution

There is also a solution for "general" architectures- without __popcnt. It can be done by processing in 8-bit chunks. You need one more lookup table that tells you the popcnt of a byte:

byte PRECOMP[256][8] = { .... } // PRECOMP[v][n] for v<256 and n < 8
byte POPCNT[256] = { ... } // POPCNT[v] is the number of set bits in v. (v < 256)

ulong nthSetBit(ulong v, ulong n) {
    ulong p = POPCNT[v & 0xFF];
    ulong shift = 0;
    if (p <= n) {
        n -= p;
        v >>= 8;
        shift += 8;
        p = POPCNT[v & 0xFF];
        if (p <= n) {
            n -= p;
            shift += 8;
            v >>= 8;
            p = POPCNT[v & 0xFF];
            if (p <= n) {
                n -= p;
                shift += 8;
                v >>= 8;
            }
        }
    }

    if (n >= 8) return 0; // optional safety, in case n > # of set bits
    return PRECOMP[v & 0xFF][n] << shift;
}

This could, of course, be done with a loop, but the unrolled form is faster and the unusual form of the loop would make it unlikely that the compiler could automatically unroll it for you.