Efficiently randomly shuffling the bits of a sequence of words

Asymptotically, the running time is obviously O(N), where N is the number of bits. Our goal is to improve the constants involved.

Disclaimer: the description of the proposed algorithm is a rough sketch. A lot of pieces still need to be added and, especially, a lot of details need to be taken care of to make it work correctly. The approximate execution time, however, will not differ from what is claimed here.


Baseline Algorithm

The most obvious approach is the textbook one: it takes N operations, each of which calls the random generator (R milliseconds), reads the values of two different bits, and writes new values back to them (4 * A milliseconds in total, where A is the time to read or write one bit). Suppose additionally that an array lookup takes C milliseconds. The total time of this algorithm is then approximately N * (R + 4 * A + 2 * C) milliseconds. It is also reasonable to assume that random number generation dominates, i.e. R >> A == C.
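
This is presumably something like a Fisher-Yates shuffle applied to individual bits. A minimal sketch of that baseline follows; get_bit, set_bit and random_index are hypothetical helpers standing in for the A-cost bit accesses and the R-cost random generator:

#include <cstdlib>

// Hypothetical helpers: read/write a single bit, and pick a random index.
inline int  get_bit(const unsigned char *f, long i) { return (f[i / 8] >> (i % 8)) & 1; }
inline void set_bit(unsigned char *f, long i, int v) {
  f[i / 8] = (f[i / 8] & ~(1 << (i % 8))) | (v << (i % 8));
}
inline long random_index(long n) { return std::rand() % n; }  // placeholder RNG

void baseline_shuffle(unsigned char *bit_field, long n_bits) {
  for (long i = n_bits - 1; i > 0; --i) {
    long j = random_index(i + 1);      // R: one random number per bit
    int bi = get_bit(bit_field, i);    // 2 reads + 2 writes: 4 * A
    int bj = get_bit(bit_field, j);
    set_bit(bit_field, i, bj);
    set_bit(bit_field, j, bi);
  }
}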


Proposed Algorithm

Suppose the bits are stored as an array of bytes, i.e. we will work with whole bytes at a time:

const int field_size = N / 8;  // assume N is a multiple of 8
unsigned char bit_field[field_size];

First, let's count the number of 1 bits in our bit field. For that, we can use a lookup table and iterate over the bit field as a byte array:

// Generate the lookup table; you may adapt it with `constexpr`
// to have it computed at compile time.
int bitcount_lookup[256];
for (int i = 0; i < 256; ++i) {
  bitcount_lookup[i] = 0;
  for (int b = 0; b < 8; ++b)
    bitcount_lookup[i] += (i >> b) & 1;
}

We can treat this as preprocessing overhead (as it may just as well be calculated at compile time) and say that it takes 0 milliseconds. Now, counting the number of 1 bits is easy (the following takes (N / 8) * C milliseconds):

int bitcount = 0;
for (auto *it = bit_field; it != bit_field + field_size; ++it)
  bitcount += bitcount_lookup[*it];

Now, we randomly generate N / 8 numbers (let's call the resulting array gencnt[N / 8]), each in the range [0..8], such that they sum up to bitcount. This is a bit tricky and hard to do uniformly (the "correct" algorithm for generating a uniform distribution is quite slow compared to the baseline). A fairly uniform-ish but quick solution goes roughly like this (a code sketch follows the list):

  • Fill the gencnt[N / 8] array with the value v = bitcount / (N / 8).
  • Randomly choose N / 16 "black" cells; the rest are "white". The algorithm is similar to a random permutation, but over only half of the array.
  • Generate N / 16 random numbers in the range [0..v]. Let's call them tmp[N / 16].
  • Increase the "black" cells by the tmp[i] values, and decrease the "white" cells by the same tmp[i]. This keeps the overall sum equal to bitcount.
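
Here is a rough sketch of that generation step. It assumes field_size = N / 8 and an already computed bitcount, and it uses std::mt19937 with std::shuffle as the random source; clamping cells to the valid range [0..8] is one of the details deliberately glossed over, as per the disclaimer above:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

std::vector<int> make_gencnt(int field_size, int bitcount, std::mt19937 &rng) {
  const int v = bitcount / field_size;            // base value per cell
  const int rem = bitcount - v * field_size;      // rounding remainder, in [0, field_size)
  std::vector<int> gencnt(field_size, v);
  for (int i = 0; i < rem; ++i) gencnt[i] += 1;   // spread the remainder so the sum is exact

  // Randomly pick field_size / 2 "black" cells; the remaining ones are "white".
  std::vector<int> idx(field_size);
  std::iota(idx.begin(), idx.end(), 0);
  std::shuffle(idx.begin(), idx.end(), rng);      // first half = black, second half = white

  std::uniform_int_distribution<int> dist(0, v);
  for (int k = 0; k < field_size / 2; ++k) {
    int t = dist(rng);                            // tmp[k] in [0..v]
    gencnt[idx[k]]                  += t;         // black cell gains t
    gencnt[idx[field_size / 2 + k]] -= t;         // white cell loses t; the sum stays bitcount
  }
  return gencnt;
}

For simplicity this sketch shuffles the whole index array; a partial Fisher-Yates over only half of the array, as described in the list above, would do less work and match the cost estimate below more closely.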

After that, we have a uniform-ish, random-ish array gencnt[N / 8], whose values are the numbers of 1 bits in each byte-sized "cell". It was all generated in:

(N / 8) * C   +  (N / 16) * (4 * C)  +  (N / 16) * (R + 2 * C)
^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^
filling step      random coloring          generate & adjust

milliseconds (this estimate assumes a particular implementation that I have in mind). Lastly, we need a lookup table of the bytes with a given number of bits set to 1 (this can again be treated as preprocessing overhead, or even done at compile time as constexpr, so let's assume it takes 0 milliseconds):

// One bucket per possible popcount 0..8; bucket `c` holds all bytes
// with exactly `c` bits set (reusing bitcount_lookup from above).
std::vector<std::vector<unsigned char>> random_lookup(9);
for (int byte = 0; byte < 256; ++byte)
  random_lookup[bitcount_lookup[byte]].push_back(static_cast<unsigned char>(byte));

Then, we can fill our bit_field as follows (which takes roughly (N / 8) * (R + 3 * C) milliseconds):

for (int i = 0; i < field_size; i++) {
  auto &bucket = random_lookup[gencnt[i]];        // bytes with gencnt[i] bits set
  bit_field[i] = bucket[rand() % bucket.size()];  // pick one of them at random
}

Summing everything up, we have the total execution time:

T = (N / 8) * C +
    (N / 8) * C + (N / 16) * (4 * C) + (N / 16) * (R + 2 * C) +
    (N / 8) * (R + 3 * C)

  = N * C * (1/8 + 1/8 + 1/4 + 1/8 + 3/8)  +  N * R * (1/16 + 1/8)

  = N * (C + (3/16) * R)  <  N * (R + 4 * A + 2 * C)
    ^^^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^^
     proposed algorithm        naive baseline algo
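
As a purely illustrative plug-in of numbers: with R = 20 and A = C = 1 (in whatever time unit), the baseline comes out to about 26 * N while the proposed scheme comes out to about 4.75 * N, i.e. more than a 5x difference.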

Although it's not a truly uniform shuffle, it does spread the bits out quite evenly and randomly, it's quite fast, and it hopefully gets the job done in your use case.