Shift masked bits to the lsb

This operation is known as compress right. It is implemented as part of BMI2 as the PEXT instruction, in Intel processors as of Haswell.

Unfortunately, without hardware support is it a quite annoying operation. Of course there is an obvious solution, just moving the bits one by one in a loop, here is the one given by Hackers Delight:

unsigned compress(unsigned x, unsigned m) {
   unsigned r, s, b;    // Result, shift, mask bit. 

   r = 0; 
   s = 0; 
   do {
      b = m & 1; 
      r = r | ((x & b) << s); 
      s = s + b; 
      x = x >> 1; 
      m = m >> 1; 
   } while (m != 0); 
   return r; 
} 

But there is an other way, also given by Hackers Delight, which does less looping (number of iteration logarithmic in the number of bits) but more per iteration:

unsigned compress(unsigned x, unsigned m) {
   unsigned mk, mp, mv, t; 
   int i; 

   x = x & m;           // Clear irrelevant bits. 
   mk = ~m << 1;        // We will count 0's to right. 

   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);             // Parallel prefix. 
      mp = mp ^ (mp << 2); 
      mp = mp ^ (mp << 4); 
      mp = mp ^ (mp << 8); 
      mp = mp ^ (mp << 16); 
      mv = mp & m;                     // Bits to move. 
      m = m ^ mv | (mv >> (1 << i));   // Compress m. 
      t = x & mv; 
      x = x ^ t | (t >> (1 << i));     // Compress x. 
      mk = mk & ~mp; 
   } 
   return x; 
}

Notice that a lot of the values there depend only on m. Since you only have 512 different masks, you could precompute those and simplify the code to something like this (not tested)

unsigned compress(unsigned x, int maskindex) {
   unsigned t; 
   int i; 

   x = x & masks[maskindex][0];

   for (i = 0; i < 5; i++) {
      t = x & masks[maskindex][i + 1]; 
      x = x ^ t | (t >> (1 << i));
   } 
   return x; 
}

Of course all of these can be turned into "not a loop" by unrolling, the second and third ways are probably more suitable for that. That's a bit of cheat however.


You can use the pack-by-multiplication technique similar to the one described here. This way you don't need any loop and can mix the bits in any order.

For example with the mask 0b10101001 == 0xA9 like above and 8-bit data abcdefgh (with a-h is the 8 bits) you can use the below expression to get 0000aceh

uint8_t compress_maskA9(uint8_t x)
{
    const uint8_t mask1 = 0xA9 & 0xF0;
    const uint8_t mask2 = 0xA9 & 0x0F;
    return (((x & mask1)*0x03000000 >> 28) & 0x0C) | ((x & mask2)*0x50000000 >> 30);
}

In this specific case there are some overlaps of the 4 bits while adding (which incur unexpected carry) during the multiplication step, so I've split them into 2 parts, the first one extracts bit a and c, then e and h will be extracted in the latter part. There are other ways to split the bits as well, like a & h then c & e. You can see the results compared to Harold's function live on ideone

An alternate way with only one multiplication

const uint32_t X = (x << 8) | x;
return (X & 0x8821)*0x12050000 >> 28;

I got this by duplicating the bits so that they're spaced out farther, leaving enough space to avoid the carry. This is often better than splitting into 2 multiplications


If you want the result's bits reversed (i.e. heca0000) you can easily change the magic numbers accordingly

// result: he00 | 00ca;
return (((x & 0x09)*0x88000000 >> 28) & 0x0C) | (((x & 0xA0)*0x04800000) >> 30);

or you can also extract the 3 bits e, c and a at the same time, leaving h separately (as I mentioned above, there are often multiple solutions) and you need only one multiplication

return ((x & 0xA8)*0x12400000 >> 29) | (x & 0x01) << 3; // result: 0eca | h000

But there might be a better alternative like the above second snippet

const uint32_t X = (x << 8) | x;
return (X & 0x2881)*0x80290000 >> 28

Correctness check: http://ideone.com/PYUkty

For a larger number of masks you can precompute the magic numbers correspond to those masks and store them in an array so that you can look them up immediately for use. I calculated those mask by hand but you can do that automatically


Explanation

We have abcdefgh & mask1 = a0c00000. Multiply it with magic1

    ........................a0c00000
 ×  00000011000000000000000000000000 (magic1 = 0x03000000)
    ────────────────────────────────
    a0c00000........................
 + a0c00000......................... (the leading "a" bit is outside int's range
    ────────────────────────────────  so it'll be truncated)
r1 = acc.............................

=> (r1 >> 28) & 0x0C = 0000ac00

Similarly we multiply abcdefgh & mask2 = 0000e00h with magic2

  ........................0000e00h
× 01010000000000000000000000000000 (magic2 = 0x50000000)
  ────────────────────────────────
  e00h............................
+ 0h..............................
  ────────────────────────────────
r2 = eh..............................

=> (r2 >> 30) = 000000eh

Combine them together we have the expected result

((r1 >> 28) & 0x0C) | (r2 >> 30) = 0000aceh

And here's the demo for the second snippet

                  abcdefghabcdefgh
&                 1000100000100001 (0x8821)
  ────────────────────────────────
                  a000e00000c0000h
× 00010010000001010000000000000000 (0x12050000)
  ────────────────────────────────
  000h
  00e00000c0000h
+ 0c0000h
  a000e00000c0000h
  ────────────────────────────────
= acehe0h0c0c00h0h
& 11110000000000000000000000000000
  ────────────────────────────────
= aceh

For the reversed order case:

                  abcdefghabcdefgh
&                 0010100010000001 (0x2881)
  ────────────────────────────────
                  00c0e000a000000h
x 10000000001010010000000000000000 (0x80290000)
  ────────────────────────────────
  000a000000h
  00c0e000a000000h
+ 0e000a000000h
  h
  ────────────────────────────────
  hecaea00a0h0h00h
& 11110000000000000000000000000000
  ────────────────────────────────
= heca

Related:

  • How to create a byte out of 8 bool values (and vice versa)?
  • Redistribute least significant bits from a 4-byte array to a nibble

Tags:

C++

C

Bitmask