Faster divisibility test than % operator?

What you’re doing is called strength reduction: replacing an expensive operation with a series of cheap ones.

The mod instruction is slow on many CPUs because, historically, it was not exercised by several common benchmarks, so designers optimized other instructions instead. Your algorithm will do worse whenever it needs many iterations, and % will win on a CPU where it takes only two clock cycles.

Finally, be aware that there are many shortcuts to take the remainder of division by specific constants. (Although compilers will generally take care of this for you.)
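As a rough illustration of such shortcuts, here is a sketch for 32-bit unsigned values. The helper names `mod8` and `divisible_by_3` are made up for this example; the constants are the usual ones for a power-of-two modulus and for the multiplicative inverse of 3 modulo 2^32.

```c
#include <stdint.h>

/* Remainder by a power of two: keep only the low bits. */
static inline uint32_t mod8(uint32_t x)
{
    return x & 7u;              /* same value as x % 8 for unsigned x */
}

/* Divisibility by 3 without a division instruction.
 * 0xAAAAAAAB is the inverse of 3 modulo 2^32, so the product lands in
 * 0 .. 0xFFFFFFFF/3 exactly when x is a multiple of 3. */
static inline int divisible_by_3(uint32_t x)
{
    return x * UINT32_C(0xAAAAAAAB) <= UINT32_C(0x55555555);
}
```

An optimizing compiler will typically emit something along these lines on its own when the divisor is a compile-time constant, so hand-writing it mostly makes sense as a way to see what is going on.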


I will answer my own question. It seems that I fell victim to branch prediction. The relative size of the operands does not seem to matter, only their order.

Consider the following implementation

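/* Returns 1 if a divides m, 0 otherwise. Assumes a is odd and m >= 1. */
int divisible_ui_p(unsigned int m, unsigned int a)
{
    while (m > a) {
        m -= a;                   /* m - a >= 1, and divisibility by a is preserved */
        m >>= __builtin_ctz(m);   /* drop factors of 2; harmless because a is odd */
    }

    if (m == a) {
        return 1;
    }

    return 0;
}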
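One way to gain some confidence in the routine is to compare it against % over a small range. The sketch below carries its own copy of the routine, written in subtract-and-shift form, and a hypothetical helper `count_mismatches` that is not part of the original benchmark. It assumes a is odd and m is at least 1, which matches the arrays used here.

```c
/* Copy of the routine above (subtract-and-shift form).
 * Assumes a is odd and m >= 1; __builtin_ctz is a GCC/Clang builtin. */
static int divisible_ui_p(unsigned int m, unsigned int a)
{
    while (m > a) {
        m -= a;
        m >>= __builtin_ctz(m);
    }
    return m == a;
}

/* Hypothetical helper: counts the (m, a) pairs with 1 <= m < limit and
 * odd a < limit for which the routine disagrees with the % operator. */
static unsigned long count_mismatches(unsigned int limit)
{
    unsigned long bad = 0;

    for (unsigned int a = 1; a < limit; a += 2) {
        for (unsigned int m = 1; m < limit; m++) {
            if (divisible_ui_p(m, a) != (m % a == 0)) {
                bad++;
            }
        }
    }
    return bad;
}
```

Calling `count_mismatches(2000)` covers about two million pairs and should report no disagreements under the stated assumptions.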

and the arrays

unsigned int A[100000/2];
unsigned int M[100000-1];

for (unsigned int a = 1; a < 100000; a += 2) {
    A[a/2] = a;
}
for (unsigned int m = 1; m < 100000; m += 1) {
    M[m-1] = m;
}

which are either left in order or shuffled using the shuffle function.
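The shuffle function itself is not shown here. A minimal stand-in, assuming an in-place Fisher-Yates shuffle driven by `rand()`, could look like the sketch below; the name `shuffle_uint` is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shuffle function: in-place Fisher-Yates.
 * Assumes RAND_MAX is at least n - 1; otherwise high indices are never
 * picked as swap partners. The modulo introduces a slight bias, which
 * should not matter for a benchmark. */
static void shuffle_uint(unsigned int *v, size_t n)
{
    if (n < 2) {
        return;
    }
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);   /* 0 <= j <= i */
        unsigned int tmp = v[i];
        v[i] = v[j];
        v[j] = tmp;
    }
}
```

With the arrays above, this would be called as `shuffle_uint(A, sizeof A / sizeof A[0]);` and likewise for `M`, after seeding with `srand` if a different order per run is wanted.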

Without shuffling, divisible_ui_p still comes out ahead:

| implementation     | user time [s] |
|--------------------|---------------|
| divisible_ui_p     |          8.56 |
| builtin % operator |         17.59 |

However, once I shuffle these arrays, the results are different

| implementation     | user time [s] |
|--------------------|---------------|
| divisible_ui_p     |         31.34 |
| builtin % operator |         17.53 |