Fast divisibility tests (by 2,3,4,5,.., 16)?

In every case (including divisible by 2):

if (number % n == 0) do();

Anding with a mask of low order bits is just obfuscation, and with a modern compiler will not be any faster than writing the code in a readable fashion.

If you have to test all of the cases, you might improve performance by putting some of the cases in the if for another: there's no point it testing for divisibility by 4 if divisibility by 2 has already failed, for example.

It is not a bad idea AT ALL to figure out alternatives to division instructions (which includes modulo on x86/x64) because they are very slow. Slower (or even much slower) than most people realize. Those suggesting "% n" where n is a variable are giving foolish advice because it will invariably lead to the use of the division instruction. On the other hand "% c" (where c is a constant) will allow the compiler to determine the best algorithm available in its repertoire. Sometimes it will be the division instruction but a lot of the time it won't.

In this document Torbjörn Granlund shows that the ratio of clock cycles required for unsigned 32-bit mults:divs is 4:26 (6.5x) on Sandybridge and 3:45 (15x) on K10. for 64-bit the respective ratios are 4:92 (23x) and 5:77 (14.4x).

The "L" columns denote latency. "T" columns denote throughput. This has to do with the processor's ability to handle multiple instructions in parallell. Sandybridge can issue one 32-bit multiplication every other cycle or one 64-bit every cycle. For K10 the corresponding throughput is reversed. For divisions the K10 needs to complete the entire sequence before it may begin another. I suspect it is the same for Sandybridge.

Using the K10 as an example it means that during the cycles required for a 32-bit division (45) the same number (45) of multiplications can be issued and the next-to-last and last one of these will complete one and two clock cycles after the division has completed. A LOT of work can be performed in 45 multiplications.

It is also interesting to note that divs have become less efficient with the evolution from K8-K9 to K10: from 39 to 45 and 71 to 77 clock cycles for 32- and 64-bit.

Granlund's page at gmplib.org and at the Royal Institute of Technology in Stockholm contain more goodies, some of which have been incorporated into the gcc compiler.

As @James mentioned, let the compiler simplify it for you. If n is a constant, any descent compiler is able to recognize the pattern and change it to a more efficient equivalent.

For example, the code

#include <stdio.h>

int main() {
    size_t x;
    scanf("%u\n", &x);
    __asm__ volatile ("nop;nop;nop;nop;nop;");
    const char* volatile foo = (x%3 == 0) ? "yes" : "no";
    __asm__ volatile ("nop;nop;nop;nop;nop;");
    printf("%s\n", foo);
    return 0;
}

compiled with g++-4.5 -O3, the relevant part of x%3 == 0 will become

mov    rcx,QWORD PTR [rbp-0x8]   # rbp-0x8 = &x
mov    rdx,0xaaaaaaaaaaaaaaab
mov    rax,rcx
mul    rdx
lea    rax,"yes"
shr    rdx,1
lea    rdx,[rdx+rdx*2]
cmp    rcx,rdx
lea    rdx,"no"
cmovne rax,rdx
mov    QWORD PTR [rbp-0x10],rax

which, translated back to C code, means

(hi64bit(x * 0xaaaaaaaaaaaaaaab) / 2) * 3 == x ? "yes" : "no"
// equivalatent to:                 x % 3 == 0 ? "yes" : "no"

no division involved here. (Note that 0xaaaaaaaaaaaaaaab == 0x20000000000000001L/3)

Edit:

The magic constant 0xaaaaaaaaaaaaaaab can be computed in http://www.hackersdelight.org/magic.htm
For divisors of the form 2ⁿ - 1, check http://graphics.stanford.edu/~seander/bithacks.html#ModulusDivision

Fast divisibility tests (by 2,3,4,5,.., 16)?

Tags:

C++

C

Math

Bit Manipulation

Assembly

Related

Recent Posts