Higher part of multiply and division in C or C++?

You don't deal with the implementation details in C or C++. That's the whole point. If you want the most significant bytes, simply use the language. Right shift >> is designed to do that. Something like:

uint64_t i;
uint32_t a;
uint32_t b;
// input a, b and set i to a * b
// widen before multiplying so the product is computed in 64 bits
// (thanks to @nnn, pls see comment below):
i = a; i *= b;
uint64_t msb = i >> 32;
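
To illustrate why the widening step matters, here is a minimal sketch. The function names mulhi_u32 and mulhi_u32_truncated are made up for this example, and it assumes the usual case where int is 32 bits wide:

```c
#include <stdint.h>

/* Hypothetical helper: high 32 bits of the full 64-bit product. */
uint32_t mulhi_u32(uint32_t a, uint32_t b) {
    uint64_t i = a;   /* widen first */
    i *= b;           /* the multiply now happens in 64 bits */
    return (uint32_t)(i >> 32);
}

/* Hypothetical counter-example: assuming int is 32 bits, a * b is
   evaluated in 32 bits and wraps before it is stored in i, so the
   high half is already lost. */
uint32_t mulhi_u32_truncated(uint32_t a, uint32_t b) {
    uint64_t i = a * b;
    return (uint32_t)(i >> 32);
}
```

With a = b = 0xFFFFFFFF the first function returns 0xFFFFFFFE, while the second one returns 0 on such a platform.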

You can do it easily in C this way:

#include <stdint.h>

uint32_t a, b;  // input
uint64_t val = (uint64_t)a * b;
uint32_t high = val >> 32, low = val;

Leave it to the compiler to produce the best possible code. Modern optimizers are really good at it. Hand-coded assembly often looks better but performs worse.

As commented by Pete Becker, the above relies on the availability of the types uint32_t and uint64_t. If you insist on die-hard portability (say you are programming on a DS9K), you may instead use the types uint_least32_t and uint_least64_t, or uint_fast32_t and uint_fast64_t, which are always available under C99. You then need an extra mask, which will be optimized out if not required:

#include <stdint.h>

uint_fast32_t a, b;  // input
uint_fast64_t val = (uint_fast64_t)a * b;
uint_fast32_t high = (val >> 32) & 0xFFFFFFFF, low = val & 0xFFFFFFFF;

Regarding division, you can use the standard library functions div, ldiv, or lldiv (the last one added in C99) to perform signed division and remainder operations in one call. The division/modulo combination will be implemented in one operation if possible on the target architecture for the specific operand types.
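
As a small sketch of the library route: the helper name min_sec below is hypothetical, while lldiv and lldiv_t come from <stdlib.h>. The quotient truncates toward zero, so the remainder takes the sign of the numerator:

```c
#include <stdlib.h>

/* Hypothetical helper: split a number of seconds into whole minutes
   (quot) and leftover seconds (rem) with a single lldiv call. */
lldiv_t min_sec(long long total_seconds) {
    return lldiv(total_seconds, 60);
}
```

For example, min_sec(125) gives quot 2 and rem 5, and min_sec(-125) gives quot -2 and rem -5.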

It may be more efficient to write both expressions and rely on the compiler to detect the pattern and produce code that uses a single IDIV opcode:

struct divmod_t { int quo, rem; };
struct divmod_t divmod(int num, int denom) {
    struct divmod_t r = { num / denom, num % denom };
    return r;
}

Testing on Matt Godbolt's compiler explorer shows both clang and gcc generate a single idiv instruction for this code at -O3.

You can turn one of these divisions into a multiplication:

struct divmod_t { int quo, rem; };
struct divmod_t divmod2(int num, int denom) {
    struct divmod_t r;
    r.quo = num / denom;
    r.rem = num - r.quo * denom;
    return r;
}

Note that the above functions do not check for the two cases that result in undefined behavior: denom == 0 (division by zero), and num == INT_MIN with denom == -1 (the quotient overflows).
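
If you want to guard against both cases, something along these lines should do. This is only a sketch: checked_divmod is a name made up here, and reporting failure through a bool return value is just one possible design:

```c
#include <limits.h>
#include <stdbool.h>

struct divmod_t { int quo, rem; };

/* Hypothetical checked variant: returns false instead of invoking
   undefined behavior; *out is written only on success. */
bool checked_divmod(int num, int denom, struct divmod_t *out) {
    if (denom == 0)
        return false;                      /* division by zero */
    if (num == INT_MIN && denom == -1)
        return false;                      /* quotient would overflow */
    out->quo = num / denom;
    out->rem = num % denom;
    return true;
}
```

The two checks are cheap compared to the division itself, so the cost of the guarded version is usually small.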

For division, a fully portable solution uses one of the library functions div, ldiv, or lldiv.

For multiplication, only Forth among widely known languages (higher than assembler) has an explicit multiplication of N*N bits to a 2N-bit result (the words M*, UM*). C, Fortran, etc. don't have it. Yes, this sometimes leads to missed optimizations. For example, on x86_32, getting a 64-bit product requires either converting an operand to 64 bits (which can cause a library call instead of a mul instruction), or explicit inline assembly (simple and efficient in gcc and clones, but not always in MSVC and other compilers).

In my tests on x86_32 (i386), a modern compiler is able to convert code like

#include <stdint.h>
int64_t mm(int32_t x, int32_t y) {
  return (int64_t) x * y;
}

to a single "imull" instruction without a library call; clang 3.4 (-O1 or higher) and gcc 4.8 (-O2 or higher) do this, and I expect this will remain the case. (At lower optimization levels, a second useless multiplication is added.) But one can't guarantee this for any other compiler without a real test. With gcc on x86, the following will work even without optimization:

int64_t mm(int32_t x, int32_t y) {
  int64_t r;
  asm("imull %[s]" : "=A" (r): "a" (x), [s] "bcdSD" (y): "cc");
  return r;
}

The same trend, with similar instructions, holds for nearly all modern CPUs.
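
When the operands are already the widest standard integer type, there is nothing wider to cast to. A portable fallback is to split each operand into 32-bit halves and add up the partial products. This is a sketch of that schoolbook method, not what any particular compiler emits, and mulhi_u64 is a name made up for the example:

```c
#include <stdint.h>

/* Hypothetical helper: high 64 bits of the 128-bit product a * b,
   using only 64-bit arithmetic. */
uint64_t mulhi_u64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits 0..63   */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits 32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits 32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits 64..127 */

    /* Carry out of bits 32..63; the sum of three values below 2^32
       cannot overflow 64 bits. */
    uint64_t carry = ((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32;

    return p3 + (p1 >> 32) + (p2 >> 32) + carry;
}
```

On targets with a native widening multiply, a compiler-specific wide type or intrinsic will normally be faster; the version above only relies on uint64_t.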

For division (like a 64-bit dividend by a 32-bit divisor to a 32-bit quotient and remainder), this is more complicated. There are library functions like lldiv, but they are only for signed division; there are no unsigned equivalents. Also, they are library calls, with all the respective cost. But the issue here is that many modern architectures don't have this kind of division. For example, it's explicitly excluded from ARM64 and RISC-V. For them, one has to emulate long division using a shorter one (e.g. divide 2**(N-1) by a dividend, but then double the result and tune its remainder). For those having mixed-length division (x86, M68k, S/390, etc.), a one-line inline assembly snippet is rather good if you are sure it won't overflow :)
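
Since there is no unsigned counterpart of lldiv, the plain-C way is to write / and % on the unsigned types and let the compiler pick instructions or a runtime helper. A minimal sketch, where the struct and function names are made up here and the caller is assumed to guarantee denom != 0:

```c
#include <stdint.h>

struct udivmod_t { uint64_t quo; uint32_t rem; };

/* Hypothetical helper: unsigned 64-by-32 division. The quotient is
   kept in 64 bits, so it cannot overflow the way a hardware
   64/32 -> 32 divide can. Precondition: denom != 0. */
struct udivmod_t udivmod_64_32(uint64_t num, uint32_t denom) {
    struct udivmod_t r = { num / denom, (uint32_t)(num % denom) };
    return r;
}
```

The remainder always fits in 32 bits because it is smaller than denom.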

Some architectures lack division support altogether (older SPARC, Alpha), and it is the standard library's task to support such operations.

Anyway, the standard library provides all needed operations unless you require the highest precision (e.g. x86_64 can divide a 128-bit dividend by a 64-bit divisor, but this isn't supported by the C library).

I think the most elaborate and accessible example of these approaches for different architectures is the GMP library. It's much more advanced than your question needs, but you can dig out examples of division by a single limb for different architectures; it implements proper chaining even if the architecture doesn't support it directly. It will also suffice for most needs in arbitrary-length number arithmetic, albeit with some overhead.

NB: if you call a div-like instruction explicitly, it's your responsibility to check for overflows. It's trickier in the signed case than in the unsigned one; for example, division of -2147483648 by -1 crashes an x86-based program, even if written in C.

UPDATE[2020-07-04]: with the GCC integer overflow builtins, one can do a mixed-precision multiplication, like:

#include <stdint.h>
int64_t mm(int32_t x, int32_t y) {
  int64_t result;
  __builtin_mul_overflow(x, y, &result);
  return result;
}

This is translated by both GCC and Clang to optimal form in most cases. I hope other compilers and even standards will eventually adopt this.
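
The builtin also returns a bool telling whether the mathematically exact product failed to fit in the result type, which the snippet above ignores. A sketch of using that return value for a same-width multiply (mul_fits_i32 is a hypothetical name; the builtin itself is available in GCC 5 and later and in Clang):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: multiply two int32_t values and report whether
   the exact product fit in *out. On overflow, *out holds the wrapped
   result. */
bool mul_fits_i32(int32_t x, int32_t y, int32_t *out) {
    return !__builtin_mul_overflow(x, y, out);
}
```

For example, 46340 * 46340 fits in an int32_t, while 46341 * 46341 does not.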
