# Greatest Common Divisor

## Retina, 16

```
^(.+)\1* \1+$
$1
```

This doesn't use Euclid's algorithm at all - instead it finds the GCD using regex matching groups.

Try it online. - This example calculates GCD(8,12).

Input as 2 space-separated integers. Note that the I/O is in unary. If that is not acceptable, then we can do this:

## Retina, 30

```
\d+
$*
^(.+)\1* \1+$
$1
1+
$.&
```

Try it online.

As @MartinBüttner points out, this falls apart for large numbers (as is generally the case for anything unary). At very a minimum, an input of INT_MAX will require allocation of a 2GB string.

### i386 (x86-32) machine code, 8 bytes (9B for unsigned)

+1B if we need to handle `b = 0`

on input.

### amd64 (x86-64) machine code, 9 bytes (10B for unsigned, or ~~14B~~ 13B for 64b integers signed or unsigned)

~~10~~ 9B for unsigned on amd64 that breaks with either input = 0

Inputs are 32bit non-zero *signed* integers in `eax`

and `ecx`

. Output in `eax`

.

```
## 32bit code, signed integers: eax, ecx
08048420 <gcd0>:
8048420: 99 cdq ; shorter than xor edx,edx
8048421: f7 f9 idiv ecx
8048423: 92 xchg edx,eax ; there's a one-byte encoding for xchg eax,r32. So this is shorter but slower than a mov
8048424: 91 xchg ecx,eax ; eax = divisor(from ecx), ecx = remainder(from edx), edx = quotient(from eax) which we discard
; loop entry point if we need to handle ecx = 0
8048425: 41 inc ecx ; saves 1B vs. test/jnz in 32bit mode
8048426: e2 f8 loop 8048420 <gcd0>
08048428 <gcd0_end>:
; 8B total
; result in eax: gcd(a,0) = a
```

This loop structure fails the test-case where `ecx = 0`

. (`div`

causes a `#DE`

hardware execption on divide by zero. (On Linux, the kernel delivers a `SIGFPE`

(floating point exception)). If the loop entry point was right before the `inc`

, we'd avoid the problem. The x86-64 version can handle it for free, see below.

**Mike Shlanta's answer was the starting point for this**. My loop does the same thing as his, but for signed integers because `cdq`

is one byter shorter than `xor edx,edx`

. And yes, it does work correctly with one or both inputs negative. Mike's version will run faster and take less space in the uop cache (`xchg`

is 3 uops on Intel CPUs, and `loop`

is really slow on most CPUs), but this version wins at machine-code size.

I didn't notice at first that the question required **unsigned** 32bit. Going back to `xor edx,edx`

instead of `cdq`

would cost one byte. `div`

is the same size as `idiv`

, and everything else can stay the same (`xchg`

for data movement and `inc/loop`

still work.)

Interestingly, for 64bit operand-size (`rax`

and `rcx`

), signed and unsigned versions are the same size. The signed version needs a REX prefix for `cqo`

(2B), but the unsigned version can still use 2B `xor edx,edx`

.

In 64bit code, `inc ecx`

is 2B: the single-byte `inc r32`

and `dec r32`

opcodes were repurposed as REX prefixes. `inc/loop`

doesn't save any code-size in 64bit mode, so you might as well `test/jnz`

. Operating on 64bit integers adds another one byte per instruction in REX prefixes, except for `loop`

or `jnz`

. It's possible for the remainder to have all zeros in the low 32b (e.g. `gcd((2^32), (2^32 + 1))`

), so we need to test the whole rcx and can't save a byte with `test ecx,ecx`

. However, the slower `jrcxz`

insn is only 2B, **and we can put it at the top of the loop to handle ecx=0 on entry**:

```
## 64bit code, unsigned 64 integers: rax, rcx
0000000000400630 <gcd_u64>:
400630: e3 0b jrcxz 40063d <gcd_u64_end> ; handles rcx=0 on input, and smaller than test rcx,rcx/jnz
400632: 31 d2 xor edx,edx ; same length as cqo
400634: 48 f7 f1 div rcx ; REX prefixes needed on three insns
400637: 48 92 xchg rdx,rax
400639: 48 91 xchg rcx,rax
40063b: eb f3 jmp 400630 <gcd_u64>
000000000040063d <gcd_u64_end>:
## 0xD = 13 bytes of code
## result in rax: gcd(a,0) = a
```

Full runnable test program including a `main`

that runs `printf("...", gcd(atoi(argv[1]), atoi(argv[2])) );`

**source and asm output on the Godbolt Compiler Explorer**, for the 32 and 64b versions. Tested and working for 32bit (`-m32`

), 64bit (`-m64`

), and the x32 ABI (`-mx32`

).

Also included: a version using repeated subtraction only, which is 9B for unsigned, even for x86-64 mode, and can take one of its inputs in an arbitrary register. However, it can't handle either input being 0 on entry (it detect when `sub`

produces a zero, which x - 0 never does).

GNU C inline asm source for the 32bit version (compile with `gcc -m32 -masm=intel`

)

```
int gcd(int a, int b) {
asm (// ".intel_syntax noprefix\n"
// "jmp .Lentry%=\n" // Uncomment to handle div-by-zero, by entering the loop in the middle. Better: `jecxz / jmp` loop structure like the 64b version
".p2align 4\n" // align to make size-counting easier
"gcd0: cdq\n\t" // sign extend eax into edx:eax. One byte shorter than xor edx,edx
" idiv ecx\n"
" xchg eax, edx\n" // there's a one-byte encoding for xchg eax,r32. So this is shorter but slower than a mov
" xchg eax, ecx\n" // eax = divisor(ecx), ecx = remainder(edx), edx = garbage that we will clear later
".Lentry%=:\n"
" inc ecx\n" // saves 1B vs. test/jnz in 32bit mode, none in 64b mode
" loop gcd0\n"
"gcd0_end:\n"
: /* outputs */ "+a" (a), "+c"(b)
: /* inputs */ // given as read-write outputs
: /* clobbers */ "edx"
);
return a;
}
```

Normally I'd write a whole function in asm, but GNU C inline asm seems to be the best way to include a snippet which can have in/outputs in whatever regs we choose. As you can see, GNU C inline asm syntax makes asm ugly and noisy. It's also a really difficult way to *learn* asm.

It would actually compile and work in `.att_syntax noprefix`

mode, because all the insns used are either single/no operand or `xchg`

. Not really a useful observation.

### Hexagony, 17 bytes

```
?'?>}[email protected]<\=%)>{\.(
```

Unfolded:

```
? ' ?
> } ! @
< \ = % )
> { \ .
( . .
```

Try it online!

Fitting it into side-length 3 was a breeze. Shaving off those two bytes at the end wasn't... I'm also not convinced it's optimal, but I'm sure I think it's close.

### Explanation

Another Euclidean algorithm implementation.

The program uses three memory edges, which I'll call **A**, **B** and **C**, with the memory pointer (MP) starting out as shown:

Here is the control flow diagram:

Control flow starts on the grey path with a short linear bit for input:

```
? Read first integer into memory edge A.
' Move MP backwards onto edge B.
? Read second integer into B.
```

Note that the code now wraps around the edges to the `<`

in the left corner. This `<`

acts as a branch. If the current edge is zero (i.e. the Euclidean algorithm terminates), the IP is deflected to the left and takes the red path. Otherwise, an iteration of the Euclidean algorithm is computed on the green path.

We'll first consider the green path. Note that `>`

and `\`

all acts as mirrors which simply deflect the instruction pointer. Also note that control flow wraps around the edges three times, once from the bottom to the top, once from the right corner to the bottom row and finally from the bottom right corner to the left corner to re-check the condition. Also note that `.`

are no-ops.

That leaves the following linear code for a single iteration:

```
{ Move MP forward onto edge C.
'} Move to A and back to C. Taken together this is a no-op.
= Reverse the direction of the MP so that it now points at A and B.
% Compute A % B and store it in C.
)( Increment, decrement. Taken together this is a no-op, but it's
necessary to ensure that IP wraps to the bottom row instead of
the top row.
```

Now we're back where we started, except that the three edges have changed their roles cyclically (the original **C** now takes the role of **B** and the original **B** the role of **A**...). In effect, we've relpaced inputs `A`

and `B`

with `B`

and `A % B`

, respectively.

Once `A % B`

(on edge **C**) is zero, the GCD can be found on edge **B**. Again the `>`

just deflects the IP, so on the red path we execute:

```
} Move MP to edge B.
! Print its value as an integer.
@ Terminate the program.
```