Greatest Common Divisor

Retina, 16

^(.+)\1* \1+$

This doesn't use Euclid's algorithm at all - instead it finds the GCD using regex matching groups.

Try it online. - This example calculates GCD(8,12).
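For readers without Retina at hand, here is a minimal Python sketch of what the regex does on unary input. The helper name `unary_gcd` is made up for illustration, and it assumes Python's `re` engine backtracks the same way for this pattern: the greedy group 1 ends up as the longest run of 1s that tiles both numbers exactly.

```python
import re

# The pattern from the answer, unchanged.
GCD_PATTERN = re.compile(r"^(.+)\1* \1+$")

def unary_gcd(a, b):
    """Hypothetical helper: GCD of two positive integers via the regex."""
    if a <= 0 or b <= 0:
        raise ValueError("unary input needs positive integers")
    unary = "1" * a + " " + "1" * b       # e.g. 8 and 12 -> "11111111 111111111111"
    match = GCD_PATTERN.match(unary)
    # Group 1 is greedy, so it is the longest block that divides both runs of 1s.
    return len(match.group(1))

print(unary_gcd(8, 12))  # prints 4
```

Because the input string has one character per unit, this is only practical for small numbers.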

Input as 2 space-separated integers. Note that the I/O is in unary. If that is not acceptable, then we can do this:

Retina, 30

^(.+)\1* \1+$

Try it online.

As @MartinBüttner points out, this falls apart for large numbers (as is generally the case for anything unary). At the very minimum, an input of INT_MAX will require allocating a 2GB string.

i386 (x86-32) machine code, 8 bytes (9B for unsigned)

+1B if we need to handle b = 0 on input.

amd64 (x86-64) machine code, 9 bytes (10B for unsigned, or 13B for 64b integers signed or unsigned)

9B for unsigned on amd64 that breaks with either input = 0

Inputs are 32bit non-zero signed integers in eax and ecx. Output in eax.

## 32bit code, signed integers:  eax, ecx
08048420 <gcd0>:
 8048420:       99                      cdq               ; shorter than xor edx,edx
 8048421:       f7 f9                   idiv   ecx
 8048423:       92                      xchg   edx,eax    ; there's a one-byte encoding for xchg eax,r32.  So this is shorter but slower than a mov
 8048424:       91                      xchg   ecx,eax    ; eax = divisor(from ecx), ecx = remainder(from edx), edx = quotient(from eax) which we discard
    ; loop entry point if we need to handle ecx = 0
 8048425:       41                      inc    ecx        ; saves 1B vs. test/jnz in 32bit mode
 8048426:       e2 f8                   loop   8048420 <gcd0>
08048428 <gcd0_end>:
 ; 8B total
 ; result in eax: gcd(a,0) = a

This loop structure fails the test case where ecx = 0: div/idiv raises a #DE hardware exception on divide by zero, which the Linux kernel delivers as SIGFPE (floating point exception). If the loop entry point were right before the inc, we'd avoid the problem. The x86-64 version can handle it for free, see below.

Mike Shlanta's answer was the starting point for this. My loop does the same thing as his, but for signed integers because cdq is one byte shorter than xor edx,edx. And yes, it does work correctly with one or both inputs negative. Mike's version will run faster and take less space in the uop cache (xchg is 3 uops on Intel CPUs, and loop is really slow on most CPUs), but this version wins at machine-code size.
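To follow the register shuffling without an assembler, here is a rough Python model of the gcd0 loop. The function names are invented for this sketch, and it uses unbounded Python integers, so it does not model 32-bit overflow (such as INT_MIN / -1). Division truncates toward zero the way idiv does, which differs from Python's own `//` and `%`.

```python
def idiv(dividend, divisor):
    """Truncating signed division like x86 idiv: returns (quotient, remainder).

    The remainder takes the sign of the dividend. A zero divisor raises
    ZeroDivisionError, standing in for the #DE exception.
    """
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    return quotient, dividend - quotient * divisor

def gcd0_model(eax, ecx):
    """Hypothetical model of the gcd0 loop; ecx must be non-zero on entry."""
    while True:
        eax, edx = idiv(eax, ecx)   # cdq / idiv ecx: eax = quotient, edx = remainder
        eax, edx = edx, eax         # xchg edx,eax
        eax, ecx = ecx, eax         # xchg ecx,eax: eax = old divisor, ecx = remainder
        if ecx == 0:                # inc ecx / loop: falls through only when ecx == 0
            return eax

print(gcd0_model(8, 12))    # prints 4
print(gcd0_model(8, -12))   # prints -4
```

In this model the result can come out negated when an input is negative, so its magnitude is what carries the GCD.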

I didn't notice at first that the question required unsigned 32bit. Going back to xor edx,edx instead of cdq would cost one byte. div is the same size as idiv, and everything else can stay the same (xchg for data movement and inc/loop still work.)

Interestingly, for 64bit operand-size (rax and rcx), signed and unsigned versions are the same size. The signed version needs a REX prefix for cqo (2B), but the unsigned version can still use 2B xor edx,edx.

In 64bit code, inc ecx is 2B: the single-byte inc r32 and dec r32 opcodes were repurposed as REX prefixes. inc/loop doesn't save any code-size in 64bit mode, so you might as well test/jnz. Operating on 64bit integers adds one more byte per instruction in REX prefixes, except for loop or jnz. It's possible for the remainder to have all zeros in the low 32b (e.g. gcd((2^32), (2^32 + 1))), so we need to test the whole rcx and can't save a byte with test ecx,ecx. However, the slower jrcxz insn is only 2B, and we can put it at the top of the loop to handle rcx=0 on entry:

## 64bit code, unsigned 64 integers:  rax, rcx
0000000000400630 <gcd_u64>:
  400630:       e3 0b                   jrcxz  40063d <gcd_u64_end>   ; handles rcx=0 on input, and smaller than test rcx,rcx/jnz
  400632:       31 d2                   xor    edx,edx                ; same length as cqo
  400634:       48 f7 f1                div    rcx                      ; REX prefixes needed on three insns
  400637:       48 92                   xchg   rdx,rax
  400639:       48 91                   xchg   rcx,rax
  40063b:       eb f3                   jmp    400630 <gcd_u64>
000000000040063d <gcd_u64_end>:
## 0xD = 13 bytes of code
## result in rax: gcd(a,0) = a
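The point about needing to test all of rcx can be checked with a small Python model of the loop above. This is only a sketch: `gcd_u64_model` and its `test_low32_only` flag are hypothetical names, and the flag mimics what would happen if the exit test looked at ecx alone.

```python
MASK32 = 0xFFFFFFFF

def gcd_u64_model(rax, rcx, test_low32_only=False):
    """Hypothetical model of the jrcxz loop for non-negative inputs.

    With test_low32_only=True the exit test inspects only the low 32 bits,
    imitating a (wrong) `test ecx,ecx` in place of a full 64-bit test.
    """
    while True:
        probe = rcx & MASK32 if test_low32_only else rcx
        if probe == 0:                  # jrcxz: leave the loop, result in rax
            return rax
        rax, rcx = rcx, rax % rcx       # div rcx, then the two xchg instructions

a, b = 2**32, 2**32 + 1
print(gcd_u64_model(a, b))                        # prints 1
print(gcd_u64_model(a, b, test_low32_only=True))  # prints 4294967297
```

The second call stops one step too early because the remainder 2^32 has all-zero low 32 bits, which is the case described above.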

Full runnable test program including a main that runs printf("...", gcd(atoi(argv[1]), atoi(argv[2])) ); source and asm output on the Godbolt Compiler Explorer, for the 32 and 64b versions. Tested and working for 32bit (-m32), 64bit (-m64), and the x32 ABI (-mx32).

Also included: a version using repeated subtraction only, which is 9B for unsigned, even for x86-64 mode, and can take one of its inputs in an arbitrary register. However, it can't handle either input being 0 on entry (it detects when sub produces a zero, which x - 0 never does).
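The subtraction-only machine code itself is not reproduced here, so the following Python sketch is an assumption about the general idea rather than a transcription of it: keep subtracting the smaller value from the larger and stop when a subtraction produces zero. The name `gcd_sub` is made up for illustration.

```python
def gcd_sub(a, b):
    """Hypothetical subtraction-only GCD for positive integers."""
    if a <= 0 or b <= 0:
        # With a zero input, x - 0 never becomes zero and the loop would not end.
        raise ValueError("both inputs must be positive")
    while True:
        if a < b:
            a, b = b, a     # keep the larger value in a
        a -= b
        if a == 0:          # the subtraction produced zero: b holds the GCD
            return b

print(gcd_sub(12, 8))  # prints 4
```

The explicit check at the top stands in for the limitation mentioned above; the machine-code version has no room for such a guard.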

GNU C inline asm source for the 32bit version (compile with gcc -m32 -masm=intel)

int gcd(int a, int b) {
    asm (// ".intel_syntax noprefix\n"
        // "jmp  .Lentry%=\n" // Uncomment to handle div-by-zero, by entering the loop in the middle.  Better: `jecxz / jmp` loop structure like the 64b version
        ".p2align 4\n"                  // align to make size-counting easier
         "gcd0:   cdq\n\t"              // sign extend eax into edx:eax.  One byte shorter than xor edx,edx
         "        idiv    ecx\n"
         "        xchg    eax, edx\n"   // there's a one-byte encoding for xchg eax,r32.  So this is shorter but slower than a mov
         "        xchg    eax, ecx\n"   // eax = divisor(ecx), ecx = remainder(edx), edx = garbage that we will clear later
         "        inc     ecx\n"        // saves 1B vs. test/jnz in 32bit mode, none in 64b mode
         "        loop    gcd0\n"
         : /* outputs */  "+a" (a), "+c"(b)
         : /* inputs */   // given as read-write outputs
         : /* clobbers */ "edx"
    );                                  // close the asm statement
    return a;
}

Normally I'd write a whole function in asm, but GNU C inline asm seems to be the best way to include a snippet which can have in/outputs in whatever regs we choose. As you can see, GNU C inline asm syntax makes asm ugly and noisy. It's also a really difficult way to learn asm.

It would actually compile and work in .att_syntax noprefix mode, because all the insns used are either single/no operand or xchg. Not really a useful observation.

Hexagony, 17 bytes

?'?>}!@<\=%)>{\.(


  ? ' ?
 > } ! @
< \ = % )
 > { \ .
  ( . .

Try it online!

Fitting it into side-length 3 was a breeze. Shaving off those two bytes at the end wasn't... I'm also not convinced it's optimal, but I think it's close.


Another Euclidean algorithm implementation.

The program uses three memory edges, which I'll call A, B and C, with the memory pointer (MP) starting out as shown:

(Image: the memory edges A, B and C with the initial MP position.)

Here is the control flow diagram:

(Image: control flow diagram with the grey, green and red paths.)

Control flow starts on the grey path with a short linear bit for input:

?    Read first integer into memory edge A.
'    Move MP backwards onto edge B.
?    Read second integer into B.

Note that the code now wraps around the edges to the < in the left corner. This < acts as a branch. If the current edge is zero (i.e. the Euclidean algorithm terminates), the IP is deflected to the left and takes the red path. Otherwise, an iteration of the Euclidean algorithm is computed on the green path.

We'll first consider the green path. Note that > and \ all act as mirrors which simply deflect the instruction pointer. Also note that control flow wraps around the edges three times, once from the bottom to the top, once from the right corner to the bottom row and finally from the bottom right corner to the left corner to re-check the condition. Also note that . are no-ops.

That leaves the following linear code for a single iteration:

{    Move MP forward onto edge C.
'}   Move to A and back to C. Taken together this is a no-op.
=    Reverse the direction of the MP so that it now points at A and B. 
%    Compute A % B and store it in C.
)(   Increment, decrement. Taken together this is a no-op, but it's
     necessary to ensure that IP wraps to the bottom row instead of
     the top row.

Now we're back where we started, except that the three edges have changed their roles cyclically (the original C now takes the role of B and the original B the role of A...). In effect, we've replaced inputs A and B with B and A % B, respectively.

Once A % B (on edge C) is zero, the GCD can be found on edge B. Again the > just deflects the IP, so on the red path we execute:

}    Move MP to edge B.
!    Print its value as an integer.
@    Terminate the program.
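The cyclic role change of the three edges can be mimicked with a three-element list. This Python sketch models only the data flow for non-negative inputs, not Hexagony's actual memory grid or instruction pointer; the function name and the index arithmetic are illustrative assumptions, and its behaviour for a zero second input has not been checked against the real program.

```python
def gcd_three_edges(a, b):
    """Hypothetical model: three edges whose roles (A, B, C) rotate each iteration."""
    edges = [a, b, 0]
    cur = 1                              # index of the edge currently playing role B
    while edges[cur] != 0:               # the '<' branch: zero means take the red path
        prev = (cur - 1) % 3             # edge playing role A
        nxt = (cur + 1) % 3              # edge playing role C
        edges[nxt] = edges[prev] % edges[cur]   # '%': store A % B in C
        cur = nxt                        # old C becomes the new B, old B the new A
    return edges[(cur - 1) % 3]          # red path: step back one edge and print it

print(gcd_three_edges(8, 12))  # prints 4
```

No value is ever copied between edges; only the index that says which edge plays which role moves, which mirrors how the MP walks around the three edges.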