Compute the Adler-32 checksum

Mathematica, 46 bytes

{1,4^8}.Fold[##+{0,#&@@#}&,{1,0},#]~Mod~65521&

An anonymous function that takes an integer array and returns the Adler-32, with some improvements from miles and Martin (see comments).

miles' is also 46 bytes, but faster:

{1,4^8}.{Tr@#+1,Tr[Accumulate@#+1]}~Mod~65521&

Julia, 73 46 bytes

x->[sum(x)+1;sum(cumsum(x)+1)]%65521⋅[1;4^8]

This is an anonymous function that accepts an array and returns an integer. To call it, assign it to a variable.

We combine sum(x) + 1 and sum(cumsum(x) + 1) into an array, where x is the input array, and take each modulo 65521. We then compute the dot product with 1 and 48, which gives us (sum(x) + 1) + 4^8 * sum(cumsum(x) + 1), which is exactly the Adler-32 formula.

Try it online! (Includes all test cases)

Saved 27 bytes thanks to Sp3000 and Dennis!


x86-64 machine code function: 33 32 bytes (or 31 30 bytes with an int[] input instead of char[])

x86-32 machine code function: 31 bytes

As a GNU C inline-asm code fragment: saves 2B 1B (just the ret insn).

Commented source and test driver on github

The 64bit version is callable directly from C with the standard System V x86-64 ABI (using 2 dummy args to get args in the regs I want). Custom calling conventions are not uncommon for asm code, so this is a bonus feature.

32bit machine code saves 1B, because merging the high and low halves with push16/push16 => pop32 only works in 32bit mode. A 32bit function would need a custom calling convention. We shouldn't hold that against it, but calling from C needs a wrapper function.

After processing 4096 ~ (ASCII 126) bytes, high = 0x3f040000, low = 0x7e001. So high's most-significant bit isn't set yet. My code takes advantage of this, sign-extending eax into edx:eax with cdq as a way of zeroing edx.

# See the NASM source below
0000000000401120 <golfed_adler32_amd64>:
  401120:       31 c0                   xor    eax,eax
  401122:       99                      cdq    
  401123:       8d 7a 01                lea    edi,[rdx+0x1]
0000000000401126 <golfed_adler32_amd64.byteloop>:
  401126:       ac                      lods   al,BYTE PTR ds:[rsi]
  401127:       01 c7                   add    edi,eax
  401129:       01 fa                   add    edx,edi
  40112b:       e2 f9                   loop   401126 <golfed_adler32_amd64.byteloop>
000000000040112d <golfed_adler32_amd64.end>:
  40112d:       66 b9 f1 ff             mov    cx,0xfff1
  401131:       92                      xchg   edx,eax
  401132:       99                      cdq    
  401133:       f7 f1                   div    ecx
  401135:       52                      push   rdx
  401136:       97                      xchg   edi,eax
  401137:       99                      cdq    
  401138:       f7 f1                   div    ecx
  40113a:       66 52                   push   dx      # this is the diff from last version: evil push/pop instead of shift/add
  40113c:       58                      pop    rax
  40113d:       66 5a                   pop    dx
  40113f:       c3                      ret    
0000000000401140 <golfed_adler32_amd64_end>:

0x40 - 0x20 = 32 bytes.


Commented NASM source:

tricks:

  • xchg eax, r32 is one byte; cheaper than mov. 8086 needed data in ax for a lot more stuff than >= 386, so they decided to spend a lot of opcode-space on the now-rarely-used xchg ax, r16.

  • Mixing push64 and push16 for merging high and low into a single register saves reg-reg data movement instructions around two divs. The 32bit version of this trick works even better: push16 / push16 / pop32 is only 5B total, not 6.

Since we push/pop, this isn't safe for inline asm in the SysV amd64 ABI (with a red zone).

golfed_adler32_amd64_v3:   ; (int dummy, const char *buf, int dummy, uint64_t len)

    ;; args: len in rcx,  const char *buf in rsi
    ;; Without dummy args, (unsigned len, const char *buf),  mov ecx, edi is the obvious solution, costing 2 bytes

    xor     eax,eax         ; scratch reg for loading bytes
    cdq                     ; edx: high=0
    lea     edi, [rdx+1]    ; edi: low=1
    ;jrcxz  .end            ; We don't handle len=0.  unlike rep, loop only checks rcx after decrementing
.byteloop:
    lodsb                   ; upper 24b of eax stays zeroed (no partial-register stall on Intel P6/SnB-family CPUs, thanks to the xor-zeroing)
    add     edi, eax        ; low += zero_extend(buf[i])
    add     edx, edi        ; high += low
    loop   .byteloop
.end:
    ;; exit when ecx = 0, eax = last byte of buf
    ;; lodsb at this point would load the terminating 0 byte, conveniently leaving eax=0

    mov     cx, 65521       ; ecx = m = adler32 magic constant.  (upper 16b of ecx is zero from the loop exit condition.  This saves 1B over mov r32,imm32)
    ;sub    cx, (65536 - 65521) ; the immediate is small enough to use the imm8 encoding.  No saving over mov, though, since this needs a mod/rm byte

    xchg    eax, edx        ; eax = high,  edx = buf[last_byte]
    cdq                     ; could be removed if we could arrange things so the loop ended with a load of the 0 byte

    div     ecx             ; div instead of idiv to fault instead of returning wrong answers if high has overflowed to negative.  (-1234 % m is negative)
    push    rdx             ; push high%m and 6B of zero padding

    xchg    eax, edi        ; eax=low
    cdq
    div     ecx             ; edx = low%m

    ;; concatenate the two 16bit halves of the result by putting them in contiguous memory
    push    dx              ; push low%m with no padding
    pop     rax             ; pop  high%m << 16 | low%m   (x86 is little-endian)

    pop     dx              ; add rsp, 2 to restore the stack pointer

    ;; outside of 16bit code, we can't justify returning the result in the dx:ax register pair
    ret
golfed_adler32_amd64_end_v3:

I also considered using rcx as an array index, instead of having two loop counters, but adler32(s) != adler32(reverse(s)). So we couldn't use loop. Counting from -len up towards zero and using movzx r32, [rsi+rcx] just uses way too many bytes.

If we want to consider incrementing the pointer ourself, 32bit code is probably the way to go. Even the x32 ABI (32bit pointers) isn't sufficient, because inc esi is 2B on amd64, but 1B on i386. It appears hard to beat xor eax,eax / lodsb / loop: 4B total to get each element in turn zero-extended into eax. inc esi / movzx r32, byte [esi] / loop is 5B.

scas is another option for incrementing a pointer with a 1B instruction in 64bit mode. (rdi/edi instead of rsi, so we'd take the pointer arg in rdi). We can't use the flag result from scas as a loop condition, though, because we don't want to keep eax zeroed. Different register allocation could maybe save a byte after the loop.


int[] input

The full function taking uint8_t[] is the "main" answer, because it's a more interesting challenge. Unpacking to int[] is an unreasonable thing to ask our caller to do in this language, but it does save 2B.

If we take our input as an unpacked array of 32bit integers, we can save one byte easily (use lodsd and replace xor eax,eax / cdq with just xor edx,edx).

We can save another byte by zeroing edx with lodsd/cdq, and re-arranging the loop so it loads the terminating 0 element before exiting. (We're still assuming it exists, even though this is an array of int, not a string).

; untested: I didn't modify the test driver to unpack strings for this
golfed_adler32_int_array:
    ; xor   edx,edx
    lodsd                   ; first element. only the low byte non-zero
    cdq                     ; edx: high=0
    lea     edi, [rdx+1]    ; edi: low=1
    ;jrcxz  .end            ; handle len=0?  unlike rep, loop only checks rcx after decrementing
.intloop:
    add     edi, eax        ; low += buf[i]
    add     edx, edi        ; high += low
    lodsd                   ; load buf[i+1] for next iteration
    loop   .intloop
.end:
    ;; exit when ecx = 0, eax = terminating 0

    xchg    eax, edx
    ;cdq               ; edx=0 already, ready for div
    ; same as the char version

I also made an untested version that uses scasd (1B version of add edi,4) and add eax, [rdi] instead of lodsd, but it's also 30 bytes. The savings from having high in eax at the end of the loop are balanced out by larger code elsewhere. It has the advantage of not depending on a terminating 0 element in the input, though, which is maybe unreasonable for an unpacked array where we're also given the length explicitly.


C++11 test driver

See the github link. This answer was getting too big, and the test driver got more features with bigger code.