User Appreciation Challenge #1: Dennis ♦
Jelly, 11 bytes
Try it online!
Of course a challenge appreciating Dennis must have an answer in a language of his.
05AB1E, 9 bytes
Try it online!
-2 thanks to Adnan.
This attacks Jelly's pain point exactly. It uses
A, 1-byte equivalents for Jelly's
x86 32-bit machine-code function,
42 41 bytes
Currently the shortest non-golfing-language answer, 1B shorter than @streetster's q/kdb+.
With 0 for truthy and non-zero for falsy:
41 40 bytes. (in general, saves 1 byte for 32-bit, 2 bytes for 64-bit).
With implicit-length strings (C-style 0-terminated):
45 44 bytes
x86-64 machine-code (with 32-bit pointers, like the x32 ABI):
44 43 bytes.
x86-64 with implicit-length strings, still 46 bytes (the shift/mask bitmap strategy is break-even now).
This is a function with C signature
_Bool dennis_like(size_t ecx, const char *esi). The calling convention is slightly non-standard, close to MS vectorcall/fastcall but with different arg registers: string in ESI and the length in ECX. It only clobbers its arg-regs and EDX. AL holds the return value, with the high bytes holding garbage (as allowed by the SysV x86 and x32 ABIs. IDK what MS's ABIs say about high-garbage when returning bool or narrow integers.)
Explanation of the algorithm:
Loop over the input string, filtering and classifying into a boolean array on the stack: For each byte, check if it's an alphabetic character (if not, continue to next char), and transform it to an integer from 0-25 (A-Z). Use that 0-25 integer to check a bitmap of vowel=0/consonant=1. (The bitmap is loaded into a register as a 32-bit immediate constant). Push 0 or 0xFF onto the stack according to the bitmap result (actually in the low byte of a 32-bit element, which may have garbage in the top 3 bytes).
The first loop produces an array of 0 or 0xFF (in dword elements padded with garbage). Do the usual palindrome check with a second loop that stops when pointers cross in the middle (or when they both point to the same element if there were an odd number of alphabetic characters). The upward-moving pointer is the stack pointer, and we use POP to load+increment. Instead of compare / setcc in this loop, we can just use XOR to detect same/different since there are only two possible values. We could accumulate (with OR) whether we found any non-matching elements, but an early-out branch on flags set by XOR is at least as good.
Notice that the second loop uses
byte operand-size, so it doesn't care what garbage the first loop leaves outside the low byte of each array element.
It uses the undocumented
salc instruction to set AL from CF, in the same way that
sbb al,al would. It's supported on every Intel CPU (except in 64-bit mode), even Knight's Landing! Agner Fog lists timings for it on all AMD CPUs as well (including Ryzen), so if x86 vendors insist on tying up that byte of opcode space ever since 8086, we might as well take advantage of it.
- unsigned-compare trick for a combined isalpha() and toupper(), and zero-extends the byte to fill eax, setting up for:
- immediate bitmap in a register for
bt, inspired by some nice compiler output for
- Creating a variable-sized array on the stack with push in a loop. (Standard for asm, but not something you can do with C for the implicit-length string version). It uses 4 bytes of stack space for every input character, but saves at least 1 byte vs. optimal golfing around
- Instead of cmp/setne on the boolean array, XOR booleans together to get a truth value directly. (
salcisn't an option, because
salconly works for CF, and 0xFF-0 doesn't set CF.
seteis 3 bytes, but would avoid the
incoutside the loop, for a net cost of 2 bytes (1 in 64-bit mode)) vs. xor in the loop and fixing it with inc.
; explicit-length version: input string in ESI, byte count in ECX 08048060 <dennis_like>: 8048060: 55 push ebp 8048061: 89 e5 mov ebp,esp ; a stack frame lets us restore esp with LEAVE (1B) 8048063: ba ee be ef 03 mov edx,0x3efbeee ; consonant bitmap 08048068 <dennis_like.filter_loop>: 8048068: ac lods al,BYTE PTR ds:[esi] 8048069: 24 5f and al,0x5f ; uppercase 804806b: 2c 41 sub al,0x41 ; range-shift to 0..25 804806d: 3c 19 cmp al,0x19 ; reject non-letters 804806f: 77 05 ja 8048076 <dennis_like.non_alpha> 8048071: 0f a3 c2 bt edx,eax # AL = 0..25 = position in alphabet 8048074: d6 SALC ; set AL=0 or 0xFF from carry. Undocumented insn, but widely supported 8048075: 50 push eax 08048076 <dennis_like.non_alpha>: 8048076: e2 f0 loop 8048068 <dennis_like.filter_loop> # ecx = remaining string bytes ; end of first loop 8048078: 89 ee mov esi,ebp ; ebp = one-past-the-top of the bool array 0804807a <dennis_like.palindrome_loop>: 804807a: 58 pop eax ; read from the bottom 804807b: 83 ee 04 sub esi,0x4 804807e: 32 06 xor al,BYTE PTR [esi] 8048080: 75 04 jne 8048086 <dennis_like.non_palindrome> 8048082: 39 e6 cmp esi,esp ; until the pointers meet or cross in the middle 8048084: 77 f4 ja 804807a <dennis_like.palindrome_loop> 08048086 <dennis_like.non_palindrome>: ; jump or fall-through to here with al holding an inverted boolean 8048086: 40 inc eax 8048087: c9 leave 8048088: c3 ret ;; 0x89 - 0x60 = 41 bytes
This is probably also one of the fastest answers, since none of the golfing really hurts too badly, at least for strings under a few thousand characters where the 4x memory usage doesn't cause a lot of cache-misses. (It might also lose to answers that take an early-out for non-Dennis-like strings before looping over all the characters.)
salc is slower than
setcc on many CPUs (e.g. 3 uops vs. 1 on Skylake), but a bitmap check with
bt/salc is still faster than a string-search or regex-match. And there's no startup overhead, so it's extremely cheap for short strings.
Doing it in one pass on the fly would mean repeating the classification code for the up and down directions. That would be faster but larger code-size. (Of course if you want fast, you can do 16 or 32 chars at a time with SSE2 or AVX2, still using the compare trick by range-shifting to the bottom of the signed range).
Test program (for ia32 or x32 Linux) to call this function with a cmdline arg, and exit with status = return value.
strlen implementation from int80h.org.
; build with the same %define macros as the source below (so this uses 32-bit regs in 32-bit mode) global _start _start: ;%define PTRSIZE 4 ; true for x32 and 32-bit mode. mov esi, [rsp+4 + 4*1] ; esi = argv ;mov rsi, [rsp+8 + 8*1] ; rsi = argv ; For regular x86-64 (not x32) %if IMPLICIT_LENGTH == 0 ; strlen(esi) mov rdi, rsi mov rcx, -1 xor eax, eax repne scasb ; rcx = -strlen - 2 not rcx dec rcx %endif mov eax, 0xFFFFAEBB ; make sure the function works with garbage in EAX call dennis_like ;; use the 32-bit ABI _exit syscall, even in x32 code for simplicity mov ebx, eax mov eax, 1 int 0x80 ; _exit( dennis_like(argv) ) ;; movzx edi, al ; actually mov edi,eax is fine here, too ;; mov eax,231 ; 64-bit ABI exit_group( same thing ) ;; syscall
A 64-bit version of this function could use
sbb eax,eax, which is only 2 bytes instead of 3 for
setc al. It would also need an extra byte for
not at the end (because only 32-bit has 1-byte inc/dec r32). Using the x32 ABI (32-bit pointers in long mode), we can still avoid REX prefixes even though we copy and compare pointers.
setc [rdi] can write directly to memory, but reserving ECX bytes of stack space costs more code-size than that saves. (And we need to move through the output array.
[rdi+rcx] takes one extra byte for the addressing mode, but really we need a counter that doesn't update for filtered characters so it's going to be worse than that.)
This is the YASM/NASM source with
%if conditionals. It can be built with
-felf32 (32-bit code) or
-felfx32 (64-bit code with the x32 ABI), and with implicit or explicit length. I've tested all 4 versions. See this answer for a script to build a static binary from NASM/YASM source.
To test the 64-bit version on a machine without support for the x32 ABI, you can change the pointer regs to 64-bit. (Then simply subtract the number of REX.W=1 prefixes (0x48 bytes) from the count. In this case, 4 instructions need REX prefixes to operate on 64-bit regs). Or simply call it with the
rsp and the input pointer in the low 4G of address space.
%define IMPLICIT_LENGTH 0 ; This source can be built as x32, or as plain old 32-bit mode ; x32 needs to push 64-bit regs, and using them in addressing modes avoids address-size prefixes ; 32-bit code needs to use the 32-bit names everywhere ;%if __BITS__ != 32 ; NASM-only %ifidn __OUTPUT_FORMAT__, elfx32 %define CPUMODE 64 %define STACKWIDTH 8 ; push / pop 8 bytes %else %define CPUMODE 32 %define STACKWIDTH 4 ; push / pop 4 bytes %define rax eax %define rcx ecx %define rsi esi %define rdi edi %define rbp ebp %define rsp esp %endif ; A regular x86-64 version needs 4 REX prefixes to handle 64-bit pointers ; I haven't cluttered the source with that, but I guess stuff like %define ebp rbp would do the trick. ;; Calling convention similar to SysV x32, or to MS vectorcall, but with different arg regs ;; _Bool dennis_like_implicit(const char *esi) ;; _Bool dennis_like_explicit(size_t ecx, const char *esi) global dennis_like dennis_like: ; We want to restore esp later, so make a stack frame for LEAVE push rbp mov ebp, esp ; enter 0,0 is 4 bytes. Only saves bytes if we had a fixed-size allocation to do. ; ZYXWVUTSRQPONMLKJIHGFEDCBA mov edx, 11111011111011111011101110b ; consonant/vowel bitmap for use with bt ;;; assume that len >= 1 %if IMPLICIT_LENGTH lodsb ; pipelining the loop is 1B shorter than jmp .non_alpha .filter_loop: %else .filter_loop: lodsb %endif and al, 0x7F ^ 0x20 ; force ASCII to uppercase. sub al, 'A' ; range-shift to 'A' = 0 cmp al, 'Z'-'A' ; if al was less than 'A', it will be a large unsigned number ja .non_alpha ;; AL = position in alphabet (0-25) bt edx, eax ; 3B %if CPUMODE == 32 salc ; 1B only sets AL = 0 or 0xFF. Not available in 64-bit mode %else sbb eax, eax ; 2B eax = 0 or -1, according to CF. %endif push rax .non_alpha: %if IMPLICIT_LENGTH lodsb test al,al jnz .filter_loop %else loop .filter_loop %endif ; al = potentially garbage if the last char was non-alpha ; esp = bottom of bool array mov esi, ebp ; ebp = one-past-the-top of the bool array .palindrome_loop: pop rax sub esi, STACKWIDTH xor al, [rsi] ; al = (arr[up] != arr[--down]). 8-bit operand-size so flags are set from the non-garbage jnz .non_palindrome cmp esi, esp ja .palindrome_loop .non_palindrome: ; we jump here with al=1 if we found a difference, or drop out of the loop with al=0 for no diff inc eax ;; AL transforms 0 -> 1 or 0xFF -> 0. leave ret ; return value in AL. high bytes of EAX are allowed to contain garbage.
I looked at messing around with DF (the direction flag that controls
scasd and so on), but it just didn't seem to be a win. The usual ABIs require that DF is cleared on function entry and exit. Assuming cleared on entry but leaving it set on exit would be cheating, IMO. It would be nice to use LODSD / SCASD to avoid the 3-byte
sub esi, 4, especially in the case where there's no high-garbage.
Alternate bitmap strategy (for x86-64 implicit-length strings)
It turns out this doesn't save any bytes, because
bt r32,r32 still works with high garbage in the bit-index. It's just not documented the way
bt / sbb to get the bit into / out of CF, use a shift / mask to isolate the bit we want from the bitmap.
%if IMPLICIT_LENGTH && CPUMODE == 64 ; incompatible with LOOP for explicit-length, both need ECX. In that case, bt/sbb is best xchg eax, ecx mov eax, 11111011111011111011101110b ; not hoisted out of the loop shr eax, cl and al, 1 %else bt edx, eax sbb eax, eax %endif push rax
Since this produces 0/1 in AL at the end (instead of 0/0xFF), we can do the necessary inversion of the return value at the end of the function with
xor al, 1 (2B) instead of
dec eax (also 2B in x86-64) to still produce a proper
_Bool return value.
This used to save 1B for x86-64 with implicit-length strings, by avoiding the need to zero the high bytes of EAX. (I had been using
and eax, 0x7F ^ 0x20 to force to upper-case and zero the rest of eax with a 3-byte
and r32,imm8. But now I'm using the 2-byte immediate-with-AL encoding that most 8086 instructions have, like I was already doing for the
It loses to
salc in 32-bit mode, and explicit-length strings need ECX for the count so this doesn't work there either.
But then I realized that I was wrong:
bt edx, eax still works with high garbage in eax. It apparently masks the shift count the same way
shr r32, cl does (looking only at the low 5 bits of cl). This is different from
bt [mem], reg, which can access outside the memory referenced by the addressing-mode/size, treating it as a bitstring. (Crazy CISC...)
Intel's insn set ref manual doesn't document the masking, so maybe it's undocumented behaviour that Intel is preserving for now. (That kind of thing is not uncommon.
bsf dst, src with src=0 always leaves dst unmodified, even though it's documented to leave dst holding an undefined value in that case. AMD actually documents the src=0 behaviour.) I tested on Skylake and Core2, and the
bt version works with non-zero garbage in EAX outside of AL.
A neat trick here is using
xchg eax,ecx (1 byte) to get the count into CL. Unfortunately, BMI2
shrx eax, edx, eax is 5 bytes, vs. only 2 bytes for
shr eax, cl. Using
bextr needs a 2-byte
mov ah,1 (for the number of bits to extract), so it's again 5 + 2 bytes like SHRX + AND.
The source code has gotten pretty messy after adding
%if conditionals. Here's disassembly of x32 implicit-length strings (using the alternate strategy for the bitmap, so it's still 46 bytes).
The main difference from the explicit-length version is in the first loop. Notice how there's a
lods before it, and at the bottom, instead of just one at the top of the loop.
; 64-bit implicit-length version using the alternate bitmap strategy 00400060 <dennis_like>: 400060: 55 push rbp 400061: 89 e5 mov ebp,esp 400063: ac lods al,BYTE PTR ds:[rsi] 00400064 <dennis_like.filter_loop>: 400064: 24 5f and al,0x5f 400066: 2c 41 sub al,0x41 400068: 3c 19 cmp al,0x19 40006a: 77 0b ja 400077 <dennis_like.non_alpha> 40006c: 91 xchg ecx,eax 40006d: b8 ee be ef 03 mov eax,0x3efbeee ; inside the loop since SHR destroys it 400072: d3 e8 shr eax,cl 400074: 24 01 and al,0x1 400076: 50 push rax 00400077 <dennis_like.non_alpha>: 400077: ac lods al,BYTE PTR ds:[rsi] 400078: 84 c0 test al,al 40007a: 75 e8 jne 400064 <dennis_like.filter_loop> 40007c: 89 ee mov esi,ebp 0040007e <dennis_like.palindrome_loop>: 40007e: 58 pop rax 40007f: 83 ee 08 sub esi,0x8 400082: 32 06 xor al,BYTE PTR [rsi] 400084: 75 04 jne 40008a <dennis_like.non_palindrome> 400086: 39 e6 cmp esi,esp 400088: 77 f4 ja 40007e <dennis_like.palindrome_loop> 0040008a <dennis_like.non_palindrome>: 40008a: ff c8 dec eax ; invert the 0 / non-zero status of AL. xor al,1 works too, and produces a proper bool. 40008c: c9 leave 40008d: c3 ret 0x8e - 0x60 = 0x2e = 46 bytes