Basic Latin character name to character

IA-32 machine code, 161 160 122 bytes

Hexdump of the code:

33 c0 6b c0 59 0f b6 11 03 c2 b2 71 f6 f2 c1 e8
08 41 80 79 01 00 75 ea e8 39 00 00 00 08 2c 5e
4a bd a3 cd c5 90 09 46 04 06 14 40 3e 3d 5b 23
60 5e 3f 2d 31 32 29 25 2e 3c 7e 36 39 34 33 30
21 2f 26 7d 7c 2c 3b 7b 2a 37 5d 22 35 20 3a 28
5c 27 2b 38 5f 24 5a 3c 34 74 17 3c 1a 74 16 33
c9 86 c4 0f a3 0a 14 00 41 fe cc 75 f6 8a 44 02
0e c3 8a 01 c3 8a 01 04 20 c3

This code uses some hashing. By some brute-force search, I found that the following hash function can be applied to the bytes of the input string:

int x = 0;
while (s[1])
{
    x = (x * 89 + *s) % 113;
    ++s;
}

It multiplies x by 89, adds the next byte (ASCII-code), and takes a remainder modulo 113. It does this on all bytes of the input string except the last one, so e.g. LATIN CAPITAL LETTER A and LATIN CAPITAL LETTER X give the same hash code.

This hash function has no collisions, and the output is in the range 0...113 (actually, by luck, the range is even narrower: 3...108).

The hash values of all relevant strings don't fill that space completely, so I decided to use this to compress the hash table. I added a "skip" table (112 bits), which contains 0 if the corresponding place in the hash table is empty, and 1 otherwise. This table converts a hash value into a "compressed" index, which can be used to address a dense LUT.

The strings LATIN CAPITAL LETTER and LATIN SMALL LETTER give hash codes 52 and 26; they are handled separately. Here is a C code for that:

char find(const char* s)
{
    int hash = 0;
    while (s[1])
    {
        hash = (hash * 89 + *s) % 113;
        ++s;
    }

    if (hash == 52)
        return *s;
    if (hash == 26)
        return *s + 32;

    int result_idx = 0;
    int bit = 0;
    uint32_t skip[] = {0x4a5e2c08, 0xc5cda3bd, 0x04460990, 0x1406};
    do {
        if (skip[bit / 32] & (1 << bit % 32))
            ++result_idx;
        ++bit;
    } while (--hash);

    return "@>=[#`^?-12)%.<~69430!/&}|,;{*7]\"5 :(\\'+8_$"[result_idx];
}

The corresponding assembly language code (MS Visual Studio inline-assembly syntax):

_declspec(naked) char _fastcall find(char* s)
{
    _asm {
        xor eax, eax;
    mycalc:
        imul eax, eax, 89;
        movzx edx, [ecx];
        add eax, edx;
        mov dl, 113;
        div dl;
        shr eax, 8;
        inc ecx;
        cmp byte ptr [ecx + 1], 0;
        jne mycalc;

        call mycont;
        // skip table
        _asm _emit 0x08 _asm _emit 0x2c _asm _emit 0x5e _asm _emit 0x4a;
        _asm _emit 0xbd _asm _emit 0xa3 _asm _emit 0xcd _asm _emit 0xc5;
        _asm _emit 0x90 _asm _emit 0x09 _asm _emit 0x46 _asm _emit 0x04;
        _asm _emit 0x06 _asm _emit 0x14;
        // char table
        _asm _emit '@' _asm _emit '>' _asm _emit '=' _asm _emit '[';
        _asm _emit '#' _asm _emit '`' _asm _emit '^' _asm _emit '?';
        _asm _emit '-' _asm _emit '1' _asm _emit '2' _asm _emit ')';
        _asm _emit '%' _asm _emit '.' _asm _emit '<' _asm _emit '~';
        _asm _emit '6' _asm _emit '9' _asm _emit '4' _asm _emit '3';
        _asm _emit '0' _asm _emit '!' _asm _emit '/' _asm _emit '&';
        _asm _emit '}' _asm _emit '|' _asm _emit ',' _asm _emit ';';
        _asm _emit '{' _asm _emit '*' _asm _emit '7' _asm _emit ']';
        _asm _emit '"' _asm _emit '5' _asm _emit ' ' _asm _emit ':';
        _asm _emit '(' _asm _emit '\\' _asm _emit '\'' _asm _emit '+';
        _asm _emit '8' _asm _emit '_' _asm _emit '$';

    mycont:
        pop edx;
        cmp al, 52;
        je capital_letter;
        cmp al, 26;
        je small_letter;

        xor ecx, ecx;
        xchg al, ah;
    decode_hash_table:
        bt [edx], ecx;
        adc al, 0;
        inc ecx;
        dec ah;
        jnz decode_hash_table;

        mov al, [edx + eax + 14];
        ret;

    capital_letter:
        mov al, [ecx];
        ret;

    small_letter:
        mov al, [ecx];
        add al, 32;
        ret;
    }
}

Some noteworthy implementation details:

  • It uses a CALL instruction to get a pointer to the code, where the hard-coded table resides. In 64-bit mode, it could use the register rip instead.
  • It uses the BT instruction to access the skip table
  • It manages to do the work using only 3 registers eax, ecx, edx, which can be clobbered - so there is no need to save and restore registers
  • When decoding the hash table, it uses al and ah carefully, so that at the right place ah is decreased to 0, and the whole eax register can be used as a LUT index

JavaScript ES6, 228 236 247 257 267 274 287

Note: 7 chars saved thx @ev3commander

Note 2: better than JAPT after 7 major edits,

n=>n<'L'?"XC!DO$MP&OS'SK*N--FU.ZE0TW2HR3OU4FI5IX6EI8NI9EM;LS=R->IA@MF^AV`MM,NE1EN7LO:".replace(/(..)./g,(c,s)=>~n.search(s)?n=c[2]:0)&&n:'~  / ;  |?"\\ ) }]_+ #% < ( {['[(n<'Q')*13+n.length-(n>'T')-4]||n[21]||n[19].toLowerCase()

Run the snippet to test

F=n=>
  n<'L'?"XC!DO$MP&OS'SK*N--FU.ZE0TW2HR3OU4FI5IX6EI8NI9EM;LS=R->IA@MF^AV`MM,NE1EN7LO:"
  .replace(/(..)./g,(c,s)=>~n.search(s)?n=c[2]:0)&&n:
  '~  / ;  |?"\\ ) }]_+ #% < ( {['[(n<'Q')*13+n.length-(n>'T')-4]
  ||n[21]||n[19].toLowerCase()

//TEST
console.log=x=>O.innerHTML+=x+'\n'
;[
['&','AMPERSAND'],
['\'','APOSTROPHE'],
['*','ASTERISK'],
['^','CIRCUMFLEX ACCENT'],
[':','COLON'],
[',','COMMA'],
['@','COMMERCIAL AT'],
['8','DIGIT EIGHT'],
['5','DIGIT FIVE'],
['4','DIGIT FOUR'],
['9','DIGIT NINE'],
['1','DIGIT ONE'],
['7','DIGIT SEVEN'],
['6','DIGIT SIX'],
['3','DIGIT THREE'],
['2','DIGIT TWO'],
['0','DIGIT ZERO'],
['$','DOLLAR SIGN'],
['=','EQUALS SIGN'],
['!','EXCLAMATION MARK'],
['.','FULL STOP'],
['`','GRAVE ACCENT'],
['>','GREATER-THAN SIGN'],
['-','HYPHEN-MINUS'],
['A','LATIN CAPITAL LETTER A'],
['B','LATIN CAPITAL LETTER B'],
['C','LATIN CAPITAL LETTER C'],
['D','LATIN CAPITAL LETTER D'],
['E','LATIN CAPITAL LETTER E'],
['F','LATIN CAPITAL LETTER F'],
['G','LATIN CAPITAL LETTER G'],
['H','LATIN CAPITAL LETTER H'],
['I','LATIN CAPITAL LETTER I'],
['J','LATIN CAPITAL LETTER J'],
['K','LATIN CAPITAL LETTER K'],
['L','LATIN CAPITAL LETTER L'],
['M','LATIN CAPITAL LETTER M'],
['N','LATIN CAPITAL LETTER N'],
['O','LATIN CAPITAL LETTER O'],
['P','LATIN CAPITAL LETTER P'],
['Q','LATIN CAPITAL LETTER Q'],
['R','LATIN CAPITAL LETTER R'],
['S','LATIN CAPITAL LETTER S'],
['T','LATIN CAPITAL LETTER T'],
['U','LATIN CAPITAL LETTER U'],
['V','LATIN CAPITAL LETTER V'],
['W','LATIN CAPITAL LETTER W'],
['X','LATIN CAPITAL LETTER X'],
['Y','LATIN CAPITAL LETTER Y'],
['Z','LATIN CAPITAL LETTER Z'],
['a','LATIN SMALL LETTER A'],
['b','LATIN SMALL LETTER B'],
['c','LATIN SMALL LETTER C'],
['d','LATIN SMALL LETTER D'],
['e','LATIN SMALL LETTER E'],
['f','LATIN SMALL LETTER F'],
['g','LATIN SMALL LETTER G'],
['h','LATIN SMALL LETTER H'],
['i','LATIN SMALL LETTER I'],
['j','LATIN SMALL LETTER J'],
['k','LATIN SMALL LETTER K'],
['l','LATIN SMALL LETTER L'],
['m','LATIN SMALL LETTER M'],
['n','LATIN SMALL LETTER N'],
['o','LATIN SMALL LETTER O'],
['p','LATIN SMALL LETTER P'],
['q','LATIN SMALL LETTER Q'],
['r','LATIN SMALL LETTER R'],
['s','LATIN SMALL LETTER S'],
['t','LATIN SMALL LETTER T'],
['u','LATIN SMALL LETTER U'],
['v','LATIN SMALL LETTER V'],
['w','LATIN SMALL LETTER W'],
['x','LATIN SMALL LETTER X'],
['y','LATIN SMALL LETTER Y'],
['z','LATIN SMALL LETTER Z'],
['{','LEFT CURLY BRACKET'],
['(','LEFT PARENTHESIS'],
['[','LEFT SQUARE BRACKET'],
['<','LESS-THAN SIGN'],
['_','LOW LINE'],
['#','NUMBER SIGN'],
['%','PERCENT SIGN'],
['+','PLUS SIGN'],
['?','QUESTION MARK'],
['"','QUOTATION MARK'],
['\\','REVERSE SOLIDUS'],
['}','RIGHT CURLY BRACKET'],
[')','RIGHT PARENTHESIS'],
[']','RIGHT SQUARE BRACKET'],
[';','SEMICOLON'],
['/','SOLIDUS'],
[' ','SPACE'],
['~','TILDE'],
['|','VERTICAL LINE'],
].forEach(t=>{
  var r=F(t[1]),ok=r==t[0]
  //if (!ok) // uncomment to see just errors
  console.log(r+' ('+t[0]+') '+t[1]+(ok?' OK':' ERROR'))
})
console.log('DONE')
<pre id=O></pre>


Japt, 230 bytes

V=U¯2;Ug21 ªU<'R©Ug19 v ªV¥"DI"©`ze¿twâ¿¿¿¿e¿i`u bUs6,8)/2ªUf"GN" ©"<>+=$#%"g`¤grp¤qºnupe`u bV /2 ªUf"T " ©"[]\{}()"g"QSUCAP"bUg6) ªUf" M" ©"!\"?"g"COE"bUg2) ªV¥"CO"©",:@"g"ANE"bUg4) ª" &'*-./\\;~^`_|"g`spaµp¿豢¿Èögrlove`u bV /2

Each ¿ represents an unprintable Unicode char. Try it online!

Ungolfed:

V=Us0,2;Ug21 ||U<'R&&Ug19 v ||V=="DI"&&"zeontwthfofisiseeini"u bUs6,8)/2||Uf"GN" &&"<>+=$#%"g"legrpleqdonupe"u bV /2 ||Uf"T " &&"[]\{}()"g"QSUCAP"bUg6) ||Uf" M" &&"!\"?"g"COE"bUg2) ||V=="CO"&&",:@"g"ANE"bUg4) ||" &'*-./\\;~^`_|"g"spamapashyfusoreseticigrlove"u bV /2

This was really fun. I've split up the character names into several large chunks:

0. Take first two letters

V=Us0,2; sets variable V to the first two letters of U, the input string. This will come in handy later.

1. Capital letters

This is the easiest: the capital letters are the only ones that have a character at position 21, which all happen to be the correct letter and case. Thus, Ug21 is sufficient.

2. Lowercase letters

Another fairly easy one; the only other name that has a character at position 19 is RIGHT SQUARE BRACKET, so we check if the name is comes before R with U<'R, then if it is (&&), we take the 19th char with Ug19 and cast it to lowercase with v.

3. Digits

These names all start with DI (and fortunately, none of the others), so if V=="DI", we can turn it into a digit. The first letters of some of the digits' names are the same, but the first two letters are sufficient. Combining these into one string, we get ZEONTWTHFOFISISEEINI. Now we can just take the index b of the first two chars in the digit's name with Us6,8)and divide by two.

4. SIGN

There are seven names that contain SIGN:

<    LESS-THAN SIGN
>    GREATER-THAN SIGN
+    PLUS SIGN
=    EQUALS SIGN
$    DOLLAR SIGN
#    NUMBER SIGN
%    PERCENT SIGN

First we check that it the name contains the word SIGN. It turns out GN is sufficient; Uf"GN" returns all instances of GN in the name, which is null if it contains 0 instances, and thus gets skipped.

Now, using the same technique as with the digits, we combine the first two letters into a string LEGRPLEQDONUPE, then take the index and divide by two. This results an a number from 0-6, which we can use to take the corresponding character from the string <>+=$#%.

5. MARK

There are three characters that contain MARK:

!    EXCLAMATION MARK
"    QUOTATION MARK
?    QUESTION MARK

Here we use the same technique as with SIGN.  M is enough to differentiate these three from the others. To translate to a symbol, this time checking one letter is enough: the character at position 2 is different for all three characters. This means we don't have to divide by two when choosing the correct character.

6. LEFT/RIGHT

This group contains the brackets and parentheses, []{}(). It would be really complicated to capture both LEFT and RIGHT, but fortunately, they all contain the string . We check this with the same technique as we did with SIGN. To translate to a symbol, as with MARK, checking one letter is enough; the character at position 6 is unique for all six.

7. CO

The rest of the chars are pretty unique, but not unique enough. Three of them start with CO: COMMA, COLON, and COMMERCIAL AT. We use exactly the same technique as we did with the brackets, choosing the proper symbol based on the character at position 4 (A, N, or E).

8. Everything else

By now, the first two characters are different for every name. We combine them all into one big string SPAMAPASHYFUSORESETICIGRLOVE and map each pair to its corresponding char in  &'*-./\;~^`_|.

9. Final steps

Each of the parts returns an empty string or null if it's not the correct one, so we can link them all from left to right with ||. The || operator returns the left argument if it's truthy, and the right argument otherwise. Japt also has implicit output, so whatever the result, it is automatically sent to the output box.

Questions, comments, and suggestions welcome!