Tweetable hash function challenge

Python, 5333 4991

I believe this is the first contender to score significantly better than a random oracle.

def H(s):n=int(s.encode('hex'),16);return n%(8**8-ord('+%:5O![/5;QwrXsIf]\'k#!__u5O}nQ~{;/~{CutM;ItulA{uOk_7"ud-o?y<Cn~-`bl_Yb'[n%70]))

Python 2, 140 bytes, 4266 colliding words

I didn’t really want to start with the non-printable bytes thing given their unclear tweetability, but well, I didn’t start it. :-P

00000000: efbb bf64 6566 2066 2873 293a 6e3d 696e  ...def f(s):n=in
00000010: 7428 732e 656e 636f 6465 2827 6865 7827  t(s.encode('hex'
00000020: 292c 3336 293b 7265 7475 726e 206e 2528  ),36);return n%(
00000030: 382a 2a38 2b31 2d32 3130 2a6f 7264 2827  8**8+1-210*ord('
00000040: 6f8e 474c 9f5a b49a 01ad c47f cf84 7b53  o.GL.Z........{S
00000050: 49ea c71b 29cb 929a a53b fc62 3afb e38e  I...)....;.b:...
00000060: e533 7360 982a 50a0 2a82 1f7d 768c 7877  .3s`.*P.*..}v.xw
00000070: d78a cb4f c5ef 9bdb 57b4 7745 3a07 8cb0  ...O....W.wE:...
00000080: 868f a927 5b6e 2536 375d 2929            ...'[n%67]))

Python 2, 140 printable bytes, 4662 4471 4362 colliding words

def f(s):n=int(s.encode('hex'),16);return n%(8**8+3-60*ord('4BZp%(jTvy"WTf.[Lbjk6,-[LVbSvF[Vtw2e,NsR?:VxC0h5%m}F5,%d7Kt5@SxSYX-=$N>'[n%71]))

Inspired by the form of kasperd’s solution, obviously—but with the important addition of an affine transformation on the modulus space, and entirely different parameters.


CJam, 4125 3937 3791 3677

0000000: 7b 5f 39 62 31 31 30 25 5f 22 7d 13 25 77  {_9b110%_"}.%w
000000e: 77 5c 22 0c e1 f5 7b 83 45 85 c0 ed 08 10  w\"...{.E.....
000001c: d3 46 0c 5c 22 59 f8 da 7b f8 18 14 8e 4b  .F.\"Y..{....K
000002a: 3a c1 9e 97 f8 f2 5c 18 21 63 13 c8 d3 86  :.....\.!c....
0000038: 45 8e 64 33 61 50 96 c4 48 ea 54 3b b3 ab  E.d3aP..H.T;..
0000046: bc 90 bc 24 21 20 50 30 85 5f 7d 7d 59 2c  ...$! P0._}}Y,
0000054: 4a 67 88 c8 94 29 1a 1a 1a 0f 38 c5 8a 49  Jg...)....8..I
0000062: 9b 54 90 b3 bd 23 c6 ed 26 ad b6 79 89 6f  .T...#..&..y.o
0000070: bd 2f 44 6c f5 3f ae af 62 9b 22 3d 69 40  ./Dl.?..b."=i@
000007e: 62 31 35 32 35 31 39 25 31 31 30 2a 2b 7d  b152519%110*+}

This approach divides domain and codomain into 110 disjoint sets, and defines a slightly different hash function for each pair.

Scoring / Verification

$ echo $LANG
en_US
$ cat gen.cjam
"qN%{_9b110%_"
[125 19 37 119 119 34 12 225 245 123 131 69 133 192 237 8 16 211 70 12 34 89 248 218 123 248 24 20 142 75 58 193 158 151 248 242 92 24 33 99 19 200 211 134 69 142 100 51 97 80 150 196 72 234 84 59 179 171 188 144 188 36 33 32 80 48 133 95 125 125 89 44 74 103 136 200 148 41 26 26 26 15 56 197 138 73 155 84 144 179 189 35 198 237 38 173 182 121 137 111 189 47 68 108 245 63 174 175 98 155]
:c`"=i@b152519%110*+}%N*N"
$ cjam gen.cjam > test.cjam
$ cjam test.cjam < british-english-huge.txt | sort -n > temp
$ head -1 temp
8
$ tail -1 temp
16776899
$ all=$(wc -l < british-english-huge.txt)
$ unique=$(uniq -u < temp | wc -l)
$ echo $[all - unique]
3677

The following port to Python can be used with the official scoring snippet:

h=lambda s,b:len(s)and ord(s[-1])+b*h(s[:-1],b)

def H(s):
 p=h(s,9)%110
 return h(s,ord(
  '}\x13%ww"\x0c\xe1\xf5{\x83E\x85\xc0\xed\x08\x10\xd3F\x0c"Y\xf8\xda{\xf8\x18\x14\x8eK:\xc1\x9e\x97\xf8\xf2\\\x18!c\x13\xc8\xd3\x86E\x8ed3aP\x96\xc4H\xeaT;\xb3\xab\xbc\x90\xbc$! P0\x85_}}Y,Jg\x88\xc8\x94)\x1a\x1a\x1a\x0f8\xc5\x8aI\x9bT\x90\xb3\xbd#\xc6\xed&\xad\xb6y\x89o\xbd/Dl\xf5?\xae\xafb\x9b'
  [p]))%152519*110+p