Fast hash for strings

Python has a built-in hash() function that's very fast and perfect for most uses:

>>> hash("dfds")
3591916071403198536

You can then make it unsigned:

>>> import ctypes
>>> hashu = lambda word: ctypes.c_uint64(hash(word)).value

You can then turn the 8-byte value into a 16-character hex string:

>>> hashu("dfds").to_bytes(8,"big").hex()

Or an N*2-character hex string, where N <= 8:

>>> hashn = lambda word, N: (hashu(word) % (2**(N*8))).to_bytes(N, "big").hex()

...etc. And if you want N to be larger than 8 bytes, you can just hash twice and combine the results. Python's built-in hash is so much faster that hashlib is rarely worth using unless you need actual cryptographic security, not just collision resistance.

>>> hashnbig = lambda word, N: ((hashu(word) + 2**64*hashu(word + "2")) % (2**(N*8))).to_bytes(N, "big").hex()

And finally, use URL-safe base64 encoding to get a much nicer string than hex gives you:

>>> from base64 import urlsafe_b64encode
>>> hashnbigu = lambda word, N: urlsafe_b64encode(((hashu(word) + 2**64*hashu(word + "2")) % (2**(N*8))).to_bytes(N, "big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'

Caveats:

  • Be warned that in Python 3.3 and up, hash() of strings is randomized per process, so the values are not stable across runs, which rules this out for some use cases. You can disable the randomization by setting PYTHONHASHSEED=0.

  • See https://github.com/flier/pyfasthash for fast, stable hashes that similarly won't overload your CPU for non-cryptographic applications.

  • Don't use this lambda style in real code... write it out (see the written-out sketch after this list)! And stuffing magic numbers like 2**64 into your code instead of naming them as constants is bad form.

  • In the end, 8 bytes of collision resistance is OK for smaller applications... with less than a million entries, the birthday-bound collision odds are roughly 0.000003%. That's a 12-character base64 string (11 after stripping the '=' padding). But it might not be enough for larger apps.

  • 16 bytes is enough for a UUID/OID in a cache, etc.
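For reference, here is a written-out (non-lambda) sketch of the same idea. The function name hash_to_token and its parameters are made up for illustration, and the output is only stable across runs if PYTHONHASHSEED is pinned as noted above:

    import ctypes
    from base64 import urlsafe_b64encode

    UINT64_SPACE = 2**64  # size of the unsigned 64-bit range hash() is folded into

    def hash_to_token(word: str, nbytes: int = 16) -> str:
        """Non-cryptographic token of nbytes bytes, base64-encoded without padding."""
        def hashu(s: str) -> int:
            # fold the signed hash() result into an unsigned 64-bit value
            return ctypes.c_uint64(hash(s)).value
        # hash twice and stack the results so we can fill more than 8 bytes
        combined = hashu(word) + UINT64_SPACE * hashu(word + "2")
        data = (combined % (2**(nbytes * 8))).to_bytes(nbytes, "big")
        return urlsafe_b64encode(data).decode("ascii").rstrip("=")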

Speed comparison for producing 300k 16-byte hashes from a bytes input:

builtin: 0.188
md5: 0.359
fnvhash_c: 0.113

For a complex input (a tuple of 3 integers, for example), you have to convert it to bytes to use the non-builtin hashes; this adds a lot of conversion overhead, making the builtin shine.

builtin: 0.197
md5: 0.603
fnvhash_c: 0.284
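The numbers above are from the author's machine. A rough sketch of how such a comparison could be reproduced with only the standard library might look like the following; the sample input data is made up, and the fnvhash_c column would additionally need a third-party FNV module such as pyfasthash:

    import hashlib
    import time

    # 300k short byte strings as test input (made-up sample data)
    words = [("word%d" % i).encode() for i in range(300_000)]

    def bench(label, fn):
        start = time.perf_counter()
        for w in words:
            fn(w)
        print(label, round(time.perf_counter() - start, 3))

    bench("builtin:", hash)                            # 8-byte hash; hash twice to fill 16 bytes
    bench("md5:", lambda w: hashlib.md5(w).digest())   # 16-byte digest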

I guess this question is off-topic because it is opinion-based, but at least one hint for you: I know the FNV hash because it is used by The Sims 3 to find resources based on their names across the different content packages. They use the 64-bit version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement, if no module satisfies you (pyfasthash has an implementation of it, for example).
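If you do want to implement it yourself, a minimal sketch of the 64-bit FNV-1a variant (constants from the FNV specification) could look like this:

    def fnv1a_64(data: bytes) -> int:
        """64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime."""
        h = 0xcbf29ce484222325                              # 64-bit FNV offset basis
        for byte in data:
            h ^= byte
            h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF    # 64-bit FNV prime, keep 64 bits
        return h

Calling fnv1a_64("some resource name".encode()) then gives a 64-bit integer you can encode as below.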

To get a short string out of it, I would suggest you use base64 encoding. For example, this is the size of a base64-encoded 64-bit hash: nsTYVQUag88= (and you can get rid of the padding =).
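As a sketch of that encoding step (any 64-bit integer hash value works here; the value below is just a placeholder):

    from base64 import b64encode

    value = 0x123456789ABCDEF0                           # placeholder for a 64-bit hash value
    token = b64encode(value.to_bytes(8, "big")).decode("ascii").rstrip("=")   # 11 chars once '=' is stripped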

Edit: I finally had the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144