What is the fastest hash algorithm to check if two files are equal?

One approach might be to use a simple CRC-32 algorithm, and only if the CRC values match, re-check with SHA-1 or something more robust. A fast CRC-32 will outperform a cryptographically secure hash any day.
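For example, a rough sketch of that two-stage check in Python, using only the standard library (zlib.crc32 for the cheap pass, hashlib.sha1 for the confirmation; the chunk size and helper names are just placeholders of mine):

    import hashlib
    import zlib

    CHUNK = 1 << 20  # read 1 MiB at a time

    def crc32_of(path):
        # Cheap first pass: CRC-32 over the whole file.
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(CHUNK), b""):
                crc = zlib.crc32(chunk, crc)
        return crc

    def sha1_of(path):
        # More robust second pass, only run when the CRCs match.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(CHUNK), b""):
                h.update(chunk)
        return h.digest()

    def files_equal(a, b):
        if crc32_of(a) != crc32_of(b):
            return False  # CRC mismatch: the files definitely differ
        return sha1_of(a) == sha1_of(b)  # CRCs match: confirm with SHA-1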


xxHash bills itself as quite fast and strong, collision-wise:

http://cyan4973.github.io/xxHash/

There is a 64-bit variant that runs "even faster" overall on 64-bit processors than the 32-bit one, though slower on 32-bit processors (go figure).
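If you want to try it from Python, there is a third-party xxhash binding; the package name and its hashlib-style interface below are an assumption on my part, not something the original library documentation promises:

    # pip install xxhash  (third-party binding, assumed here)
    import xxhash

    def xxh64_of(path, chunk_size=1 << 20):
        # xxh64 exposes the usual update()/hexdigest() style of interface.
        h = xxhash.xxh64()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()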

http://code.google.com/p/crcutil is also said to be quite fast (it leverages hardware CRC instructions where present, which are very fast; without hardware support it isn't as fast). I don't know whether CRC32c is as good a hash (in terms of collisions) as xxHash or not...

https://code.google.com/p/cityhash/ seems similar and related to crcutil [in that it can compile down to use hardware CRC32c instructions if instructed].

If you "just want the fastest raw speed" and don't care as much about quality of random distribution of the hash output (for instance, with small sets, or where speed is paramount), there are some fast algorithms mentioned here: http://www.sanmayce.com/Fastest_Hash/ (these "not quite random" distribution type algorithms are, in some cases, "good enough" and very fast). Apparently FNV1A_Jesteress is the fastest for "long" strings, some others possibly for small strings. http://locklessinc.com/articles/fast_hash/ also seems related. I did not research to see what the collision properties of these are.

The latest hotness seems to be https://github.com/erthink/t1ha and https://github.com/wangyi-fudan/wyhash, and xxHash also has a slightly updated version.


Unless you're using a really complicated and/or slow hash, loading the data from the disk is going to take much longer than computing the hash (unless you use RAM disks or top-end SSDs).

So to compare two files, use this algorithm:

  • Compare sizes
  • Compare dates (be careful here: this can give you the wrong answer; test whether it is reliable in your setup)
  • Compare the hashes

This allows for a fast fail (if the sizes are different, you know that the files are different).
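A minimal sketch of that three-step check in Python (the helper names and the SHA-1 placeholder hash are mine; swap in whichever hash you settled on above, and only trust mtimes if they are reliable in your setup):

    import hashlib
    import os

    def _hash_of(path, chunk_size=1 << 20):
        # Placeholder hash; substitute CRC-32, xxHash, etc. as discussed above.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.digest()

    def files_probably_equal(a, b, trust_mtime=False):
        sa, sb = os.stat(a), os.stat(b)
        if sa.st_size != sb.st_size:
            return False                   # different sizes: fast fail
        if trust_mtime and sa.st_mtime == sb.st_mtime:
            return True                    # only if dates are trustworthy for you
        return _hash_of(a) == _hash_of(b)  # last resort: hash both files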

To make things even faster, you can compute the hash once and save it along with the file. Also save the file date and size into this extra file, so you know quickly when you have to recompute the hash or delete the hash file when the main file changes.
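One way that caching could look, as a sketch (the sidecar-file layout, the JSON format, and the ".hash" suffix are assumptions of mine, not a standard):

    import hashlib
    import json
    import os

    def cached_hash(path, cache_suffix=".hash"):
        # Reuse the stored hash when the cached size and mtime still match.
        st = os.stat(path)
        cache_path = path + cache_suffix
        try:
            with open(cache_path) as f:
                cached = json.load(f)
            if cached["size"] == st.st_size and cached["mtime"] == st.st_mtime:
                return cached["hash"]
        except (OSError, ValueError, KeyError):
            pass  # missing or unreadable cache: fall through and recompute

        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digest = h.hexdigest()

        with open(cache_path, "w") as f:
            json.dump({"size": st.st_size, "mtime": st.st_mtime, "hash": digest}, f)
        return digest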

Tags: file, hash, crc