What is the fastest way to check if files are identical?

Most people in their responses are ignoring the fact that the files must be compared repeatedly. Thus checksums are faster, since each checksum is calculated once and kept in memory, instead of the files being read sequentially n times.
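A minimal sketch of that idea in C, assuming the file names arrive on the command line: each file is hashed exactly once, and all later comparisons are done on the cached value. FNV-1a is used only to keep the example self-contained; a real setup would more likely use MD5 or SHA-256.

    #include <stdio.h>
    #include <stdint.h>

    /* Hash a whole file once; the result can be cached and compared cheaply. */
    static uint64_t hash_file(const char *path)
    {
        uint64_t h = 1469598103934665603ULL;    /* FNV-1a offset basis */
        unsigned char buf[1 << 16];
        size_t n;
        FILE *f = fopen(path, "rb");
        if (!f)
            return 0;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            for (size_t i = 0; i < n; i++) {
                h ^= buf[i];
                h *= 1099511628211ULL;          /* FNV-1a prime */
            }
        fclose(f);
        return h;
    }

    int main(int argc, char **argv)
    {
        if (argc < 3)
            return 1;
        /* Each file is read exactly once; every comparison after that is O(1). */
        uint64_t first = hash_file(argv[1]);
        for (int i = 2; i < argc; i++)
            if (hash_file(argv[i]) != first)
                printf("%s differs from %s\n", argv[i], argv[1]);
        return 0;
    }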


Assuming that the expectation is that the files will be the same (it sounds like that's the scenario), then dealing with checksums/hashes is a waste of time - it's likely that they'll be the same and you'd have to re-read the files to get the final proof (I'm also assuming that since you want to "prove ... they are the same", having them hash to the same value is not good enough).

If that's the case, I think the solution proposed by David is pretty close to what you'd need to do. A couple of things could be done to optimize the comparison, in increasing order of complexity (a sketch of the first two items follows the list):

  • check if the file sizes are the same before doing the compare
  • use the fastest memcmp() that you can (comparing words instead of bytes - most C runtimes should do this already)
  • use multiple threads to do the memory block compares (up to the number of processors available on the system; going beyond that would cause your threads to fight each other)
  • use overlapped/asynchronous I/O to keep the I/O channels as busy as possible, but also profile carefully so you thrash between the files as little as possible (if the files are divided among several different disks and I/O ports, all the better)
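A rough sketch of the first two items, assuming POSIX stat() is available: bail out on a size mismatch, then compare 64 KB blocks with memcmp() and stop at the first difference. The threading and overlapped-I/O items are left out to keep it short.

    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    /* Returns 1 if the two files are byte-for-byte identical, 0 otherwise. */
    static int files_identical(const char *a, const char *b)
    {
        struct stat sa, sb;
        if (stat(a, &sa) != 0 || stat(b, &sb) != 0 || sa.st_size != sb.st_size)
            return 0;                   /* different sizes: cannot be equal */

        FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
        if (!fa || !fb) {
            if (fa) fclose(fa);
            if (fb) fclose(fb);
            return 0;
        }

        unsigned char ba[1 << 16], bb[1 << 16];
        size_t na, nb;
        int same = 1;
        do {
            na = fread(ba, 1, sizeof ba, fa);
            nb = fread(bb, 1, sizeof bb, fb);
            if (na != nb || memcmp(ba, bb, na) != 0) {
                same = 0;               /* early exit on the first mismatch */
                break;
            }
        } while (na > 0);

        fclose(fa);
        fclose(fb);
        return same;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3)
            return 1;
        printf("%s\n", files_identical(argv[1], argv[2]) ? "identical" : "different");
        return 0;
    }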

Update: Don't get stuck on the fact that they are source files. Pretend, for example, that you took a million runs of a program with very regulated output. You want to prove all 1,000,000 versions of the output are the same.

If you have control over the output, have the program that creates the files/output compute an MD5 on the fly and embed it in the file or output stream, or even pipe the output through a program that computes the MD5 along the way and stores it alongside the data somehow. The point is to do the calculation while the bytes are already in memory.
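A sketch of that "hash while the bytes are still in memory" idea, assuming you control the writing code. The hashed_writer wrapper and file names are made up for illustration, and FNV-1a again stands in for MD5 so the example needs no extra libraries:

    #include <stdio.h>
    #include <stdint.h>

    typedef struct {
        FILE     *out;
        uint64_t  hash;
    } hashed_writer;

    static void hw_open(hashed_writer *w, const char *path)
    {
        w->out  = fopen(path, "wb");
        w->hash = 1469598103934665603ULL;       /* FNV-1a offset basis */
    }

    /* Update the running hash from the in-memory buffer, then write it out. */
    static void hw_write(hashed_writer *w, const void *data, size_t len)
    {
        const unsigned char *p = data;
        for (size_t i = 0; i < len; i++) {
            w->hash ^= p[i];
            w->hash *= 1099511628211ULL;        /* FNV-1a prime */
        }
        if (w->out)
            fwrite(data, 1, len, w->out);
    }

    /* Close the output and store the digest alongside the data. */
    static void hw_close(hashed_writer *w, const char *digest_path)
    {
        if (w->out)
            fclose(w->out);
        FILE *d = fopen(digest_path, "w");
        if (d) {
            fprintf(d, "%016llx\n", (unsigned long long)w->hash);
            fclose(d);
        }
    }

    int main(void)
    {
        hashed_writer w;
        hw_open(&w, "run_0001.out");            /* hypothetical output file */
        hw_write(&w, "program output...\n", 18);
        hw_close(&w, "run_0001.out.sum");       /* digest stored next to the data */
        return 0;
    }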

If you can't pull this off then, like others have said, check file sizes and then do a straight byte-by-byte comparison on same-sized files. I don't see how any sort of binary division or MD5 calculation is better than a straight comparison: you will have to touch every byte to prove equality any way you cut it, so you might as well cut the amount of computation needed per byte and gain the ability to stop as soon as you find a mismatch.

The MD5 calculation would be useful if you plan to compare these against new outputs later, but then you're basically back to my first point: calculate the MD5 as soon as possible.


I'd opt for something like the approach taken by the cmp program: open two files (say file 1 and file 2), read a block from each, and compare them byte-by-byte. If they match, read the next block from each, compare them byte-by-byte, etc. If you get to the end of both files without detecting any differences, seek to the beginning of file 1, close file 2 and open file 3 in its place, and repeat until you've checked all files. I don't think there's any way to avoid reading all bytes of all files if they are in fact all identical, but I think this approach is (or is close to) the fastest way to detect any difference that may exist.

OP Modification: lifted up an important comment from Mark Bessey:

"another obvious optimization if the files are expected to be mostly identical, and if they're relatively small, is to keep one of the files entirely in memory. That cuts way down on thrashing trying to read two files at once."