Which hashing algorithm should I use for a safe file checksum?

Choice of hash algorithm

Use SHA-256 or SHA-512: either of the two “main” members of the SHA-2 family. SHA-2 is the successor of SHA-1 and is considered secure. It's the hash to choose unless you have a good reason to choose otherwise. In your case the choice between SHA-256 and SHA-512 doesn't matter. There is a SHA-3, but it isn't very widely supported yet, and it isn't more secure (or less secure) than SHA-2; it's just a different design.

Do not use MD5 or SHA-1. They are not obviously unsuitable in your scenario, but they could be exploited with a bit of extra work. Furthermore the fact that these algorithms are already partially broken makes them more at risk of getting more broken over time.

More precisely, for both of these hashes, it is possible to find collisions: it is possible to find two documents D1 and D2 such that MD5(D1) = MD5(D2) (or SHA-1(D1) = SHA-1(D2)), and such that D1 and D2 each end with a small bit that needs to be calculated and optionally a common chosen suffix. The bit that needs to be calculated will look like garbage, but it can be hidden in a comment, in an image that's shifted off-page, etc. Producing such collisions is trivial on a PC for MD5 and is doable but expensive for SHA-1 (unless you want it for two PDF files, in which case researchers have already spent the money on the calculation to find one and published it).

In your scenario, you mostly don't care about collisions, because you'll be producing D1. You aren't going to craft this bit in the middle. However, there's a risk that somebody could trick you into injecting this bit, for example by supplying an image to include in the document. It would be pretty tricky to achieve a collision that way, but it's doable in principle.

Since there's risk in using MD5, and zero benefit compared to using SHA-256, use SHA-256.
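As a rough illustration of the recommendation above, here is a minimal Python sketch that computes the SHA-256 checksum of a file using the standard `hashlib` module. The function name `sha256_of_file` and the 1 MiB chunk size are my own choices for the example, not anything prescribed.

```python
import hashlib
import sys


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks (1 MiB by default, an arbitrary choice)
    so that large files do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Print "<digest>  <path>" for every file named on the command line.
    for path in sys.argv[1:]:
        print(f"{sha256_of_file(path)}  {path}")
```

Switching to SHA-512 would only mean replacing `hashlib.sha256` with `hashlib.sha512`.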

What to do with a hash

With a non-broken cryptographic hash like SHA-256, what you know is that if two files have the same hash then, for all practical purposes, they're identical. Conversely, if two files have different hashes, then they're definitely different. This means that if you keep a trusted copy of the hash (for example you print it out and store it, or notarize it), then you can tell later “yes, this file you're showing me is the same file” or “no, this file you're showing me is different”.
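The “same file or different file” check can be sketched in a few lines of Python. This assumes the trusted copy of the hash was recorded as a SHA-256 hex string; the function name `matches_trusted_hash` is made up for the example.

```python
import hashlib


def matches_trusted_hash(path, trusted_hex):
    """Return True if the file at `path` has the SHA-256 digest `trusted_hex`.

    `trusted_hex` is the hex string you stored earlier (printed, notarized, ...).
    Surrounding whitespace and letter case are ignored when comparing.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == trusted_hex.strip().lower()
```

A `True` result means “this is the same file”; `False` means “this file is different”, with no indication of how different.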

Knowing the hash of the file doesn't prove that you wrote it. There's no cryptographic way to prove authorship. The best you can do is to prove that you had the file earlier than anyone else who can prove it. You can do that without revealing the file by communicating the hash to a third party who everyone trusts to correctly remember the date at which you showed them the hash; this third party could be a public notary, or the Wayback Machine if you put the hash on a web page that it indexes. (If you publish the hash, then in theory someone could figure out the file from it, but there's no better way to do that than to try all plausible files until they find the right one. If you are concerned about this then use a signature of the file instead of a hash, and notarize the signature and the public key but keep the private key to yourself.)
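To make the “try all plausible files” caveat concrete, here is a small Python sketch of such a guessing attack. The document format (`Offer: <n> EUR`) and the candidate range are invented purely for illustration; the point is only that a published hash of a file with few plausible contents can be matched by enumeration.

```python
import hashlib


def guess_preimage(published_hex, candidates):
    """Return the first candidate whose SHA-256 digest equals `published_hex`.

    `candidates` is any iterable of bytes objects; returns None if none match.
    """
    for candidate in candidates:
        if hashlib.sha256(candidate).hexdigest() == published_hex:
            return candidate
    return None


# Hypothetical scenario: the hash of a short, highly predictable file was published.
published = hashlib.sha256(b"Offer: 4200 EUR").hexdigest()

# The attacker enumerates every plausible content and compares hashes.
candidates = (f"Offer: {n} EUR".encode() for n in range(10000))
print(guess_preimage(published, candidates))
```

For a file with enough unpredictable content, the set of plausible candidates is far too large for this to work.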

Example of something a hash is good for: your customer wants support, but you're only prepared to support your original product and not a modified product. So you get them to calculate the hash of what they want you to support. If the hash value is not what you provided, you refuse to provide support. Note that you need to trust the customer to calculate the hash of the product, and not calculate the hash of some copy of the original or read it off the delivery slip.

Example of something a hash is not good for: somebody else claims that they're the author of the document. You say “no, look, I know its hash, it's 1234…”. That doesn't help: anybody can calculate the hash.

Example of something a hash is good for if used appropriately: somebody else claims that they just wrote the document. You say “no, look, I notarized the hash last year, so you can't have written it last week”.

Example of something a hash is not good for: somebody makes a slight modification of the document. It'll then have a different hash. All you can say is that the document is now different, but that doesn't convey any information about how different it is. The hash of a completely different document is just as different as the hash of a version with a typo fix, or a version that's encoded differently.
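A quick way to see this for yourself is to count how many bits differ between digests. The sketch below uses made-up sample texts; the helper name `differing_bits` is also just for illustration.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which the SHA-256 digests of `a` and `b` differ."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")


original = b"Insert the plug into the socket.\n"
typo_fix = b"Insert the plug into the socket!\n"   # one character changed
rewrite = b"A completely unrelated document.\n"    # everything changed

print(differing_bits(original, typo_fix))
print(differing_bits(original, rewrite))
```

With a good hash, both counts should typically land somewhere around half of the 256 digest bits, so the size of the number tells you nothing about the size of the change.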


For ensuring that work product is unchanged, even MD5 is reasonable.

The ability of an attacker to engineer a collision is dangerous when they may, for example, generate an executable. That executable may take 500 KB to do something bad, and spend another 50,000 KB spinning out unused bits just to get the collision. That's okay if those bits are unused; you simply see an executable with the right hash, and you're fooled.

To engineer a collision that both matches the MD5 hash -and- represents credibly incorrect documentation is not feasible. You're more likely to end up with documentation that reads "Take the plug and insert it into the $#WG%ga 940[2aj2'rj09[3j59g;qa1j; socket" - anybody who looks at that will realize the documentation has been tampered with. Even a phased array of Shakespearean monkeys can't spin an MD5 collision that still looks like documentation.

Looking more closely, I see it's not the documentation you're protecting; you'd include the hash of the "specific file which will be the result of the work I do for them". Again, not knowing what that file is - executable? source code? - it is computationally infeasible that they could modify it in such a way as to credibly claim it is what you gave them, and engineer a hash collision at the same time.

See also this answer on Crypto.SE which summarizes:

MD5 is currently considered too weak to work as a cryptographic hash. However, for all traditional (i.e. non-cryptographic) hash uses MD5 is often perfectly fine.

You're not looking at a cryptographic use of a hash, so MD5 is fine for you. It will prevent the substitution of a modified or credibly forged version of the work product you've provided them.