Is it secure to use MD5 to verify the integrity of small files (less than 15kb)?

The size of the input is irrelevant. In fact, inputs no longer than the digest itself (16 bytes for MD5) are already enough for collisions to exist, and because of the birthday paradox they turn up far sooner than intuition suggests. The best way to avoid collisions is to use a stronger hash which is not vulnerable to them, such as SHA-2. However, you are describing a more difficult attack than a collision attack, called a preimage attack, against which MD5 is still safe in practice.

There are three types of attacks* that result in having two files with the same digest:

  • 1st preimage - Find an input that resolves to a specific hash.

  • 2nd preimage - Modify an input without changing the resulting hash.

  • Collision - Find any two distinct inputs that have the same hash.

These are vulnerabilities when they can be carried out more efficiently than by brute force search. Collisions can still occur naturally, and in fact the pigeonhole principle guarantees that they exist as soon as there are more possible inputs than possible digests, but hashes are designed to make them difficult to produce intentionally. For a hash with an output the size of MD5's, the chance of a random, accidental collision is extremely low. Even if you hash 6 billion random files per second, it would take roughly 100 years before you approach a 50% chance of two hashes colliding. MD5 is great for detecting accidental corruption.
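The accidental-collision estimate above can be checked with the usual birthday approximation. This is only a sketch: the function name and the 32-bit comparison case are illustrative choices, and digests are assumed to be uniformly random.

```python
import math

def birthday_collision_probability(num_hashes: int, digest_bits: int) -> float:
    """Approximate chance that at least two of num_hashes random digests collide."""
    space = 2 ** digest_bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is tiny.
    exponent = -num_hashes * (num_hashes - 1) / (2 * space)
    return -math.expm1(exponent)

SECONDS_PER_YEAR = 365.25 * 24 * 3600
hashes_in_100_years = int(6e9 * SECONDS_PER_YEAR * 100)

# 128-bit digest (the size of MD5's output) at 6 billion hashes per second
p_md5 = birthday_collision_probability(hashes_in_100_years, 128)
# For contrast: a 32-bit checksum over only 100,000 files
p_32bit = birthday_collision_probability(100_000, 32)

print(f"128-bit digest after 100 years: {p_md5:.2f}")
print(f"32-bit digest after 100,000 files: {p_32bit:.2f}")
```

Under this approximation the 128-bit case comes out a little above 40%, the same ballpark as the figure quoted above, while a 32-bit checksum is already more likely than not to collide after a mere hundred thousand files.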

A strong n-bit hash function is designed to have a security level of 2^n against both 1st and 2nd preimage attacks, and a security level of 2^(n/2) against collision attacks. For a 128-bit hash like MD5, this means it was designed to have a security level of 2^128 against preimages and 2^64 against collisions. As attacks improve, the actual security level it can provide is slowly chipped away.
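The gap between 2^(n/2) and 2^n is easy to see on a toy scale. The sketch below uses a made-up 16-bit hash (SHA-256 truncated to 16 bits, purely so that both searches finish in well under a second) and counts how many attempts a generic collision search and a generic preimage search need. Nothing here is specific to MD5; the message formats are arbitrary.

```python
import hashlib
from itertools import count

BITS = 16  # toy digest size, chosen small so both searches finish quickly

def toy_hash(data: bytes) -> int:
    """Hypothetical weak hash: the first BITS bits of SHA-256."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") >> (256 - BITS)

def find_collision():
    """Birthday search: remember every digest seen until one repeats."""
    seen = {}
    for i in count():
        msg = b"msg-%d" % i
        digest = toy_hash(msg)
        if digest in seen:
            return seen[digest], msg, i + 1
        seen[digest] = msg

def find_preimage(target: int):
    """Brute-force search for any input that hashes to one fixed digest."""
    for i in count():
        msg = b"guess-%d" % i
        if toy_hash(msg) == target:
            return msg, i + 1

m1, m2, collision_tries = find_collision()
preimage, preimage_tries = find_preimage(toy_hash(b"target file"))

print(f"collision after {collision_tries} tries (birthday bound is about 2^{BITS // 2})")
print(f"preimage after {preimage_tries} tries (expected about 2^{BITS})")
```

On a typical run the collision search needs a few hundred attempts and the preimage search tens of thousands, which is the same square-root relationship that separates 2^64 from 2^128 for a 128-bit hash.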

MD5 is vulnerable to a collision attack requiring the equivalent of only 2^18 hash invocations instead of the intended 2^64 to exploit. Unless the attacker generates both files, it is not a collision attack. An attacker who has a file and wants to maliciously modify it without the hash changing would need to mount a 2nd preimage attack, which is completely infeasible against MD5 with modern technology (the best attack has a complexity of 2^123.4, compared to MD5's theoretical maximum of 2^128). Collision attacks are relevant in different situations. For example, if you are given an executable made by an attacker without a backdoor, you may hash it and save the hash. That executable could then later be replaced with a backdoored version, yet the hash would be the same as the benign one! This is also a problem for certificates where someone could submit a certificate for a domain they do own, but the certificate would intentionally collide with one for a domain they do not own.

It is safe to use MD5 to verify files as long as the stored hash is not subject to tampering and can be trusted to be correct, and as long as the files being verified were not created (or influenced!) by an attacker. It may still be a good idea to use a stronger hash however, simply to prevent a potential practical preimage attack against MD5 in the future from putting your data at risk. If you want a modern hash that is very fast but still cryptographically secure, you may want to look at BLAKE2.
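As a rough illustration of checking a file against a trusted, known-correct digest, here is a minimal sketch using Python's standard hashlib. The helper names file_digest and verify_file and the temporary file are made up for the example; BLAKE2b is used because it is mentioned above, and any algorithm name accepted by hashlib.new would work the same way.

```python
import hashlib
import hmac
import os
import tempfile

def file_digest(path: str, algorithm: str = "sha256") -> str:
    """Hash a file in chunks so that large files do not need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_file(path: str, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare the file's digest with a trusted hex digest stored elsewhere."""
    return hmac.compare_digest(file_digest(path, algorithm), expected_hex.lower())

# Demonstration on a throwaway file (hypothetical name and contents)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "small_file.bin")
    with open(path, "wb") as f:
        f.write(b"example contents")

    trusted = file_digest(path, "blake2b")  # recorded while the file is known good
    ok_before = verify_file(path, trusted, "blake2b")

    with open(path, "ab") as f:  # simulate a later modification
        f.write(b"!")
    ok_after = verify_file(path, trusted, "blake2b")

print(ok_before, ok_after)
```

The check is only as trustworthy as the place where the reference digest is kept, which is the condition stated above.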

* While there are other attacks against MD5 such as length extension attacks that affect all Merkle–Damgård hashes as mentioned by @LieRyan, these are not relevant for verifying the integrity of a file against a known-correct hash.

A variant of the collision attack called a chosen-prefix collision attack is able to take two arbitrary messages (prefixes) and find two values that, when appended to each message, result in a colliding digest. This attack is more difficult to pull off than a classic collision attack. Like the length extension attack, this only applies to Merkle–Damgård hashes.


It depends on what you want to defend yourself against

Security is never a one-size-fits-all game. If it were, then there would not be 12941 different hash algorithms. Instead, you need to understand that every security measure defends you against a specific sort of attack. You put a password in your computer to defend against random people accessing it, not because it's so fun to type whereD1DweG0sowron6 whenever you log in.

As for hash algorithms, you can roughly classify them as "cryptographic hashes" and "non-cryptographic hashes". Cryptographic hash algorithms are designed to withstand a number of attacks, while non-cryptographic hashes are designed to be as fast as possible.[1] MD5, for example, is considered a cryptographic hash, but so broken that it's only usable as a non-cryptographic hash.

When to use a non-cryptographic hash

If your goal is to detect bit-flips when copying a file from one location to another (say, a thumb drive to a laptop), then MD5 is absolutely the right choice. I would even go as far as saying any fast, non-cryptographic hash is good. When you copy files, you realistically do not need to fear attacker interference. If you are paranoid about hackers being able to modify your kernel, then adding hashes will not solve your problems.
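A small sketch of that use case, with the copy simulated in memory: a single flipped bit changes the MD5 checksum. The data and the flipped position are arbitrary, and usedforsecurity=False assumes Python 3.9 or newer, where it marks the call as a non-security use.

```python
import hashlib

def md5_checksum(data: bytes) -> str:
    # usedforsecurity=False (Python 3.9+): MD5 serves as a plain checksum here
    return hashlib.md5(data, usedforsecurity=False).hexdigest()

original = bytes(range(256)) * 60   # 15,360 bytes, about the size of a small file
good_copy = bytes(original)

bad_copy = bytearray(original)
bad_copy[1234] ^= 0x01              # simulate one bit flipping in transit

print("intact copy matches:   ", md5_checksum(good_copy) == md5_checksum(original))
print("corrupted copy matches:", md5_checksum(bytes(bad_copy)) == md5_checksum(original))
```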

Verifying file integrity with attacker interference

If you intend to sign and publish those files, then an attacker might be able to craft a legitimate-looking file with the same hash, meaning that your signature is just as valid on the malicious file.

An example

Let's say your original message m1 looks like this:

I hereby declare that the bunny rules!

You apply your hash function h(m1) and obtain the digest d1. Afterwards, you sign the digest d1 and get a signature s1.

You then publish your message m1, your signature s1 and your hash function h().

I might be the attacker in the scenario and craft a message m2 that has the exact same hash in your chosen hash function:

It is publicly known that dogs are better than bunnies in every regard...

Since h(m1) = h(m2) = d1, the signature s1 is valid for both your original m1 and my malicious m2.
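The scenario can be played through on a toy scale. Everything in this sketch is a stand-in: weak_hash is a deliberately tiny 16-bit hash (not MD5) so that the attacker's search finishes quickly, and an HMAC tag over the digest plays the role of the signature, whereas a real deployment would use a public-key signature scheme. The point is only that whatever signs d1 cannot tell m1 and m2 apart.

```python
import hashlib
import hmac
from itertools import count

def weak_hash(message: bytes) -> bytes:
    """Hypothetical weak h(): only the first 2 bytes of SHA-256."""
    return hashlib.sha256(message).digest()[:2]

SIGNING_KEY = b"signer-secret"  # illustration only

def sign_digest(digest: bytes) -> bytes:
    """Stand-in for a real signature: an HMAC tag computed over the digest."""
    return hmac.new(SIGNING_KEY, digest, hashlib.sha256).digest()

def verify(message: bytes, signature: bytes) -> bool:
    """Hash the message, then check the signature over that digest."""
    return hmac.compare_digest(sign_digest(weak_hash(message)), signature)

# The signer publishes m1 and s1
m1 = b"I hereby declare that the bunny rules!"
d1 = weak_hash(m1)
s1 = sign_digest(d1)

# The attacker varies a trailing counter until the weak digest equals d1
base = b"It is publicly known that dogs are better than bunnies in every regard... #"
m2 = next(
    base + str(i).encode()
    for i in count()
    if weak_hash(base + str(i).encode()) == d1
)

print(verify(m1, s1), verify(m2, s1))
```

With a 16-bit digest the search takes tens of thousands of attempts on average; with a full-strength hash the same loop would be hopeless, which is why the choice of h() matters.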

In order to defend yourself against such attacks, it is vital to choose a strong hash algorithm with high resistance to collisions and second preimages. This means that it becomes very hard for me to find an m2 where h(m2) = h(m1).

Good choices would include SHA256 and SHA512, as well as tons of others. It seems everyone has some favourite non-mainstream hash functions, but SHA256 and SHA512 have very widespread support and it will be hard for you to find a system that does not support these hashes. And since your files are very small, calculating the hash should be almost instant.

For example, on my 800MHz machine, calculating the SHA512 hash of a 16k random file took 3ms, so even on a toaster it should be relatively quick.
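To get a comparable number on your own hardware, something like the following rough timing sketch works. The round count is arbitrary, and the result will vary with the machine and the Python build.

```python
import hashlib
import os
import time

data = os.urandom(16 * 1024)   # 16 KiB of random bytes, similar to the file above
ROUNDS = 1000                  # arbitrary; averages out timer noise

start = time.perf_counter()
for _ in range(ROUNDS):
    digest = hashlib.sha512(data).hexdigest()
elapsed = time.perf_counter() - start

per_hash_ms = elapsed / ROUNDS * 1000
print(f"SHA-512 of 16 KiB: {per_hash_ms:.4f} ms per hash")
```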


[1] You can see the same thing with random number generators. Cryptographic PRNGs aim to deliver random numbers that are really hard to guess, while non-crypto PRNGs aim to just give numbers that look random at first glance and do that fast.
