Why are MD5 and SHA-1 still used for checksums and certificates if they are called broken?

SHA-1 and MD5 are broken in the sense that they are vulnerable to collision attacks: it has become realistic to find two strings that have the same hash (long demonstrated for MD5, and since 2017 for SHA-1 as well).

Collision attacks do not directly affect passwords or file integrity, because those depend on preimage resistance and second-preimage resistance, respectively.
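For concreteness, here is a minimal sketch of the three attack types using Python's standard hashlib; the comments state which attacks are feasible, and no actual colliding inputs are shown:

```python
import hashlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# Collision attack (feasible for MD5, demonstrated for SHA-1):
#   the attacker chooses BOTH inputs: find m1 != m2 such that
#   md5_hex(m1) == md5_hex(m2).
#
# Preimage attack (not feasible for either algorithm):
#   given only a digest h, find any m with md5_hex(m) == h.
#
# Second-preimage attack (not feasible for either algorithm):
#   given a FIXED message m1, find m2 != m1 with
#   md5_hex(m2) == md5_hex(m1).

print(md5_hex(b"hello"))  # 5d41402abc4b2a76b9719d911017c592
```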

However, MD5 and SHA-1 are also computationally cheap, so passwords hashed with them are easier to brute-force than passwords hashed with deliberately slow algorithms. Although their preimage resistance is not specifically broken, moving to a stronger, slower password-hashing algorithm is advisable.
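As an illustration, Python's standard library ships a deliberately slow, salted password hash (PBKDF2); the salt handling and iteration count below are illustrative values, not a recommendation:

```python
import hashlib
import os

password = b"correct horse battery staple"
salt = os.urandom(16)    # a fresh random salt per password
iterations = 600_000     # illustrative work factor; tune for your hardware

# PBKDF2-HMAC-SHA256: hundreds of thousands of iterations make brute force
# far more expensive than a single fast MD5 or SHA-1 invocation.
digest = hashlib.pbkdf2_hmac("sha256", password, salt, iterations)
print(digest.hex())
```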

In the case of certificates, a signature asserts that a certificate with a particular hash is valid for a particular website. But if you can craft a second certificate with that same hash, you can impersonate other websites. For MD5, this has already happened, and browsers will be phasing out SHA-1 soon as a preventative measure.

File integrity checking is often intended to ensure that a file was downloaded correctly. But if it is being used to verify that the file was not maliciously tampered with, you should consider an algorithm that is more resilient to collisions (see also: chosen-prefix attacks).
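For example, here is a short sketch of computing a stronger checksum over a file, reading it in chunks so large files do not need to fit in memory (the file name is hypothetical):

```python
import hashlib

def sha256_of_file(path: str) -> str:
    """Compute the SHA-256 digest of a file without loading it all at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage:
# print(sha256_of_file("ubuntu-22.04.iso"))
```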


For MD5, no one who is both reputable and competent is using it in a context where collision resistance is important. For SHA-1, the break was not practical when it was first published, and only now is it becoming important to phase it out where collision resistance is needed. That phase-out is underway; for instance, long-term TLS certificates signed with SHA-1 no longer work in Chrome, to prod people into moving to SHA-2. However, it's not practically broken yet, so it's acceptable for now.

It wasn't dropped for everything immediately because security involves tradeoffs. You don't drop a major standard, and make everything incompatible with a giant installed base, on the grounds of something that might lead to practical attacks a decade from now. Compatibility matters.

Also, for many uses, MD5 and SHA-1 aren't cracked at all. Both have broken collision resistance, meaning an attacker can create two messages that hash to the same value. Neither is broken against preimage resistance (given a hash, find a message producing that hash), or against second-preimage resistance (given a message, find a different message with the same hash), and their compression functions still hold up as pseudo-random functions. That means constructions like HMAC-MD5 can still be secure, because HMAC doesn't rely on the property of MD5 that's broken. Less than ideal, sure, but see "compatibility matters if it's still secure" above.
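A minimal sketch with Python's standard hmac module; HMAC-MD5 remains unbroken as a MAC because HMAC's security rests on the compression function behaving as a pseudo-random function, not on collision resistance (a new design should still prefer SHA-2). The key and message are hypothetical:

```python
import hashlib
import hmac

key = b"shared-secret-key"            # hypothetical shared secret
message = b"message to authenticate"

# HMAC-MD5: still considered a secure MAC despite MD5's broken
# collision resistance, because HMAC does not rely on that property.
tag = hmac.new(key, message, hashlib.md5).hexdigest()

# Verification should use a constant-time comparison.
assert hmac.compare_digest(tag, hmac.new(key, message, hashlib.md5).hexdigest())
```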

File integrity checking via hashes is almost always pointless anyway; unless the hashes are sent over a more secure channel than the file, you can tamper with the hashes as easily as with the file. However, if the hashes are sent more securely than the file, MD5 and SHA-1 are still capable of protecting file integrity. Because the attacker has no influence over the legitimate files (and there must be zero influence to really be safe), creating a new file with the same hash requires breaking second-preimage resistance, which no one has done for MD5 or SHA-1.
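A sketch of that scenario: the expected digest arrives over a more secure channel than the download itself (say, an HTTPS page), and the downloaded file is checked against it. The file name and digest value here are hypothetical:

```python
import hashlib
import hmac

# Published over a more secure channel than the download (hypothetical value):
expected_md5 = "5d41402abc4b2a76b9719d911017c592"

with open("download.bin", "rb") as f:  # hypothetical file name
    actual_md5 = hashlib.md5(f.read()).hexdigest()

# Forging a different file with this digest would require a second-preimage
# attack, which is not known to be feasible against MD5 or SHA-1.
if hmac.compare_digest(actual_md5, expected_md5):
    print("integrity check passed")
else:
    print("file corrupted or tampered with")
```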

Note the difference between integrity checking and certificates. Certificates are issued by a CA from a user-created CSR; the attacker can have huge influence over the actual certificate contents, so a collision attack allows an attacker to create a legit and a fake certificate that collide, get the legit one issued, and use the signature on the fake one. In contrast, in file integrity the attacker normally has zero control over the legitimate file, and so needs to find a collision with a given file, which is much harder (and which, as far as we know, can't be done with MD5).


MD5 and SHA-1 are fast and may be supported in hardware, in contrast to newer, more secure hashes (though Bitcoin has arguably changed this for SHA-2: its proof-of-work gave rise to mining chips that compute SHA-256 in hardware at enormous rates).
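A quick, informal way to see the speed differences on your own machine; results vary widely by platform, and some builds use hardware-accelerated SHA-2:

```python
import hashlib
import timeit

data = b"x" * (1 << 20)  # 1 MiB buffer

for name in ("md5", "sha1", "sha256", "sha512"):
    hash_fn = getattr(hashlib, name)
    seconds = timeit.timeit(lambda: hash_fn(data).digest(), number=100)
    print(f"{name:>6}: {seconds:.3f}s for 100 x 1 MiB")
```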

MD5 collisions are feasible, and some progress has been made on preimage attacks. There is a publicly known SHA-1 collision for the official full-round algorithm, following earlier attacks that significantly reduced its effective complexity; it may not yet be practical for the casual attacker, but it is well within the realm of possibility, which is why SHA-1 can be called broken.

Nonetheless, "weak" or broken hashes can still be good for uses that do not need cryptographically secure algorithms, but many purposes that were not originally considered to be critical later can turn out to expose a potential attack surface.

Good examples would be finding duplicate files, or use in version control systems like git: in most cases you want good performance with high reliability, but do not need tight security. Giving someone write access to an official git repository already requires you to trust them not to mess around, and duplication checks should additionally compare the contents byte by byte after finding that two files have the same size and hash (as in the sketch below).
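A sketch of that duplicate-finding approach: group by size first, then by a fast hash, and only then confirm byte by byte. MD5 is fine here precisely because the final comparison does not trust it:

```python
import hashlib
import os
from collections import defaultdict
from filecmp import cmp

def find_duplicates(root: str):
    """Yield pairs of paths whose files have identical contents under root."""
    # Pass 1: group by size; files of different sizes cannot be duplicates.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: within each size group, group by a fast (insecure) hash.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            with open(path, "rb") as f:
                by_hash[hashlib.md5(f.read()).hexdigest()].append(path)

    # Pass 3: never trust the hash alone; confirm byte by byte.
    for paths in by_hash.values():
        for i, a in enumerate(paths):
            for b in paths[i + 1:]:
                if cmp(a, b, shallow=False):
                    yield a, b

# Hypothetical usage:
# for a, b in find_duplicates("."):
#     print(a, "==", b)
```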

Failing to back an insecure hash up with facts (e.g. a byte-by-byte comparison) can be a risk: if a service like Dropbox deduplicated with MD5 alone, without proper verification, an attacker could sneak in data with colliding hashes to cause data loss.

git addresses this issue by "trusting the elder", as Linus Himself said:

if you already have a file A in git with hash X is there any condition where a remote file with hash X (but different contents) would overwrite the local version?

Nope. If it has the same SHA1, it means that when we receive the object from the other end, we will not overwrite the object we already have.

So what happens is that if we ever see a collision, the "earlier" object in any particular repository will always end up overriding. But note that "earlier" is obviously per-repository, in the sense that the git object network generates a DAG that is not fully ordered, so while different repositories will agree about what is "earlier" in the case of direct ancestry, if the object came through separate and not directly related branches, two different repos may obviously have gotten the two objects in different order.

However, the "earlier will override" is very much what you want from a security standpoint: remember that the git model is that you should primarily trust only your own repository. So if you do a "git pull", the new incoming objects are by definition less trustworthy than the objects you already have, and as such it would be wrong to allow a new object to replace an old one.

[Original source: https://marc.info/?l=git&m=115678778717621&w=2]
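Here is a minimal sketch of that "earlier object wins" rule in a toy content-addressed store; this is an illustration of the principle, not git's actual code:

```python
import hashlib

class ToyObjectStore:
    """Toy content-addressed store: incoming objects never replace existing ones."""

    def __init__(self):
        self._objects = {}  # digest -> content

    def add(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        # setdefault keeps the object we already have, so an incoming
        # object with a colliding digest is silently ignored.
        self._objects.setdefault(key, content)
        return key

store = ToyObjectStore()
key = store.add(b"original contents")
# Simulate an incoming colliding object by reusing the same digest:
store._objects.setdefault(key, b"attacker contents")
assert store._objects[key] == b"original contents"  # the earlier object wins
```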

And as they say, a disk failure is waaaaayy more likely than encountering an accidental hash collision (by several orders of magnitude: probability of an accidental SHA-1 collision < 10^-40; disk non-recoverable bit error rate ~ 10^-15).
