How likely is a collision using MD5 compared to SHA256 (for checking file integrity)?

There seems to be some confusion about the capabilities of a collision attack.

Two of the properties a cryptographic hash must have are collision resistance and preimage resistance.

If a hash is collision resistant, it is computationally infeasible for an attacker to find any two inputs that produce the same output. If a hash is preimage resistant, it is computationally infeasible for an attacker to find an input that produces a specific, given output. MD5's collision resistance has been broken for many years, but no practical preimage attack on it is known.
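To make the difference between the two properties concrete, here is a toy sketch. It assumes a deliberately weakened hash (SHA-256 truncated to 2 bytes) so that both searches finish quickly; `tiny_hash`, `find_collision`, and `find_preimage` are made-up names for illustration. It only shows the generic brute-force gap between "find any two inputs that collide" and "find an input for a given output". The real collision attacks on MD5 exploit weaknesses in its design and are not brute force.

```python
import hashlib
from itertools import count


def tiny_hash(data: bytes, nbytes: int = 2) -> bytes:
    """Toy hash: SHA-256 truncated to nbytes (weak on purpose)."""
    return hashlib.sha256(data).digest()[:nbytes]


def find_collision(nbytes: int = 2):
    """Find ANY two different inputs with the same toy hash."""
    seen = {}
    for i in count():
        msg = str(i).encode()
        h = tiny_hash(msg, nbytes)
        if h in seen:
            # Return both messages and the number of hashes computed.
            return seen[h], msg, i + 1
        seen[h] = msg


def find_preimage(target: bytes, nbytes: int = 2):
    """Find an input whose toy hash equals ONE specific target."""
    for i in count():
        msg = str(i).encode()
        if tiny_hash(msg, nbytes) == target:
            return msg, i + 1


if __name__ == "__main__":
    a, b, collision_tries = find_collision()
    target = tiny_hash(b"some original file contents")
    m, preimage_tries = find_preimage(target)
    print("collision found after", collision_tries, "hashes:", a, b)
    print("preimage found after", preimage_tries, "hashes:", m)
```

For an n-bit hash, the collision search is expected to need on the order of 2^(n/2) hashes, while the preimage search needs on the order of 2^n, which is why a hash can lose collision resistance and still be hard to invert.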


What does this mean for integrity?

If you trust that the party who originally hashed the data and gave you the hash is not malicious, and that nobody else was able to modify any part of the data before it was hashed (two images, videos, or PDFs that look identical can differ greatly at the byte level), then MD5 should be sufficient to verify integrity, and SHA-256 shouldn't offer much more security (barring any future attacks on MD5's preimage resistance).

If an attacker may have been able to make any modifications to the data before it was hashed (even seemingly benign ones), then SHA-256 is the safer choice: with MD5, the attacker could have prepared a benign file and a malicious file that share the same hash.


Are these integrity checks useful?

In many cases, not really. If you're downloading the file over HTTPS from the same website that provides the hash value, then you're already benefiting from the message authentication TLS applies to the connection, so a man-in-the-middle (MitM) cannot change the file in transit. And if someone is able to maliciously modify the file on the site, they can also modify the hash.

One case where it does make sense to verify an MD5 or SHA-256 hash for a file is if you download the file from a mirror and check the hash against one provided by the original trusted site.
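As a minimal sketch of that mirror scenario, the following assumes you have already downloaded the file and copied the hex digest from the original trusted site. The function names and the example path are hypothetical; only `hashlib` and `hmac` from the Python standard library are used.

```python
import hashlib
import hmac


def file_digest_hex(path: str, algorithm: str = "sha256",
                    chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published_hash(path: str, published_hex: str,
                           algorithm: str = "sha256") -> bool:
    """Compare the local file's digest with the one the trusted site published."""
    actual = file_digest_hex(path, algorithm)
    expected = published_hex.strip().lower()
    return hmac.compare_digest(actual, expected)
```

Usage would look something like `matches_published_hash("download.iso", "<hex digest copied from the trusted site>")`. Passing `algorithm="md5"` works the same way where the site only publishes MD5 sums, with the caveats discussed above.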


I wonder how much safer it is to use SHA256 hashes for integrity checks?

Note: Consider the file content as random input (no attacks)

Based on your note of "no attacks" it seems to me that you are asking:

"What is the probability that a random change (e.g., bit flip during download) to a file will result in creating a new/different file with the same checksum as the original file?"

For the case of MD5, this probability is: 1/(2^128) = 2.94e-39 = 0.00000000000000000000000000000000000000294

For the case of SHA256, this probability is: 1/(2^256) = 8.64e-78 = 0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000864
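The arithmetic behind those two figures can be reproduced with a short sketch. It assumes the hash output behaves like a uniformly random value, so a randomly corrupted file matches one specific n-bit digest with probability 1/2^n. The function name is made up for illustration.

```python
from fractions import Fraction


def accidental_match_probability(digest_bits: int) -> Fraction:
    """Chance that a random, different file has one specific digest,
    assuming the hash output is uniformly distributed (an idealization)."""
    return Fraction(1, 2 ** digest_bits)


for name, bits in [("MD5", 128), ("SHA-256", 256)]:
    p = accidental_match_probability(bits)
    print(f"{name}: 1/2^{bits} = {float(p):.2e}")
# prints:
# MD5: 1/2^128 = 2.94e-39
# SHA-256: 1/2^256 = 8.64e-78
```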


Important Caveat: In the above-mentioned hypothetical case of random changes, both MD5 and SHA256 are fine choices. However, in real life, the MD5 hash function is frowned upon because it has been broken (collisions have been found). So, the real-life advice is: use SHA256, not MD5, for file integrity.


Update based on comments: I'm referring to MD5 as "broken" to mean (basically) that collisions have been found. One of the main conjectured properties of MD5 was "that it is computationally infeasible to produce two messages having the same message digest..." (RFC 1321). Because it is possible to violate this property, I've called MD5 "broken," which is perhaps a little harsh. I still see MD5 used all the time, and I still use it myself all the time. It is fine to use MD5 in certain circumstances, especially when there is no other option.


MD5 collision vulnerabilities exist, and it's feasible to intentionally generate two files with identical MD5 sums.

No SHA256 collisions are known, and unless a serious weakness exists in the algorithm, it's extremely unlikely one will be found.

For verifying a file was not accidentally corrupted, MD5 is probably sufficient. If it's possible it was intentionally altered, MD5 isn't safe and you should stick with SHA256.