How would Git handle a SHA-1 collision on a blob?

Original answer (2012) (see shattered.io 2017 SHA1 collision below)

That old (2006) answer from Linus might still be relevant:

Nope. If it has the same SHA1, it means that when we receive the object from the other end, we will not overwrite the object we already have.

So what happens is that if we ever see a collision, the "earlier" object in any particular repository will always end up overriding. But note that "earlier" is obviously per-repository, in the sense that the git object network generates a DAG that is not fully ordered, so while different repositories will agree about what is "earlier" in the case of direct ancestry, if the object came through separate and not directly related branches, two different repos may obviously have gotten the two objects in different order.

However, the "earlier will override" is very much what you want from a security standpoint: remember that the git model is that you should primarily trust only your own repository.
So if you do a "git pull", the new incoming objects are by definition less trustworthy than the objects you already have, and as such it would be wrong to allow a new object to replace an old one.

So you have two cases of collision:

the inadvertent kind, where you somehow are very very unlucky, and two files end up having the same SHA1.
At that point, what happens is that when you commit that file (or do a "git-update-index" to move it into the index, but not committed yet), the SHA1 of the new contents will be computed, but since it matches an old object, a new object won't be created, and the commit-or-index ends up pointing to the old object.
You won't notice immediately (since the index will match the old object SHA1, and that means that something like "git diff" will use the checked-out copy), but if you ever do a tree-level diff (or you do a clone or pull, or force a checkout) you'll suddenly notice that that file has changed to something completely different than what you expected.
So you would generally notice this kind of collision fairly quickly.
In related news, the question is what to do about the inadvertent collision..
First off, let me remind people that the inadvertent kind of collision is really really really damn unlikely, so we'll quite likely never ever see it in the full history of the universe.
But if it happens, it's not the end of the world: what you'd most likely have to do is just change the file that collided slightly, and just force a new commit with the changed contents (add a comment saying "/* This line added to avoid collision */") and then teach git about the magic SHA1 that has been shown to be dangerous.
So over a couple of million years, maybe we'll have to add one or two "poisoned" SHA1 values to git. It's very unlikely to be a maintenance problem ;)

The attacker kind of collision because somebody broke (or brute-forced) SHA1.
This one is clearly a lot more likely than the inadvertent kind, but by definition it's always a "remote" repository. If the attacker had access to the local repository, he'd have much easier ways to screw you up.
So in this case, the collision is entirely a non-issue: you'll get a "bad" repository that is different from what the attacker intended, but since you'll never actually use his colliding object, it's literally no different from the attacker just not having found a collision at all, but just using the object you already had (ie it's 100% equivalent to the "trivial" collision of the identical file generating the same SHA1).
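The "earlier object wins" rule described above can be illustrated with a toy content-addressed store. This is only a sketch, not Git's code: `toy_id`, `ToyObjectStore` and `find_collision` are hypothetical names, and the object ID is deliberately cut down to 4 bits so that a collision can be found instantly.

```python
import hashlib


def toy_id(data: bytes) -> str:
    """Deliberately weak 4-bit object ID (assumption: first hex digit of SHA-1)."""
    return hashlib.sha1(data).hexdigest()[0]


class ToyObjectStore:
    """Content-addressed store that never overwrites an existing object."""

    def __init__(self):
        self.objects = {}

    def write(self, data: bytes) -> str:
        oid = toy_id(data)
        # An ID that is already present is assumed to name the same content,
        # so the incoming bytes are dropped and the earlier object is kept.
        self.objects.setdefault(oid, data)
        return oid

    def read(self, oid: str) -> bytes:
        return self.objects[oid]


def find_collision(first: bytes) -> bytes:
    """Search for different content that shares the toy ID of `first`."""
    counter = 0
    while True:
        candidate = b"attacker payload %d" % counter
        if candidate != first and toy_id(candidate) == toy_id(first):
            return candidate
        counter += 1


store = ToyObjectStore()
trusted = b"int main(void) { return 0; }\n"
oid = store.write(trusted)

forged = find_collision(trusted)
assert store.write(forged) == oid   # same ID, different content
assert store.read(oid) == trusted   # the earlier object is still the one served
```

The forged object is silently discarded, which matches the point of the quote: the local repository keeps serving the content it already trusted.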

The question of using SHA-256 is regularly mentioned, but has not been acted upon for now (2012).
Note: starting in 2018 with Git 2.19, the code is being refactored to use SHA-256.

Note (Humor): you can force a commit to a particular SHA1 prefix, with the project gitbrute from Brad Fitzpatrick (bradfitz).

gitbrute brute-forces a pair of author+committer timestamps such that the resulting git commit has your desired prefix.

Example: https://github.com/bradfitz/deadbeef
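The idea of brute-forcing timestamps can be sketched in a few lines. This is not gitbrute's implementation, only an illustration of the principle: the commit body, the identity and the search order are assumptions, and the tree ID used is the well-known ID of the empty tree. A 3-digit prefix keeps the search fast.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """SHA-1 over '<kind> <size>' + NUL + body, the way Git names objects."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(author_time: int, committer_time: int) -> bytes:
    # Hypothetical commit: made-up identity, empty tree, no parent.
    who = "Example Dev <dev@example.com>"
    return (
        "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        f"author {who} {author_time} +0000\n"
        f"committer {who} {committer_time} +0000\n"
        "\n"
        "Brute-forced commit\n"
    ).encode()


def brute_force(prefix: str, start: int, max_offset: int = 400):
    """Try (author, committer) timestamp pairs until the commit ID has `prefix`."""
    for a in range(max_offset):
        for c in range(a, max_offset):  # keep committer time >= author time
            oid = object_id("commit", commit_body(start + a, start + c))
            if oid.startswith(prefix):
                return start + a, start + c, oid
    return None  # search window exhausted without a match


found = brute_force("bad", start=1_700_000_000)
print(found)
```

Each extra hex digit in the prefix multiplies the expected work by 16, which is why this works for short vanity prefixes but says nothing about full 160-bit collisions.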

Daniel Dinnyes points, in the comments, to "7.1 Git Tools - Revision Selection" (Pro Git), which includes:

A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.

Even the more recent (February 2017) shattered.io attack, which demonstrated the possibility of forging a SHA1 collision:
(see much more in my separate answer, including Linus Torvalds' Google+ post)

a/ still requires over 9,223,372,036,854,775,808 (2^63) SHA1 computations. This took the equivalent processing power of 6,500 years of single-CPU computations and 110 years of single-GPU computations.
b/ would forge one file (with the same SHA1), but with the additional constraint that its content and size together would have to produce the identical SHA1 (a collision on the content alone is not enough; see "How is the git hash calculated?"): a blob SHA1 is computed based on the content and size.
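To make point b/ concrete, here is a minimal sketch of how a blob ID is derived: Git hashes a header containing the object type and the content length, followed by a NUL byte and the content itself. The function name `git_blob_id` is just an illustrative choice.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 of 'blob <size>' + NUL + content, as computed by `git hash-object`."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known ID in every SHA-1 repository.
assert git_blob_id(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"

# Hashing the raw content alone gives a different value, because the header is missing.
assert hashlib.sha1(b"").hexdigest() != git_blob_id(b"")
```

Because the length is part of the hashed input, two raw files that collide under plain SHA-1 do not automatically collide as Git blobs; the colliding pair has to be constructed with the header in place.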

See "Lifetimes of cryptographic hash functions" from Valerie Anita Aurora for more.
In that page, she notes:

Google spent 6500 CPU years and 110 GPU years to convince everyone we need to stop using SHA-1 for security critical applications.
Also because it was cool

See more in my separate answer below.

I did an experiment to find out exactly how Git would behave in this case. This is with version 2.7.0~rc0+next.20151210 (Debian version). I basically just reduced the hash size from 160 bits to 4 bits by applying the following diff and rebuilding git:

--- git-2.7.0~rc0+next.20151210.orig/block-sha1/sha1.c
+++ git-2.7.0~rc0+next.20151210/block-sha1/sha1.c
@@ -246,6 +246,8 @@ void blk_SHA1_Final(unsigned char hashou
    blk_SHA1_Update(ctx, padlen, 8);

    /* Output hash */
-   for (i = 0; i < 5; i++)
-       put_be32(hashout + i * 4, ctx->H[i]);
+   for (i = 0; i < 1; i++)
+       put_be32(hashout + i * 4, (ctx->H[i] & 0xf000000));
+   for (i = 1; i < 5; i++)
+       put_be32(hashout + i * 4, 0);
 }
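The effect of that patch can be approximated outside of Git. The sketch below is an assumption-laden imitation, not the patched binary: it keeps only the masked bits of the first 32-bit word of a regular SHA-1 digest and zeroes the other four words, using the same mask literal as the diff. `patched_sha1` is a hypothetical name.

```python
import hashlib

MASK = 0x0F000000  # same value as the 0xf000000 literal in the patch


def patched_sha1(data: bytes) -> str:
    digest = hashlib.sha1(data).digest()
    # First big-endian 32-bit word (H[0] in the C code), masked down to 4 bits.
    h0 = int.from_bytes(digest[:4], "big") & MASK
    # Words 1..4 are forced to zero, as in the second loop of the patch.
    return (h0.to_bytes(4, "big") + bytes(16)).hex()


ids = {patched_sha1(b"file %d\n" % i) for i in range(1000)}
print(sorted(ids))

# Only 4 bits survive, so there can never be more than 16 distinct object IDs.
assert len(ids) <= 16
```

With at most 16 possible IDs, any 17 distinct objects are guaranteed to contain a collision, which is what makes the experiment practical.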

Then I did a few commits and noticed the following.

1. If a blob already exists with the same hash, you will not get any warnings at all. Everything seems to be ok, but when you push, someone clones, or you revert, you will lose the latest version (in line with what is explained above).
2. If a tree object already exists and you make a blob with the same hash: everything will seem normal, until you either try to push or someone clones your repository. Then you will see that the repo is corrupt.
3. If a commit object already exists and you make a blob with the same hash: same as #2 (corrupt).
4. If a blob already exists and you make a commit object with the same hash, it will fail when updating the "ref".
5. If a blob already exists and you make a tree object with the same hash, it will fail when creating the commit.
6. If a tree object already exists and you make a commit object with the same hash, it will fail when updating the "ref".
7. If a tree object already exists and you make a tree object with the same hash, everything will seem ok. But when you commit, all of the repository will reference the wrong tree.
8. If a commit object already exists and you make a commit object with the same hash, everything will seem ok. But when you commit, the commit will never be created, and the HEAD pointer will be moved to an old commit.
9. If a commit object already exists and you make a tree object with the same hash, it will fail when creating the commit.

For #2 you will typically get an error like this when you run "git push":

error: object 0400000000000000000000000000000000000000 is a tree, not a blob
fatal: bad blob object
error: failed to push some refs to origin

or:

error: unable to read sha1 file of file.txt (0400000000000000000000000000000000000000)

if you delete the file and then run "git checkout file.txt".

For #4 and #6, you will typically get an error like this:

error: Trying to write non-commit object f000000000000000000000000000000000000000 to branch refs/heads/master
fatal: cannot update HEAD ref

when running "git commit". In this case you can typically just type "git commit" again, since this will create a new hash (because of the changed timestamp).

For #5 and #9, you will typically get an error like this:

fatal: 1000000000000000000000000000000000000000 is not a valid 'tree' object

when running "git commit"

If someone tries to clone your corrupt repository, they will typically see something like:

git clone (one repo with collided blob,
d000000000000000000000000000000000000000 is commit,
f000000000000000000000000000000000000000 is tree)

Cloning into 'clonedversion'...
done.
error: unable to read sha1 file of s (d000000000000000000000000000000000000000)
error: unable to read sha1 file of tullebukk (f000000000000000000000000000000000000000)
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

What "worries" me is that in two cases (2, 3) the repository becomes corrupt without any warning, and in three cases (1, 7, 8) everything seems ok, but the repository content is different from what you expect it to be. People cloning or pulling will get different content than what you have. Cases 4, 5, 6 and 9 are ok, since Git stops with an error. I suppose it would be better if it at least failed with an error in all cases.
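The "fail with an error in all cases" behaviour wished for here could look roughly like the following. This is a hypothetical sketch, not what the Git version in the experiment does: `CheckedObjectStore`, `CollisionError` and `weak_id` are invented names, and the 4-bit ID only exists to make collisions easy to trigger.

```python
import hashlib


class CollisionError(Exception):
    """Raised when two different contents map to the same object ID."""


def weak_id(data: bytes) -> str:
    # Assumption for the demo: a 4-bit ID (first hex digit of SHA-1).
    return hashlib.sha1(data).hexdigest()[0]


class CheckedObjectStore:
    def __init__(self, id_func):
        self.id_func = id_func
        self.objects = {}

    def write(self, data: bytes) -> str:
        oid = self.id_func(data)
        existing = self.objects.get(oid)
        if existing is None:
            self.objects[oid] = data       # first object with this ID
        elif existing != data:
            # Same ID but different bytes: refuse instead of silently reusing.
            raise CollisionError(f"object {oid} already exists with other content")
        return oid


store = CheckedObjectStore(weak_id)
first = store.write(b"version 1\n")
assert store.write(b"version 1\n") == first   # identical content is a harmless no-op

collided = False
for i in range(2, 1000):
    try:
        store.write(b"version %d\n" % i)
    except CollisionError:
        collided = True
        break
assert collided
```

The cost of this design is one content comparison whenever an ID already exists, which only happens for re-added identical content or for a genuine collision.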

According to Pro Git:

If you do happen to commit an object that hashes to the same SHA-1 value as a previous object in your repository, Git will see the previous object already in your Git database and assume it was already written. If you try to check out that object again at some point, you’ll always get the data of the first object.

So it wouldn't fail, but it wouldn't save your new object either.
I don't know how that would look on the command line, but that would certainly be confusing.

A bit further down, that same reference attempts to illustrate the likelihood of such a collision:

Here’s an example to give you an idea of what it would take to get a SHA-1 collision. If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (1 million Git objects) and pushing it into one enormous Git repository, it would take 5 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision. A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.
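The scenario in that quote can be sanity-checked with the usual birthday approximation, p ≈ 1 - exp(-n² / 2^(b+1)) for n random objects and a b-bit hash. The sketch below assumes an ideal 160-bit hash and takes the figures (6.5 billion people, 1 million objects per second each, 5 years) straight from the quote; the function names are illustrative.

```python
import math


def collision_probability(n_objects: float, hash_bits: int = 160) -> float:
    """Birthday approximation of the chance of at least one collision."""
    return -math.expm1(-(n_objects ** 2) / 2.0 ** (hash_bits + 1))


def objects_for_probability(p: float, hash_bits: int = 160) -> float:
    """Number of random objects needed to reach collision probability p."""
    return math.sqrt(-(2.0 ** (hash_bits + 1)) * math.log1p(-p))


people = 6.5e9
objects_per_second = 1e6                 # per person, as in the quote
seconds = 5 * 365.25 * 24 * 3600         # five years

n = people * objects_per_second * seconds
print(f"objects after 5 years: {n:.3e}")
print(f"collision probability: {collision_probability(n):.2f}")
print(f"objects for 50%:       {objects_for_probability(0.5):.3e}")
```

Under these assumptions the result lands in the same ballpark as the quoted 50% figure, that is, around 10^24 objects are needed before an accidental collision becomes likely.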
