How does Git match blobs to files across commit trees?

That's actually quite a good question.

The internal storage form of a commit is partly relevant, so let's consider it for a moment. An individual commit is actually pretty small. Here is one from the Git repository for Git, namely commit b5101f929789889c2e536d915698f58d5c5c6b7a:

$ git cat-file -p b5101f929789889c2e536d915698f58d5c5c6b7a | sed 's/@/ /'
tree 3f109f9d1abd310a06dc7409176a4380f16aa5f2
parent a562a119833b7202d5c9b9069d1abb40c1f9b59a
author Junio C Hamano <gitster pobox.com> 1548795295 -0800
committer Junio C Hamano <gitster pobox.com> 1548795295 -0800

Fourth batch after 2.20

Signed-off-by: Junio C Hamano <gitster pobox.com>

(The sed 's/@/ /' is just to maybe, possibly, cut down on the amount of email spam that Junio Hamano must get :-) .) As you can see here, the commit object refers to its parent commit object by the other commit's hash ID, a562a11983.... It also refers to a tree object by hash ID, and the tree object's hash ID begins with 3f109f9d1a. We can look at this tree object using git cat-file -p too:

$ git cat-file -p 3f109f9d1a | head
100644 blob de1c8b5c77f7566d9e41949e5e397db3cc1b487c    .clang-format
100644 blob 42cdc4bbfb05934bb9c3ed2fe0e0d45212c32d7a    .editorconfig
100644 blob 9fa72ad4503031528e24e7c69f24ca92bcc99914    .gitattributes
040000 tree 7ba15927519648dbc42b15e61739cbf5aeebf48b    .github
100644 blob 0d77ea5894274c43c4b348c8b52b8e665a1a339e    .gitignore
100644 blob cbeebdab7a5e2c6afec338c3534930f569c90f63    .gitmodules
100644 blob 247a3deb7e1418f0fdcfd9719cb7f609775d2804    .mailmap
100644 blob 03c8e4c613015476fffe3f1e071c0c9d6609df0e    .travis.yml
100644 blob 8c85014a0a936892f6832c68e3db646b6f9d2ea2    .tsan-suppressions
100644 blob 536e55524db72bd2acf175208aef4f3dfc148d42    COPYING

(the tree has quite a lot of data so I've copied only the first ten lines here).

Inside the tree, you see the mode (100644), type (blob—this is implied by the mode and is also recorded in the internal Git object; it's not actually stored in the tree object), hash ID (de1c8b5c77f...), and name (.clang-format) of a blob. You can also see that the tree can refer to additional tree objects, as is the case for the .github sub-tree.
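To make the four fields concrete, here is a minimal sketch that splits the pretty-printed listing above into (mode, type, hash, name) records. The names TreeEntry and parse_tree_listing are made up for this illustration; this parses the text that git cat-file -p prints, not Git's internal binary tree format.

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    """One line of `git cat-file -p <tree>` output (hypothetical helper type)."""
    mode: str
    type: str
    hash: str
    name: str


def parse_tree_listing(text: str) -> list[TreeEntry]:
    """Split each non-empty line into mode, type, hash and name."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # maxsplit=3 keeps any spaces inside the file name intact
        mode, obj_type, obj_hash, name = line.split(None, 3)
        entries.append(TreeEntry(mode, obj_type, obj_hash, name))
    return entries


listing = """\
100644 blob de1c8b5c77f7566d9e41949e5e397db3cc1b487c    .clang-format
040000 tree 7ba15927519648dbc42b15e61739cbf5aeebf48b    .github
"""
entries = parse_tree_listing(listing)
for entry in entries:
    print(entry.name, "is a", entry.type)
```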

If we take this particular blob object hash ID, we can view that object's contents by hash ID too:

$ git cat-file -p de1c8b5c77f | head
# This file is an example configuration for clang-format 5.0.
#
# Note that this style definition should only be understood as a hint
# for writing new code. The rules are still work-in-progress and does
# not yet exactly match the style we have in the existing code.

# Use tabs whenever we need to fill whitespace that spans at least from one tab
# stop to the next one.
#
# These settings are mirrored in .editorconfig.  Keep them in sync.

(again I've cut off the copy at 10 lines as the file is quite long).
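The hash ID is not an arbitrary label: it is computed from the blob's contents. As a sketch, assuming a repository that uses SHA-1 object names (as the 40-character IDs shown here do), a blob's ID is the SHA-1 of a short header followed by the file data. The function name git_blob_id is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Hash ID Git would assign to a blob with these contents (SHA-1 repos)."""
    # The header is the object type, a space, the size in bytes, and a NUL byte.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))         # the well-known ID of the empty blob
print(git_blob_id(b"hello\n"))  # same as: echo hello | git hash-object --stdin
```

Two files with byte-for-byte identical contents therefore get the same blob hash, no matter what they are named or which commit they appear in.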

Just for illustration let's look at the .github sub-tree too:

$ git cat-file -p 7ba15927519648dbc42b15e61739cbf5aeebf48b
100644 blob 64e605a02b71c51e9f59c429b28961c3152039b9    CONTRIBUTING.md
100644 blob adba13e5baf4603de72341068532e2c7d7d05f75    PULL_REQUEST_TEMPLATE.md

What Git does with these, then, is to read—recursively as needed—the tree object from a commit. Git will read these into a data structure it calls an index or cache. (The in-memory version of this is, technically speaking, the cache data structure, although Git documentation tends to be a bit loose about which names to use when.) So the cache built by reading commit b5101f929789889c2e536d915698f58d5c5c6b7a will say, for instance, that name .clang-format has mode 100644 and blob-hash de1c8b5c77f7566d9e41949e5e397db3cc1b487c, while name .github/CONTRIBUTING.md has mode 100644 and blob-hash 64e605a02b71c51e9f59c429b28961c3152039b9.

Note that the various name components (.github plus CONTRIBUTING.md) have, in effect, been joined-up in the in-memory cache. (In the on-disk format they're compressed via algorithmic trickery.)
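The joining-up of names can be sketched as a small recursive walk. This is only an illustration, assuming trees are modeled as nested Python dicts; flatten_tree and commit_tree are made-up names and Git's real cache code looks nothing like this.

```python
def flatten_tree(tree: dict, prefix: str = "") -> dict[str, tuple[str, str]]:
    """Turn nested trees into a flat {full_path: (mode, blob_hash)} mapping."""
    cache = {}
    for name, entry in tree.items():
        path = prefix + name
        if isinstance(entry, dict):
            # A sub-tree: recurse, joining the directory name onto each child.
            cache.update(flatten_tree(entry, path + "/"))
        else:
            # A blob: entry is already a (mode, blob_hash) pair.
            cache[path] = entry
    return cache


# A tiny excerpt of the tree shown above, modeled as nested dicts.
commit_tree = {
    ".clang-format": ("100644", "de1c8b5c77f7566d9e41949e5e397db3cc1b487c"),
    ".github": {
        "CONTRIBUTING.md": ("100644", "64e605a02b71c51e9f59c429b28961c3152039b9"),
        "PULL_REQUEST_TEMPLATE.md": ("100644", "adba13e5baf4603de72341068532e2c7d7d05f75"),
    },
}
cache = flatten_tree(commit_tree)
for path, (mode, blob_hash) in cache.items():
    print(mode, blob_hash, path)
```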

The in-memory cache that helps Git match up file names

In the end, then, it's the internal (in-memory) cache that holds the <file-name, file-mode, blob-hash> tuples. If you ask Git to compare commit b5101f929789889c2e536d915698f58d5c5c6b7a to some other commit, Git reads the other commit into an in-memory cache as well. That other cache either has an entry named .github/CONTRIBUTING.md, or it doesn't.

If both commits have files with the same name, Git assumes that these are the same file, at least for the purpose of the one comparison it is doing right now (but see below for the exceptions). That's true whether the blob hashes are the same or not.

The real question we're answering here has to do with identity. The identity of a file, in a version control system, determines whether that file is "the same" file in two different versions (however the version control system itself defines versions). This relates to the fundamental philosophical question of identity, as outlined in the Wikipedia article on the Ship of Theseus thought experiment: how do we know that something, or even someone, is who or what we think they are? If you met your cousin Bob when you and he were both very young, and you meet someone again who is named Bob, is he your cousin? You and he were tiny then; now you're larger and older, with different experiences. In the real world we seek cues from our environment: is Bob the child of people who are siblings of your parents? If so, that Bob probably is the same cousin Bob you met long ago, even if he (and you) look very different now.

Git, of course, doesn't do any of this. In most cases the simple fact that both files are named .github/CONTRIBUTING.md suffices to identify them as "the same file". The names are the same, so we're done.
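The name-only pairing described in this section can be sketched as a comparison of two flat path-to-entry mappings. This is a simplified illustration, not Git's diff code; compare_caches is a made-up name and the short hash strings are placeholders rather than real object IDs.

```python
def compare_caches(left: dict, right: dict) -> dict[str, str]:
    """Pair entries purely by path name and classify each difference."""
    changes = {}
    for path in sorted(left.keys() | right.keys()):
        if path not in right:
            changes[path] = "deleted"
        elif path not in left:
            changes[path] = "added"
        elif left[path] != right[path]:
            # Same name, different mode or blob hash: still "the same file".
            changes[path] = "modified"
    return changes


left = {
    "README": ("100644", "aaa"),
    "Makefile": ("100644", "ccc"),
    "a/b.c": ("100644", "bbb"),
}
right = {
    "README": ("100644", "aaa"),
    "Makefile": ("100644", "ddd"),
    "d/e.f": ("100644", "bbb"),
}
changes = compare_caches(left, right)
print(changes)
```

Note that a/b.c and d/e.f have the same placeholder hash here, yet a purely name-based comparison reports them as an unrelated delete and add.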

git diff offers extra services

In our everyday development, we sometimes have occasion to rename a file. A file named a/b.c might be renamed to d/e.f or d/e.c for some reason.

Suppose we're on commit a123456 and the file is named a/b.c. Then we move to commit f789abc. That second commit has no a/b.c but does have a d/e.f. Git will simply remove a/b.c from our index (the on-disk form of the cache) and work-tree, and populate a new d/e.f into our index and work-tree, and all is well.

But suppose we ask Git to compare a123456 with f789abc. Git could just tell us: To change a123456 to f789abc, remove a/b.c and create a new d/e.f with these contents. That is what git checkout did and it suffices. But what if the contents exactly match? It's much more efficient for Git to tell us: To change a123456 to f789abc, rename a/b.c to d/e.f. And in fact, with the right options, git diff will do just that:

git diff --find-renames a123456 f789abc

How did Git manage this trick? The answer lies in computing file identity.

Finding file identity

Suppose that commit L (for left-side) has some file (a/b.c) that isn't in commit R (for right-side). Suppose further that commit R has some file (d/e.f) that isn't in commit L. Instead of immediately just telling us: you should remove the L file and use the R file, Git can now compare the contents of the two files.

Because of the nature of Git object hashes—they are completely deterministic, based on file contents—it's really easy for Git to detect that a/b.c in L is 100% identical to d/e.f in R. In this particular case, they will have exactly the same hash ID! So Git does that: if there's some file that's vanished from L and some other file that has appeared in R, and Git has been asked to find renames, Git checks for hash-ID matches. If it finds some, it pairs up those files (and takes them out of the queue of unmatched files—this queue, holding files from L and R, is the "rename detection queue").
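The exact-match step can be sketched like this, assuming we already have the blob hashes of the files that exist only in L and only in R. The function name and the short hash strings are made up for the illustration; Git's real implementation differs in its details.

```python
def pair_exact_renames(deleted: dict[str, str], added: dict[str, str]):
    """Pair files whose blob hashes match exactly.

    deleted maps path -> blob hash for files only in L,
    added maps path -> blob hash for files only in R.
    Returns (renames, still_deleted, still_added).
    """
    paths_by_hash = {}
    for path, blob_hash in added.items():
        paths_by_hash.setdefault(blob_hash, []).append(path)

    renames = {}
    for old_path, blob_hash in deleted.items():
        candidates = paths_by_hash.get(blob_hash)
        if candidates:
            # Identical contents: treat as a rename and use up this candidate.
            renames[old_path] = candidates.pop(0)

    paired_new = set(renames.values())
    still_deleted = {p: h for p, h in deleted.items() if p not in renames}
    still_added = {p: h for p, h in added.items() if p not in paired_new}
    return renames, still_deleted, still_added


renames, still_deleted, still_added = pair_exact_renames(
    {"a/b.c": "1111", "old.txt": "2222"},
    {"d/e.f": "1111", "new.txt": "3333"},
)
print(renames)
```

Whatever is left in still_deleted and still_added stays in the queue for the more expensive content comparison.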

Those files with differing names have been identified as the same file. Little cousin Bob is the same as big cousin Bob after all, except that for this exact-match step to work, Bob must not have changed at all: the contents have to be identical.

So, if this rename-detection hasn't yet paired a file in L with one in R, Git will try harder. Now it will extract the actual blobs, and compute a sort of "percentage of match". This uses a complicated little algorithm I won't describe here, but if enough sub-strings within the two files match, Git will declare the files to be 50, 60, 75, or more percent similar.

Having found one pair of files in the rename queue that are, say, 72% similar to each other, Git goes on to compare the files to all the other files as well. If it finds that one of those two is 94% similar to another, that similarity-pairing beats the 72% similarity-pairing. If not, the 72% similarity is sufficient—it's at least 50%—so Git will pair up those two files and declare that they have the same identity.

In any case, if the match is good enough and is the best one among all the unpaired files, that particular match is taken. Once again, little cousin Bob is the same as big cousin Bob after all.

After running this test on all unmatched file pairs, git diff takes the matched-up results and calls those files renamed. Again, this only happens if you use --find-renames (or -M), and you can set the threshold to something other than 50% if you like.
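The best-match pairing can be sketched as follows. Note the loud assumption: this uses Python's difflib.SequenceMatcher as a stand-in similarity measure, which is not the algorithm Git uses, and pair_by_similarity is a made-up name. Only the overall shape (score every candidate pair, keep those at or above the threshold, take the best matches first) follows the description above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; NOT Git's actual algorithm."""
    return SequenceMatcher(None, a, b).ratio()


def pair_by_similarity(deleted: dict[str, str], added: dict[str, str],
                       threshold: float = 0.5) -> dict[str, str]:
    """deleted/added map path -> file contents; returns old_path -> new_path."""
    scored = []
    for old_path, old_text in deleted.items():
        for new_path, new_text in added.items():
            score = similarity(old_text, new_text)
            if score >= threshold:
                scored.append((score, old_path, new_path))

    renames = {}
    used_new = set()
    # Best scores first, so a 94% match beats a 72% match for the same file.
    for score, old_path, new_path in sorted(scored, reverse=True):
        if old_path in renames or new_path in used_new:
            continue
        renames[old_path] = new_path
        used_new.add(new_path)
    return renames


deleted = {"a/b.c": "int main(void)\n{\n    return 0;\n}\n"}
added = {
    "d/e.c": "int main(void)\n{\n    return 1;\n}\n",
    "notes.txt": "shopping list: eggs, milk\n",
}
print(pair_by_similarity(deleted, added))
```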

Breaking incorrect matches

The git diff command offers another service. Note that we started out by assuming that if commits L and R had files with the same name, those files were the same file, even if the contents differ. But what if they're not? What if a file named file in L got renamed to bettername in R, and someone then created a new, unrelated file, also named file, in R?

To handle this, git diff offers the -B (or "break pairing") option. With -B in effect, files that started out identified by name will have their pairing broken if they are too dissimilar. That is, Git will check whether the two blob hashes match, and if not, Git will compute a similarity index. If the index falls below some threshold, Git will break the pairing and put both files into the rename detection queue, before running the --find-renames style rename detector.

As a special twist, Git will re-pair broken pairings unless they are so extremely dissimilar that you don't want that to be done. Hence for -B you actually specify two similarity thresholds: the first number is when to tentatively break the pairing, and the second is when to permanently break it.
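The tentative-break step can be sketched as below. This is only a rough model under stated assumptions: the similarity measure is Python's difflib (not Git's), the 0.5 threshold is an arbitrary illustration value rather than Git's default, the re-pairing step is not modeled, and break_dissimilar_pairs is a made-up name.

```python
from difflib import SequenceMatcher


def break_dissimilar_pairs(left: dict[str, str], right: dict[str, str],
                           break_threshold: float = 0.5):
    """left/right map path -> contents. Returns (kept, broken) path lists."""
    kept, broken = [], []
    for path in sorted(left.keys() & right.keys()):
        if left[path] == right[path]:
            # Identical contents (same blob hash): nothing to check.
            kept.append(path)
            continue
        score = SequenceMatcher(None, left[path], right[path]).ratio()
        if score >= break_threshold:
            kept.append(path)
        else:
            # Too dissimilar: both sides go into the rename detection queue.
            broken.append(path)
    return kept, broken


left = {"file": "alpha beta gamma delta\n", "README": "hello world\n"}
right = {"file": "0123456789\n", "README": "hello, world!\n"}
kept, broken = break_dissimilar_pairs(left, right)
print("kept:", kept, "broken:", broken)
```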

git merge uses git diff --find-renames

When you use git merge to perform a three-way merge, there are three inputs:

  • a merge base commit, which is an ancestor of both tip commits; and
  • a left and right commit, --ours and --theirs.

Git runs two git diff commands internally. One compares the base to L and the other compares the base to R.

Both of these diffs run with --find-renames enabled. If the diff from base to L finds a rename, Git knows to use the changes shown across that rename. Likewise, if the diff from base to R finds a rename, Git knows to use those changes. It will combine both sets of changes—and attempt (but usually fail) to combine both renames, if both diffs show a rename.
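How the two rename results combine for a single file can be sketched like this. It is a deliberately tiny model: merged_name is a made-up name, each rename map goes from the base name to the new name on that side, and real git merge handles many more cases (rename/delete, directory renames, and so on).

```python
def merged_name(path: str, renames_left: dict[str, str],
                renames_right: dict[str, str]) -> str:
    """Final name for base-commit `path`, given each side's detected renames."""
    left_name = renames_left.get(path, path)
    right_name = renames_right.get(path, path)
    if left_name == right_name:
        return left_name          # neither renamed, or both agree
    if left_name == path:
        return right_name         # only the right side renamed it
    if right_name == path:
        return left_name          # only the left side renamed it
    # Both sides renamed the file, to different names: a conflict.
    raise ValueError(f"rename/rename conflict: {path} -> "
                     f"{left_name} vs {right_name}")


print(merged_name("a/b.c", {"a/b.c": "d/e.f"}, {}))
```

With a rename on only one side, the other side's content changes to a/b.c are applied to the file under its new name, which is why getting the pairing right matters so much for merges.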

git log --follow also uses the rename detector

When using git log --follow, Git walks the commit history, one commit-pair—child-and-parent—at a time, doing diffs from parent to child. It turns on a limited form of the rename detection code to see if the one file you're --follow-ing was renamed in that commit pair. If so, as soon as git log moves to the parent, it changes which name it looks for. This technique works fairly well, but has some issues at merges (because merge commits have more than one parent).
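The name-switching idea can be sketched for a strictly linear history. The commit IDs c1 to c3, the follow function, and the shape of the history list are all made up for this illustration; merges are ignored entirely.

```python
def follow(history: list[tuple[str, dict[str, str]]], path: str):
    """Walk commits from newest to oldest, tracking one file across renames.

    Each history item is (commit_id, renames), where renames maps the name
    in the parent to the name in that commit, as found by rename detection.
    Returns a list of (commit_id, name_of_the_file_in_that_commit).
    """
    seen = []
    for commit_id, renames in history:
        seen.append((commit_id, path))
        for old_name, new_name in renames.items():
            if new_name == path:
                # Renamed in this commit: look for the old name from here on.
                path = old_name
                break
    return seen


history = [
    ("c3", {}),
    ("c2", {"a/b.c": "d/e.f"}),  # c2 renamed a/b.c to d/e.f
    ("c1", {}),
]
for commit_id, name in follow(history, "d/e.f"):
    print(commit_id, name)
```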

Conclusion

File identity is what this is all about. Since Git doesn't know, a priori, that file a/b.c in commit L is or is not "the same" file as file d/e.f in commit R, Git can use rename detection to decide. In some cases—such as checking out commit L or R—this does not matter one bit. In some cases, such as diffing the two commits, it matters, but only for us as humans trying to understand what happened. But in a few cases, such as merging, it's very important.