What exactly does git fetch do?

git fetch itself is really quite simple. The complicated parts come before and after.

The first thing to know here is that Git stores commits. In fact, this is essentially what Git is about: it manages a collection of commits. This collection rarely shrinks: for the most part, the only thing you ever do with this collection of commits is add new commits.

Commits, the index, and the work-tree

Each commit has several pieces of information, such as the author's name and email address and a time-stamp. Each commit also saves a complete snapshot of all the files you told it to: these are the files stored in your index (also known as your staging area) at the time you ran git commit. This is also true of commits you obtain from someone else: they save the files that were in the other user's index at the time the other user ran git commit.

Note that each Git repository has just the one index, at least initially. This index is linked with the one work-tree. Since Git 2.5, you can use git worktree add to add additional work-trees; each new work-tree comes with its own new index/staging-area. The point of this index is to act as an intermediate file-holder, situated between "the current commit" (aka HEAD) and the work-tree. Initially, the HEAD commit and the index normally match: they contain the same versions of all the committed files. Git copies the files from HEAD into the index, and then from the index into the work-tree.

It's easy to see the work-tree: it has your files in their ordinary format, where you can view and edit them with all the regular tools on your computer. If you write Java or Python code, or HTML for a web server, the work-tree files are usable by the compiler or interpreter or web-server. The files stored in the index, and stored in each Git commit, do not have this form and are not usable by the compilers, interpreters, web-servers, and so on.

One other thing to remember about commits is that once a file is in a commit, it cannot be changed. No part of any commit can ever change. A commit is therefore permanent, or at least permanent unless it is removed (which can be done but is difficult and usually undesirable). What is in the index and work-tree, however, can be modified at any time. This is why they exist: the index is almost a "modifiable commit" (except that it's not saved until you run git commit), and the work-tree keeps the files in the form that the rest of the computer can use. [1]


[1] It's not necessary to have both the index and the work-tree. The VCS could treat the work-tree as the "modifiable commit". This is what Mercurial does; this is why Mercurial does not need an index. This is arguably a better design, but it's not the way Git works, so when using Git, you have an index. The presence of the index is a large part of what makes Git so fast: without it, Mercurial has to be extra-clever, and is still not as fast as Git.


Commits remember their parent; new commits are children

When you make a new commit by running git commit, Git takes the index contents and makes a permanent snapshot of everything that is in it right at that point. (This is why you must git add files: you copy them from your work-tree, where you have changed them, back into your index, so that they are ready to be "photographed" for the new snapshot.) Git also collects a commit message, and of course uses your name and email address and the current time, to make the new commit.
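The relationship between work-tree, index, and snapshot can be sketched with a toy model. This is not Git's storage format: the ToyRepo class and its dictionaries are hypothetical names invented for illustration, and real Git stores compressed objects rather than Python strings.

```python
from types import MappingProxyType


class ToyRepo:
    """Toy model of work-tree, index, and commits (not Git's real format)."""

    def __init__(self):
        self.work_tree = {}   # path -> contents, freely editable
        self.index = {}       # path -> contents staged for the next commit
        self.commits = []     # list of (message, frozen snapshot)

    def add(self, path):
        # Like `git add`: copy the work-tree version into the index.
        self.index[path] = self.work_tree[path]

    def commit(self, message):
        # Like `git commit`: freeze whatever is in the index right now.
        snapshot = MappingProxyType(dict(self.index))
        self.commits.append((message, snapshot))
        return snapshot


repo = ToyRepo()
repo.work_tree["README"] = "v1"
repo.add("README")
first = repo.commit("initial")

repo.work_tree["README"] = "v2"   # edited in the work-tree, but not re-added
second = repo.commit("forgot to add")

repo.add("README")                # now the index holds v2
third = repo.commit("now staged")

print(first["README"], second["README"], third["README"])  # v1 v1 v2
```

The second snapshot still holds v1 because the edit never reached the index, which is the situation git add exists to prevent.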

But Git also stores, in the new commit, the hash ID of the current commit. We say that the new commit "points back to" the current commit. Consider, for instance, this simple three-commit repository:

A <-B <-C   <-- master (HEAD)

Here we say that the branch name master "points to" the third commit, which I have labeled C, rather than using one of Git's incomprehensible hash IDs like b06d364.... (The name HEAD refers to the branch name, master. This is how Git can turn the string HEAD into the correct hash ID: Git follows HEAD to master, then reads the hash ID out of master.) It's commit C itself that "points to"—retains the hash ID of—commit B, though; and commit B points to commit A. (Since commit A is the very first commit ever, there is no earlier commit for it to point to, so it doesn't point anywhere at all, which makes it a bit special. This is called a root commit.)

To make a new commit, Git packages up the index into a snapshot, saves that with your name and email address and so on, and includes the hash ID of commit C, to make a new commit with a new hash ID. We will use D instead of the new hash ID since we don't know what the new hash ID will be:

A <-B <-C <-D

Note how D points to C. Now that D exists, Git alters the hash ID stored under the name master, to store D's hash ID instead of C's. The name stored in HEAD itself does not change at all: it's still master. So now we have this:

A <-B <-C <-D   <-- master (HEAD)

You can see from this diagram how Git works: given a name, like master, Git simply follows the arrow to find the latest commit. That commit has a backwards arrow to its earlier or parent commit, which has another backwards arrow to its own parent, and so on, throughout all its ancestors leading back to the root commit.
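The bookkeeping for making a commit can be sketched as a toy model. The commits and refs dictionaries, the resolve and commit helpers, and the simplified hash input are all assumptions made for illustration; real Git hashes a complete commit object and stores these things as files.

```python
import hashlib

commits = {}               # hash ID -> {"parent": ..., "message": ...}
refs = {"master": None}    # branch name -> hash ID of its tip commit
HEAD = "master"            # HEAD holds a branch *name*, not a hash ID


def resolve(name):
    # Follow HEAD to the branch name, then read the hash ID out of it.
    if name == "HEAD":
        name = HEAD
    return refs[name]


def commit(message):
    parent = resolve("HEAD")
    # Toy hash: real Git hashes the whole commit object, not just these fields.
    hash_id = hashlib.sha1(f"{parent}\n{message}".encode()).hexdigest()
    commits[hash_id] = {"parent": parent, "message": message}
    refs[HEAD] = hash_id   # the branch moves; HEAD still names the branch
    return hash_id


a = commit("A")
b = commit("B")
c = commit("C")
d = commit("D")

print(HEAD, resolve("HEAD") == d, commits[d]["parent"] == c)
```

After the fourth call, master holds D's ID and D records C as its parent, while HEAD still contains only the name master.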

Note that while children remember their parents, the parent commits do not remember their children. This is because no part of any commit can ever change: Git literally can't add the children to the parent, and it does not even try. Git must always work backwards, from newer to older. The commit arrows all automatically point backwards, so normally I do not even draw them:

A--B--C--D   <-- master (HEAD)
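Walking backwards from a branch name can be sketched in a few lines. This toy uses the letters from the diagrams in place of hash IDs, and the parents table and history function are hypothetical names; it follows only the first parent, which is enough for a linear chain.

```python
# Toy graph using letters instead of hash IDs, as in the diagrams above.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}
refs = {"master": "D"}


def history(branch):
    """Yield commits from the branch tip back to the root commit."""
    current = refs[branch]
    while current is not None:
        yield current
        its_parents = parents[current]
        # A root commit has no parent, so the walk stops there.
        current = its_parents[0] if its_parents else None


def children_of(commit):
    # Parents do not record children, so finding them means scanning everything.
    return [c for c, ps in parents.items() if commit in ps]


print(" -> ".join(history("master")))  # D -> C -> B -> A
print(children_of("B"))                # ['C']
```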

Distributed repositories: what git fetch does

When we use git fetch, we have two different Gits, with different—but related—repositories. Suppose we have two Git repositories, on two different computers, that both start out with those same three commits:

A--B--C

Because they start out with the exact same commits, these three commits also have the same hash IDs. This part is very clever and is the reason the hash IDs are the way they are: the hash ID is a checksum [2] of the contents of the commit, so that any two commits that are exactly identical always have the same hash ID.
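A short sketch shows why identical content yields identical IDs. Git's SHA-1 object ID is the hash of a small header (type, size, a NUL byte) followed by the object's bytes; the toy_commit helper below is a simplified assumption, since real commit objects also carry a committer line and timestamps.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    # Git hashes a short header ("<type> <size>\0") followed by the content.
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def toy_commit(tree, parent, author, message):
    # Simplified commit text; real commits also record committer and dates.
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines += [f"author {author}", "", message]
    return "\n".join(lines).encode()


tree = git_object_id("tree", b"")        # the ID of an empty tree
author = "A U Thor <author@example.com>"  # placeholder identity

harry = git_object_id("commit", toy_commit(tree, None, author, "A"))
sally = git_object_id("commit", toy_commit(tree, None, author, "A"))
child = git_object_id("commit", toy_commit(tree, harry, author, "A"))

print(harry == sally)  # True: same bytes, same ID, on any computer
print(harry == child)  # False: a different parent changes the ID
```

Because the parent's ID is part of the hashed bytes, changing anything in an ancestor would change the ID of every descendant as well.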

Now, you, in your Git and your repository, have added a new commit D. Meanwhile they—whoever they are—may have added their own new commits. We'll use different letters since their commits will necessarily have different hashes. We'll also look at this mostly from your (Harry's) point of view; we'll call them "Sally". We'll add one more thing to our picture of your repository: it now looks like this:

A--B--C   <-- sally/master
       \
        D   <-- master (HEAD)

Now let's assume that Sally made two commits. In her repository, she now has this:

A--B--C--E--F   <-- master (HEAD)

or perhaps (if she fetches from you, but has not yet run git fetch):

A--B--C   <-- harry/master
       \
        E--F   <-- master (HEAD)

When you run git fetch, you connect your Git to Sally's Git, and ask her if she has any new commits added to her master since commit C. She does—she has her new commits E and F. So your Git gets those commits from her, along with everything needed to complete the snapshots for those commits. Your Git then adds those commits to your repository, so that you now have this:

        E--F   <-- sally/master
       /
A--B--C
       \
        D   <-- master (HEAD)

As you can see, what git fetch did for you was to collect all of her new commits and add them to your repository.

In order to remember where her master is, now that you have talked with her Git, your Git copies her master to your sally/master. Your own master, and your own HEAD, do not change at all. Only these "memory of another Git repository" names, which Git calls remote-tracking branch names, change.
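The effect of git fetch on the two repositories can be sketched as a toy model. The dictionaries and the fetch function are hypothetical; real Git negotiates over a network protocol so that only missing objects are transferred, whereas this sketch simply checks each commit.

```python
def ancestors(commits, tip):
    """Return the set of commits reachable from tip by following parents."""
    seen, stack = set(), [tip]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(commits[c])
    return seen


def fetch(mine, theirs, remote_name):
    """Copy their commits that we lack, then record where their branches are."""
    for branch, tip in theirs["refs"].items():
        for c in ancestors(theirs["commits"], tip):
            mine["commits"].setdefault(c, list(theirs["commits"][c]))
        # Only the remote-tracking name changes; our own branches do not.
        mine["refs"][f"{remote_name}/{branch}"] = tip


harry = {
    "commits": {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]},
    "refs": {"master": "D", "sally/master": "C"},
}
sally = {
    "commits": {"A": [], "B": ["A"], "C": ["B"], "E": ["C"], "F": ["E"]},
    "refs": {"master": "F"},
}

fetch(harry, sally, "sally")
print(sorted(harry["commits"]))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(harry["refs"])             # {'master': 'D', 'sally/master': 'F'}
```

Sally's repository is untouched by this: she still has no commit D until she fetches from Harry or he pushes to her.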


[2] This hash is a cryptographic hash, in part so that it's difficult to fool Git, and in part because cryptographic hashes naturally behave well for Git's purposes. The current hash uses SHA-1, which was considered secure but has since been shown vulnerable to practical collision attacks and is now being abandoned for cryptography. Git is moving toward SHA-256 (experimental support for SHA-256 repositories arrived in Git 2.29). There will be a transition period with some unpleasantness. :-)


You should now merge or rebase—git reset is generally wrong

Note that after you have fetched from Sally, it is your repository, and only your repository, that has all the work from both of you. Sally still does not have your new commit D.

This is still true even if instead of "Sally", your other Git is called origin. Now that you have both master and origin/master, you must do something to connect your new commit D with their latest commit F:

A--B--C--D   <-- master (HEAD)
       \
        E--F   <-- origin/master

(I moved D on top for graph-drawing reasons, but this is the same graph as before.)

Your main two choices here are to use git merge or git rebase. (There are other ways to do this but these are the two to learn.)

Merge is actually the simpler of the two, because git rebase builds on the same underlying operation: the verb form of merging, to merge. What git merge does is to run that verb form, and then commit the result as a new commit that is called a merge commit or simply "a merge", which is the noun form. We can draw the new merge commit G this way:

A--B--C--D---G   <-- master (HEAD)
       \    /
        E--F   <-- origin/master

Unlike a regular commit, a merge commit has two parents. [3] It connects back to both of the two earlier commits that were used to make the merge. This makes it possible to push your new commit G to origin: G takes with it your D, but also connects back to their F, so their Git is OK with this new update.

This merge is the same kind of merge you get from merging two branches. And in fact, you did merge two branches here: you merged your master with Sally's (or origin's) master.
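The graph side of a merge can be sketched as a toy model: find the commit both sides share (the merge base), then record a new commit with two parents. The function names are hypothetical, the search is a simplification that suits this small graph, and the sketch does not combine file contents, which is the hard part of a real merge.

```python
from collections import deque

commits = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["C"], "F": ["E"]}
refs = {"master": "D", "origin/master": "F"}


def ancestors(tip):
    seen, stack = set(), [tip]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(commits[c])
    return seen


def merge_base(ours, theirs):
    # Breadth-first walk back from ours; the first commit that is also
    # reachable from theirs is a common ancestor. Enough for this graph;
    # real Git handles harder cases such as criss-cross merges.
    reachable = ancestors(theirs)
    queue = deque([ours])
    while queue:
        c = queue.popleft()
        if c in reachable:
            return c
        queue.extend(commits[c])
    return None


base = merge_base(refs["master"], refs["origin/master"])

# The merge commit records both tips as parents, then master moves to it.
commits["G"] = [refs["master"], refs["origin/master"]]
refs["master"] = "G"

print(base)          # C
print(commits["G"])  # ['D', 'F']
```

Since F is now reachable from G, the other repository can accept G as a plain extension of what it already has.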

Using git rebase is usually easy, but what it does is more complicated. Instead of merging your commit D with their commit F to make a new merge commit G, what git rebase does is to copy each of your commits so that the new copies, which are new and different commits, come after the latest commit on your upstream.

Here, your upstream is origin/master, and the only commit that you have and they don't is your one commit D. So git rebase makes a copy of D, which I will call D', placing the copy after their commit F, so that the parent of D' is F. The intermediate graph looks like this: [5]

A--B--C--D   <-- master
       \
        E--F   <-- origin/master
            \
             D'   <-- HEAD

The copying process uses the same merging code that git merge uses to do the verb form, to merge, of your changes from commit D. [4] Once the copy is done, however, the rebase code sees that there are no more commits to copy, so it then changes your master branch to point to the final copied commit D':

A--B--C--D   [abandoned]
       \
        E--F   <-- origin/master
            \
             D'   <-- master (HEAD)

This abandons the original commit D. [6] This means we can stop drawing it too, so now we get:

A--B--C--E--F   <-- origin/master
             \
              D'   <-- master (HEAD)

It's now easy to git push your new commit D' back to origin.
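The graph side of a rebase can be sketched the same way. This toy only rewires parents and moves the branch name; the rebase function and the trailing apostrophe used to name copies are assumptions for illustration, and real Git re-applies each commit's changes, which is where conflicts can occur.

```python
commits = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["C"], "F": ["E"]}
refs = {"master": "D", "origin/master": "F"}


def ancestors(tip):
    seen, stack = set(), [tip]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(commits[c])
    return seen


def rebase(branch, upstream):
    upstream_tip = refs[upstream]
    have = ancestors(upstream_tip)
    # Collect our commits that upstream lacks (linear history only).
    to_copy, c = [], refs[branch]
    while c not in have:
        to_copy.append(c)
        c = commits[c][0]
    new_tip = upstream_tip
    for old in reversed(to_copy):   # oldest first
        copy = old + "'"            # a new commit with a new ID
        commits[copy] = [new_tip]   # parent: previous copy, or the upstream tip
        new_tip = copy
    refs[branch] = new_tip          # originals remain, but no name finds them
    return new_tip


rebase("master", "origin/master")
print(refs)           # {'master': "D'", 'origin/master': 'F'}
print(commits["D'"])  # ['F']
```

The original D is still present in the table after the rebase, which mirrors how Git keeps abandoned commits around for a while rather than deleting them at once.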


[3] In Git (but not Mercurial), a merge commit can have more than two parents. This doesn't do anything you cannot do by repeated merging, so it's mainly for showing off. :-)

[4] Technically, the merge base commit, at least for this case, is commit C and the two tip commits are D and F, so in this case it's literally exactly the same. If you rebase more than one commit, it gets a little more complicated, but in principle it's still straightforward.

[5] This intermediate state, where HEAD is detached from master, is usually invisible. You see it only if something goes wrong during the verb-form-of-merge, so that Git stops and has to get help from you to finish the merge operation. When that does occur, though (when there is a merge conflict during rebasing), it's important to know that Git is in this "detached HEAD" state, but as long as the rebase completes on its own, you don't have to care about this so much.

[6] The original commit chain is retained temporarily through Git's reflogs and via the name ORIG_HEAD. The ORIG_HEAD value gets overwritten by the next operation that makes a "big change", and the reflog entry eventually expires, typically after 30 days for this entry. After that, a git gc will really remove the original commit chain.


The git pull command just runs git fetch and then a second command

Note that after git fetch, you usually have to run a second Git command, either git merge or git rebase.

If you know in advance that you will, for certain, immediately use one of those two commands, you can use git pull, which runs git fetch and then runs one of those two commands. You pick which second command to run by setting pull.rebase or supplying --rebase as a command-line option.

Until you are quite familiar with how git merge and git rebase work, however, I suggest not using git pull, because sometimes git merge and git rebase fail to complete on their own. In this case, you must know how to deal with this failure. You must know which command you actually ran. If you run the command yourself, you will know which command you ran, and where to look for help if necessary. If you run git pull, you may not even know which second command you ran!

Besides this, sometimes you might want to look before you run the second command. How many commits did git fetch bring in? How much work will it be to do a merge vs a rebase? Is merge better than rebase right now, or is rebase better than merge? To answer any of these questions, you must separate the git fetch step from the second command. If you use git pull, you must decide in advance which command to run, before you even know which one is the one to use.
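The "how many commits" question reduces to comparing two sets of reachable commits. The toy below uses hypothetical helper names and the graph from the diagrams; in a real repository, git rev-list --left-right --count master...origin/master reports the same two numbers.

```python
commits = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["C"], "F": ["E"]}
refs = {"master": "D", "origin/master": "F"}


def ancestors(tip):
    seen, stack = set(), [tip]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(commits[c])
    return seen


def ahead_behind(local, remote):
    mine = ancestors(refs[local])
    theirs = ancestors(refs[remote])
    # Ahead: commits only we have. Behind: commits only they have.
    return len(mine - theirs), len(theirs - mine)


ahead, behind = ahead_behind("master", "origin/master")
print(f"ahead {ahead}, behind {behind}")  # ahead 1, behind 2
```

With one commit ahead and two behind, a rebase would copy one commit, while a merge would add one merge commit; seeing those numbers first is what separating the fetch step makes possible.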

In short, only use git pull after you're familiar with the way the two parts of it—git fetch, and the second command you choose—really work.


You don't have to do two separate commits, and git fetch won't drop any log.

 --o--o--o (origin/master)
          \
           x--x (master: my local commits)

What you should do is rebase your local commits on top of any new commits fetched by the git fetch command:

git fetch

--o--o--o--O--O (origin/master updated)
         \
          x--x (master)

git rebase origin/master

--o--o--o--O--O (origin/master updated)
               \
                x'--x' (master rebased)

git push

--o--o--o--O--O--x'--x' (origin/master, master)

Even simpler, since Git 2.6, I would use the config:

git config pull.rebase true
git config rebase.autoStash true

Then a simple git pull would automatically replay your local commits on top of origin/master. Then you can git push.
