What is my bottleneck when cloning a git repository from a virtual machine with a fast network connection?

PS. Fair warning:

git is generally considered blazingly fast. You should try cloning a full repo from darcs, bazaar, hg (god forbid: TFS or subversion...). Also, if you routinely clone full repos from scratch, you'd be doing something wrong anyway. You can always just git remote update and get incremental changes.

For various other ways to keep full repos in sync, see e.g.

  • "fetch --all" in a git bare repository doesn't synchronize local branches to the remote ones
  • How to update a git clone --mirror?

(They contain links to other relevant SO posts.)
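As a rough sketch of the incremental approach, the following builds a throwaway repository, mirrors it once, and then picks up a later commit with git remote update. The scratch directory and the names origin and mirror.git are made up for the demonstration.

```shell
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory

# A stand-in "upstream" repository with one commit.
git init -q "$work/origin"
git -C "$work/origin" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"

# One-time full copy of every ref.
git clone -q --mirror "$work/origin" "$work/mirror.git"

# Later, new work appears upstream ...
git -C "$work/origin" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second"

# ... and the mirror only has to fetch what changed.
git -C "$work/mirror.git" remote update >/dev/null 2>&1

mirror_commits=$(git -C "$work/mirror.git" rev-list --count --all)
echo "commits in mirror: $mirror_commits"
```

The initial mirror is the only expensive step; every later remote update transfers just the new objects.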

Dumb copy

As mentioned you could just copy a repository with 'dumb' file transfer.

This will certainly not waste time compressing, repacking, deltifying and/or filtering.

Plus, you will get

  • hooks
  • config (remotes, push branches, settings (whitespace, merge, aliases, user details etc.))
  • stashes (see Can I fetch a stash from a remote repo into a local branch? also)
  • rerere cache
  • reflogs
  • backups (from filter-branch, e.g.) and various other things (intermediate state from rebase, bisect etc.)

This may or may not be what you require, but it is nice to be aware of the fact.
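To illustrate the difference, here is a small sketch that copies a repository with plain cp -R and compares it with a regular clone. The repository layout, the alias, and the file name are invented for the example; rsync or scp would serve the same purpose across machines.

```shell
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory
git init -q "$work/src"
cd "$work/src"
git config user.name "Demo"
git config user.email "demo@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"

git config alias.st status              # repository-local config
echo two > file.txt
git stash -q                            # one stash entry

# 'Dumb' copy: everything under .git comes along.
cp -R "$work/src" "$work/copy"
# Regular clone: only objects and refs are transferred.
git clone -q "$work/src" "$work/clone"

copy_alias=$(git -C "$work/copy" config alias.st)
copy_stashes=$(git -C "$work/copy" stash list | wc -l | tr -d ' ')
clone_stashes=$(git -C "$work/clone" stash list | wc -l | tr -d ' ')
echo "copy: alias=$copy_alias stashes=$copy_stashes; clone: stashes=$clone_stashes"
```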


Bundle

Git clone by default optimizes for bandwidth. Since git clone, by default, does not mirror all branches (see --mirror) it would not make sense to just dump the pack-files as-is (because that will send possibly way more than required).

When distributing to a truly big number of clients, consider using bundles.

If you want a fast clone without the server-side cost, the git way is bundle create. You can then distribute the bundle without the server even being involved. If you find that bundle ... --all includes more than a simple git clone would, consider e.g. bundle ... master to reduce the volume.

git bundle create snapshot.bundle --all # (or mention specific ref names instead of --all)

and distribute the snapshot bundle instead. That's the best of both worlds, while of course you won't get the items from the bullet list above. On the receiving end, just

git clone snapshot.bundle myclonedir/
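A self-contained sketch of the whole round trip, assuming a throwaway source repository (the directory names are placeholders). It also runs git bundle verify before cloning, which is a cheap sanity check on the receiving side.

```shell
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory
git init -q "$work/src"
git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"

# Pack every ref into a single file that can be shipped around.
git -C "$work/src" bundle create "$work/snapshot.bundle" --all >/dev/null 2>&1

# Check that the bundle is valid before using it.
git -C "$work/src" bundle verify "$work/snapshot.bundle" >/dev/null 2>&1

# The receiving end clones straight from the file.
git clone -q "$work/snapshot.bundle" "$work/myclonedir"

bundle_commits=$(git -C "$work/myclonedir" rev-list --count HEAD)
echo "commits in clone from bundle: $bundle_commits"
```

After cloning, the origin remote points at the bundle file, so you would normally repoint it at the real server with git remote set-url before fetching further updates.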

Compression configs

You can look at lowering server load by reducing/removing compression. Have a look at these config settings (I assume pack.compression may help you lower the server load)

core.compression

An integer -1..9, indicating a default compression level. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If set, this provides a default to other compression variables, such as core.loosecompression and pack.compression.

core.loosecompression

An integer -1..9, indicating the compression level for objects that are not in a pack file. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If not set, defaults to core.compression. If that is not set, defaults to 1 (best speed).

pack.compression

An integer -1..9, indicating the compression level for objects in a pack file. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If not set, defaults to core.compression. If that is not set, defaults to -1, the zlib default, which is "a default compromise between speed and compression (currently equivalent to level 6)."

Note that changing the compression level will not automatically recompress all existing objects. You can force recompression by passing the -F option to git-repack(1).

Given ample network bandwidth, this will in fact result in faster clones. Don't forget about git-repack -F when you decide to benchmark that!
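A small experiment along those lines, as a sketch: it packs a highly compressible file with the default settings, then again with pack.compression set to 0 and a forced recompression. The file name and size are arbitrary, and the pack size is used here only as a visible proxy for how much compression work is being done.

```shell
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo"
git config user.email "demo@example.com"

# About 1 MB of zeros: trivially compressible.
head -c 1000000 /dev/zero > zeros.bin
git add zeros.bin
git commit -q -m "add compressible file"

# Pack with the default compression level.
git repack -a -d -q
default_size=$(cat .git/objects/pack/*.pack | wc -c | tr -d ' ')

# Turn pack compression off and force existing objects to be rewritten.
git config pack.compression 0
git repack -a -d -F -q
uncompressed_size=$(cat .git/objects/pack/*.pack | wc -c | tr -d ' ')

echo "default: $default_size bytes, pack.compression=0: $uncompressed_size bytes"
```

The second pack is much larger, which is the trade being made: less CPU time on the server in exchange for more bytes on the wire.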


Use --depth to create a shallow clone.

git clone --depth 1 <repository>
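A quick way to see the effect, sketched against a throwaway three-commit repository (paths are placeholders). Note that --depth is ignored when cloning from a plain local path, so the example uses a file:// URL.

```shell
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory
git init -q "$work/src"
for i in 1 2 3; do
    git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "commit $i"
done

# Fetch only the most recent commit.
git clone -q --depth 1 "file://$work/src" "$work/shallow"

shallow_commits=$(git -C "$work/shallow" rev-list --count HEAD)
is_shallow=$(git -C "$work/shallow" rev-parse --is-shallow-repository)
echo "commits: $shallow_commits, shallow: $is_shallow"
```

If the full history is needed later, git fetch --unshallow retrieves the rest.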

The git clone --depth=1 ... suggested in 2014 became faster with Git 2.22 (Q2 2019).
That is because, during an initial "git clone --depth=..." partial clone, it is pointless to spend cycles for a large portion of the connectivity check that enumerates and skips promisor objects (which by definition is all objects fetched from the other side).
This has been optimized out.

clone: do faster object check for partial clones

For partial clones, doing a full connectivity check is wasteful; we skip promisor objects (which, for a partial clone, is all known objects), and enumerating them all to exclude them from the connectivity check can take a significant amount of time on large repos.

At most, we want to make sure that we get the objects referred to by any wanted refs.
For partial clones, just check that these objects were transferred.

Result:

  Test                          dfa33a2^            dfa33a2
  -------------------------------------------------------------------------
  5600.2: clone without blobs   18.41(22.72+1.09)   6.83(11.65+0.50) -62.9%
  5600.3: checkout of result    1.82(3.24+0.26)     1.84(3.24+0.26) +1.1%

62% faster!
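For reference, the "clone without blobs" case in that benchmark corresponds to a partial clone with --filter=blob:none. A minimal sketch against a throwaway repository follows; the paths are placeholders, and the serving side must allow filters (uploadpack.allowFilter), which hosted services may or may not enable.

```shell
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory
git init -q "$work/src"
cd "$work/src"
git config user.name "Demo"
git config user.email "demo@example.com"
git config uploadpack.allowFilter true  # let clients request filtered packs

echo hello > a.txt
git add a.txt
git commit -q -m "first"

# Partial clone: commits and trees only, blobs are fetched on demand.
# --no-checkout keeps the demo from fetching the blob right away.
git clone -q --no-checkout --filter=blob:none "file://$work/src" "$work/partial"

# Objects that are referenced but not present locally are printed with a '?'.
missing=$(git -C "$work/partial" rev-list --objects --missing=print HEAD | grep -c '^?')
echo "objects not yet downloaded: $missing"
```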


With Git 2.26 (Q1 2020), an unneeded connectivity check is now disabled in a partial clone when fetching into it.

See commit 2df1aa2, commit 5003377 (12 Jan 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 8fb3945, 14 Feb 2020)

connected: verify promisor-ness of partial clone

Signed-off-by: Jonathan Tan
Reviewed-by: Jonathan Nieder

Commit dfa33a298d ("clone: do faster object check for partial clones", 2019-04-21, Git v2.22.0-rc0 -- merge) optimized the connectivity check done when cloning with --filter to check only the existence of objects directly pointed to by refs.
But this is not sufficient: they also need to be promisor objects.
Make this check more robust by instead checking that these objects are promisor objects, that is, they appear in a promisor pack.

And:

fetch: forgo full connectivity check if --filter

Signed-off-by: Jonathan Tan
Reviewed-by: Jonathan Nieder

If a filter is specified, we do not need a full connectivity check on the contents of the packfile we just fetched; we only need to check that the objects referenced are promisor objects.

This significantly speeds up fetches into repositories that have many promisor objects, because during the connectivity check, all promisor objects are enumerated (to mark them UNINTERESTING), and that takes a significant amount of time.
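The scenario this commit speeds up looks roughly like the following sketch: a partial clone is created, upstream gains a commit, and a later fetch reuses the filter recorded at clone time. Repository paths and file names are made up for the example.

```shell
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory
git init -q "$work/src"
cd "$work/src"
git config user.name "Demo"
git config user.email "demo@example.com"
git config uploadpack.allowFilter true  # let clients request filtered packs

echo hello > a.txt
git add a.txt
git commit -q -m "first"

git clone -q --no-checkout --filter=blob:none "file://$work/src" "$work/partial"

# Upstream moves on.
echo world > b.txt
git add b.txt
git commit -q -m "second"

# The fetch applies the filter remembered from the clone.
git -C "$work/partial" fetch -q origin

fetched_commits=$(git -C "$work/partial" rev-list --count --remotes)
still_missing=$(git -C "$work/partial" rev-list --objects --missing=print --remotes | grep -c '^?')
echo "commits: $fetched_commits, blobs still not downloaded: $still_missing"
```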


And, still with Git 2.26 (Q1 2020), the object reachability bitmap machinery and the partial cloning machinery were not prepared to work well together, because some object-filtering criteria that partial clones use inherently rely on object traversal, but the bitmap machinery is an optimization to bypass that object traversal.

There however are some cases where they can work together, and they were taught about them.

See commit 20a5fd8 (18 Feb 2020) by Junio C Hamano (gitster).
See commit 3ab3185, commit 84243da, commit 4f3bd56, commit cc4aa28, commit 2aaeb9a, commit 6663ae0, commit 4eb707e, commit ea047a8, commit 608d9c9, commit 55cb10f, commit 792f811, commit d90fe06 (14 Feb 2020), and commit e03f928, commit acac50d, commit 551cf8b (13 Feb 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 0df82d9, 02 Mar 2020)

pack-bitmap: implement BLOB_LIMIT filtering

Signed-off-by: Jeff King

Just as the previous commit implemented BLOB_NONE, we can support BLOB_LIMIT filters by looking at the sizes of any blobs in the result and unsetting their bits as appropriate.
This is slightly more expensive than BLOB_NONE, but still produces a noticeable speedup (these results are on git.git):

Test                                         HEAD~2            HEAD
------------------------------------------------------------------------------------
5310.9:  rev-list count with blob:none       1.80(1.77+0.02)   0.22(0.20+0.02) -87.8%
5310.10: rev-list count with blob:limit=1k   1.99(1.96+0.03)   0.29(0.25+0.03) -85.4%

The implementation is similar to the BLOB_NONE one, with the exception that we have to go object-by-object while walking the blob-type bitmap (since we can't mask out the matches, but must look up the size individually for each blob).
The trick with using ctz64() is taken from show_objects_for_type(), which likewise needs to find individual bits (but wants to quickly skip over big chunks without blobs).
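To try this locally, the repository needs a bitmap index, which git repack can write. The sketch below builds a tiny throwaway repository and checks that the filtered traversal gives the same object count with and without --use-bitmap-index; on a repository this small there is nothing to measure, so it only shows the commands involved. On Git versions older than 2.26 the bitmap option is simply not used for filtered walks.

```shell
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo"
git config user.email "demo@example.com"

for i in 1 2 3; do
    echo "$i" > "f$i.txt"
    git add "f$i.txt"
    git commit -q -m "commit $i"
done

# Put everything in one pack and write a reachability bitmap for it.
git repack -a -d -q --write-bitmap-index

with_bitmap=$(git rev-list --objects --use-bitmap-index --filter=blob:none HEAD | wc -l | tr -d ' ')
without_bitmap=$(git rev-list --objects --filter=blob:none HEAD | wc -l | tr -d ' ')
echo "with bitmap: $with_bitmap, without: $without_bitmap"
```

Both counts cover the three commits and their three root trees; the blobs are filtered out.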


Git 2.27 (Q2 2020) will simplify the commit ancestry connectedness check in a partial clone repository in which "promised" objects are assumed to be obtainable lazily on-demand from promisor remote repositories.

See commit 2b98478 (20 Mar 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 0c60105, 22 Apr 2020)

connected: always use partial clone optimization

Signed-off-by: Jonathan Tan
Reviewed-by: Josh Steadmon

With 50033772d5 ("connected: verify promisor-ness of partial clone", 2020-01-30, Git v2.26.0-rc0 -- merge listed in batch #5), the fast path (checking promisor packs) in check_connected() now passes a subset of the slow path (rev-list) - if all objects to be checked are found in promisor packs, both the fast path and the slow path will pass; otherwise, the fast path will definitely not pass.

This means that we can always attempt the fast path whenever we need to do the slow path.

The fast path is currently guarded by a flag; therefore, remove that flag.
Also, make the fast path fall back to the slow path - if the fast path fails, the failing OID and all remaining OIDs will be passed to rev-list.

The main user-visible benefit is the performance of fetch from a partial clone - specifically, the speedup of the connectivity check done before the fetch.
In particular, a no-op fetch into a partial clone on my computer was sped up from 7 seconds to 0.01 seconds. This is a complement to the work in 2df1aa239c ("fetch: forgo full connectivity check if --filter", 2020-01-30, Git v2.26.0-rc0 -- merge listed in batch #5), which is the child of the aforementioned 50033772d5. In that commit, the connectivity check after the fetch was sped up.

The addition of the fast path might cause performance reductions in these cases:

  • If a partial clone or a fetch into a partial clone fails, Git will fruitlessly run rev-list (it is expected that everything fetched would go into promisor packs, so if that didn't happen, it is most likely that rev-list will fail too).

  • Any connectivity checks done by receive-pack, in the (in my opinion, unlikely) event that a partial clone serves receive-pack.

I think that these cases are rare enough, and the performance reduction in this case minor enough (additional object DB access), that the benefit of avoiding a flag outweighs these.


With Git 2.27 (Q2 2020), the object walk with object filter "--filter=tree:0" can now take advantage of the pack bitmap when available.

See commit 9639474, commit 5bf7f1e (04 May 2020) by Jeff King (peff).
See commit b0a8d48, commit 856e12c (04 May 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 69ae8ff, 13 May 2020)

pack-bitmap.c: support 'tree:0' filtering

Signed-off-by: Taylor Blau

In the previous patch, we made it easy to define other filters that exclude all objects of a certain type. Use that in order to implement bitmap-level filtering for the '--filter=tree:<n>' filter when 'n' is equal to 0.

The general case is not helped by bitmaps, since for values of 'n > 0', the object filtering machinery requires a full-blown tree traversal in order to determine the depth of a given tree.
Caching this is non-obvious, too, since the same tree object can have a different depth depending on the context (e.g., a tree was moved up in the directory hierarchy between two commits).

But, the 'n = 0' case can be helped, and this patch does so.
Running p5310.11 in this tree and on master with the kernel, we can see that this case is helped substantially:

Test                                  master              this tree
--------------------------------------------------------------------------------
5310.11: rev-list count with tree:0   10.68(10.39+0.27)   0.06(0.04+0.01) -99.4%
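A sketch of what the tree:0 filter selects, using a small throwaway repository with a bitmap index: with trees and blobs excluded, the object walk from HEAD should list commits only. The repository contents are invented for the example, and the speedup itself is only visible on large repositories such as the kernel.

```shell
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo"
git config user.email "demo@example.com"

for i in 1 2 3; do
    echo "$i" > "f$i.txt"
    git add "f$i.txt"
    git commit -q -m "commit $i"
done

git repack -a -d -q --write-bitmap-index

# tree:0 drops every tree and blob, leaving just the commits.
tree0_objects=$(git rev-list --objects --use-bitmap-index --filter=tree:0 HEAD | wc -l | tr -d ' ')
commit_count=$(git rev-list --count HEAD)
echo "objects with tree:0: $tree0_objects, commits: $commit_count"
```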

And:

pack-bitmap: pass object filter to fill-in traversal

Signed-off-by: Jeff King
Signed-off-by: Taylor Blau

Sometimes a bitmap traversal still has to walk some commits manually, because those commits aren't included in the bitmap packfile (e.g., due to a push or commit since the last full repack).

If we're given an object filter, we don't pass it down to this traversal.
It's not necessary for correctness because the bitmap code has its own filters to post-process the bitmap result (which it must, to filter out the objects that are mentioned in the bitmapped packfile).

And with blob filters, there was no performance reason to pass along those filters, either. The fill-in traversal could omit them from the result, but it wouldn't save us any time to do so, since we'd still have to walk each tree entry to see if it's a blob or not.

But now that we support tree filters, there's opportunity for savings. A tree:depth=0 filter means we can avoid accessing trees entirely, since we know we won't need them (or any of the subtrees or blobs they point to).
The new test in p5310 shows this off (the "partial bitmap" state is one where HEAD~100 and its ancestors are all in a bitmapped pack, but HEAD~100..HEAD are not).

Here are the results (run against linux.git):

Test                                                  HEAD^               HEAD
-------------------------------------------------------------------------------------------------
[...]
5310.16: rev-list with tree filter (partial bitmap)   0.19(0.17+0.02)     0.03(0.02+0.01) -84.2%

The absolute number of savings isn't huge, but keep in mind that we only omitted 100 first-parent links (in the version of linux.git here, that's 894 actual commits).

In a more pathological case, we might have a much larger proportion of non-bitmapped commits. I didn't bother creating such a case in the perf script because the setup is expensive, and this is plenty to show the savings as a percentage.