Impact of large number of branches in a git repo?

March 2015: I don't have benchmarks, but one way to ensure a git fetch remains reasonable even if the upstream repo has a large set of branches is to specify a less general refspec than the default one:

fetch = +refs/heads/*:refs/remotes/origin/*

You can add as many fetch refspecs to a remote as you want, effectively replacing the catch-all refspec above with more specific specs that include only the branches you actually need (even if the remote repo has thousands of them):

fetch = +refs/heads/master:refs/remotes/origin/master
fetch = +refs/heads/br*:refs/remotes/origin/br*
fetch = +refs/heads/mybranch:refs/remotes/origin/mybranch
....
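
A minimal sketch of how to set that up from the command line (the branch names here are just examples; use the refs you actually need):

git config --unset-all remote.origin.fetch
git config --add remote.origin.fetch '+refs/heads/master:refs/remotes/origin/master'
git config --add remote.origin.fetch '+refs/heads/mybranch:refs/remotes/origin/mybranch'
git fetch origin

After that, only the refs matched by those refspecs are fetched and tracked locally.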

April 2018: git fetch will improve with Git 2.18 (Q2 2018).

See commit 024aa46 (14 Mar 2018) by Takuto Ikuta (atetubou).
(Merged by Junio C Hamano -- gitster -- in commit 5d806b7, 09 Apr 2018)

fetch-pack.c: use oidset to check existence of loose object

When fetching from a repository with a large number of refs, 'git fetch' checks the existence of each ref in the local repository against both packed and loose objects, and so ends up doing a lot of lstat(2) calls for non-existing loose objects, which makes it slow.

Instead of making as many lstat(2) calls as there are refs advertised by the remote side to see whether those objects exist in loose form, first enumerate all the existing loose objects into a hashmap beforehand and use it to check for their existence, if the number of refs is larger than the number of loose objects.

With this patch, the number of lstat(2) calls in git fetch is reduced from 411412 to 13794 for the chromium repository, which has more than 480000 remote refs.

I measured the time of git fetch (when fetch-pack happens) for the chromium repository three times, on Linux with an SSD.

* with this patch
8.105s
8.309s
7.640s
avg: 8.018s

* master
12.287s
11.175s
12.227s
avg: 11.896s

On my MacBook Air, which has slower lstat(2):

* with this patch
14.501s

* master
1m16.027s

git fetch on a slow disk will be improved significantly.
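
If you want to observe this effect on your own repository, a rough (Linux-only) way is to count the stat-family syscalls a fetch makes; strace is not part of Git, and on newer systems the underlying call may show up as newfstatat or statx rather than lstat:

strace -f -c -e trace=lstat,newfstatat,statx git fetch origin

The summary table printed by strace -c shows how many of those calls the fetch issued.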


Note that this hashmap used in packfile does improve with Git 2.24 (Q4 2019).

See commit e2b5038, commit 404ab78, commit 23dee69, commit c8e424c, commit 8a973d0, commit 87571c3, commit 939af16, commit f23a465, commit f0e63c4, commit 6bcbdfb, commit 973d5ee, commit 26b455f, commit 28ee794, commit b6c5241, commit b94e5c1, commit f6eb6bd, commit d22245a, commit d0a48a0, commit 12878c8, commit e010a41 (06 Oct 2019) by Eric Wong (ele828).
Suggested-by: Phillip Wood (phillipwood).
(Merged by Junio C Hamano -- gitster -- in commit 5efabc7, 15 Oct 2019)

For example:

packfile: use hashmap_entry in delta_base_cache_entry

Signed-off-by: Eric Wong
Reviewed-by: Derrick Stolee

The hashmap_entry_init function is intended to take a hashmap_entry struct pointer, not a hashmap struct pointer.

This was not noticed because hashmap_entry_init takes a "void *" arg instead of "struct hashmap_entry *", and the hashmap struct is larger and can be cast into a hashmap_entry struct without data corruption.

This has the beneficial side effect of reducing the size of a delta_base_cache_entry from 104 bytes to 72 bytes on 64-bit systems.


Before Git 2.29 (Q4 2020), there was logic to estimate how many objects are in the repository, which is meant to run once per process invocation, but it ran every time the estimated value was requested.

This is faster with Git 2.29:

See commit 67bb65d (17 Sep 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 221b755, 22 Sep 2020)

packfile: actually set approximate_object_count_valid

Reported-by: Rasmus Villemoes
Signed-off-by: Jeff King

The approximate_object_count() function tries to compute the count only once per process. But ever since it was introduced in 8e3f52d778 (find_unique_abbrev: move logic out of get_short_sha1(), 2016-10-03, Git v2.11.0-rc0), we failed to actually set the "valid" flag, meaning we'd compute it fresh on every call.

This turns out not to be too bad, because we're only iterating through the packed_git list, and not making any system calls. But since it may get called for every abbreviated hash we output, even this can add up if you have many packs.

Here are before-and-after timings for a new perf test which just asks rev-list to abbreviate each commit hash (the test repo is linux.git, with commit-graphs):

Test                            origin              HEAD
----------------------------------------------------------------------------
5303.3: rev-list (1)            28.91(28.46+0.44)   29.03(28.65+0.38) +0.4%
5303.4: abbrev-commit (1)       1.18(1.06+0.11)     1.17(1.02+0.14) -0.8%
5303.7: rev-list (50)           28.95(28.56+0.38)   29.50(29.17+0.32) +1.9%
5303.8: abbrev-commit (50)      3.67(3.56+0.10)     3.57(3.42+0.15) -2.7%
5303.11: rev-list (1000)        30.34(29.89+0.43)   30.82(30.35+0.46) +1.6%
5303.12: abbrev-commit (1000)   86.82(86.52+0.29)   77.82(77.59+0.22) -10.4%
5303.15: load 10,000 packs      0.08(0.02+0.05)     0.08(0.02+0.06) +0.0%  

It doesn't help at all when we have 1 pack (5303.4), but we get a 10% speedup when there are 1000 packs (5303.12).
That's a modest speedup for a case that's already slow and we'd hope to avoid in general (note how slow it is even after, because we have to look in each of those packs for abbreviations). But it's a one-line change that clearly matches the original intent, so it seems worth doing.

The included perf test may also be useful for keeping an eye on any regressions in the overall abbreviation code.
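
As a side note, if you want to see how many packs your own repository has (the slow cases in the table above involve hundreds or thousands of packs), the standard commands below report and consolidate them:

git count-objects -v   # the "packs:" line shows how many packfiles exist
git repack -a -d       # consolidate everything into a single pack

(A regular git gc does such a repack as part of its normal housekeeping.)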


Yes, it does. Locally, it's not much of a problem--though it does still affect several local commands, in particular when you are trying to describe a commit based on the available refs.

Over the network, Git does an initial ref advertisement when you connect to it for updates. You can learn about this in the pack protocol document. The problem here is that your network connection may be flaky or latent, and that initial advertisement can take a while as a result. There have been discussions of removing this requirement, but, as always, compatibility issues make it complicated. The most recent discussion about it is here.

You probably want to look at a recent discussion about Git scaling too. There are many ways in which you may want Git to scale, and the thread covers the majority of them so far. I think it gives you a good idea of what Git is good at and where it could use some work. I'd summarize it for you, but I don't think I could do it justice; there's a lot of useful information there.


As others have pointed out, branches and other refs are just files in the file system (except that's not quite true because of packed refs) and are pretty cheap, but that doesn't mean their number can't affect performance. See e.g. the Poor push performance with large number of refs thread on the Git mailing list for a recent (Dec 2014) example of Git performance being affected by having 20k refs in a repository.
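
As a quick illustration of the "refs are just files, except for packed refs" point, you can look at both storage forms directly (the paths below are the standard layout of a .git directory):

ls .git/refs/heads/      # one small file per loose branch ref
head .git/packed-refs    # many refs collapsed into a single file, if present
git pack-refs --all      # pack the loose refs into .git/packed-refs

Packing refs reduces the number of files Git has to open and stat, which is part of why a large ref count is usually cheaper than it sounds.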

If I recall correctly, some part of the ref processing was O(n²) a few years ago but that can very well have been fixed since. There's a repo-discuss thread from March 2012 that contains some potentially useful details, if perhaps dated and specific to JGit.

The also somewhat dated Scaling Gerrit article talks about (among other things) potential problems with high ref counts, but also notes that several sites have gits with over 100k refs. We have a git with ~150k refs and I don't think we're seeing any performance issues with it.

One aspect of having lots of refs is the size of the ref advertisement at the start of some Git transactions. The size of the advertisement of aforementioned 150k ref git is about 10 MB, i.e. every single git fetch operation is going to download that amount of data.
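
You can get a rough measurement for your own remote (origin is just the usual remote name; the byte count approximates the advertisement payload rather than the exact on-the-wire size):

git ls-remote origin | wc -l   # number of advertised refs
git ls-remote origin | wc -c   # rough size of the advertisement in bytes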

So yes, don't ignore the issue completely but you shouldn't lose any sleep over a mere 2000 refs.
