Understanding git gc --auto

One of the main points of gc --auto is that it should be very quick, so that other commands can call it frequently “just in case”. To achieve that, the number of loose objects is only estimated, not counted. As git help config says under gc.auto:

When there are approximately more than this many loose objects in the repository […]

Looking at the code (too_many_loose_objects() in builtin/gc.c), here’s what happens:

1. The value of gc.auto is divided by 256 and rounded up.
2. The folder that contains all the objects whose hash starts with 17 (.git/objects/17) is opened.
3. If that folder contains more objects than the result of step 1, gc runs.
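The steps above can be sketched in Python. This is only an illustration of the heuristic as described here, not Git's actual C code: the function name is borrowed from the source, while the parameters, the filter on 38-character hexadecimal file names (the part of a SHA-1 after the two-character folder name) and the handling of gc.auto = 0 are my reading of the behaviour and should be treated as assumptions.

```python
import os


def too_many_loose_objects(git_dir, gc_auto=6700):
    """Rough rendering of the sampling heuristic (a sketch, not Git's code)."""
    if gc_auto <= 0:
        # Assumption: a non-positive gc.auto disables the loose-object check.
        return False

    # Step 1: divide gc.auto by 256, rounding up.
    threshold = (gc_auto + 255) // 256

    # Step 2: open the one sample folder, objects/17.
    sample_dir = os.path.join(git_dir, "objects", "17")
    try:
        names = os.listdir(sample_dir)
    except FileNotFoundError:
        return False  # no such folder means no loose objects in the sample

    # Step 3: count entries that look like loose objects and compare.
    hex_digits = set("0123456789abcdef")
    count = sum(1 for n in names if len(n) == 38 and set(n) <= hex_digits)
    return count > threshold
```

Calling too_many_loose_objects(".git", 250) inside a repository would then answer the same question gc --auto asks, at the cost of listing a single folder.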

This works because SHA-1 hashes are evenly distributed, so “all the objects that start with X” is a representative sample of the whole set. Of course, this only works well for a large number of objects. Too lazy to do the maths, I would guess at least 3000. With 6700 (the default value of gc.auto), this should already work quite reliably.

The core question for me is why you need such a low setting, and whether it really matters that gc runs at exactly 250 objects. With a setting of 250, the result of step 1 is 1, so gc will run as soon as you have 2 loose objects that start with 17. Assuming each object lands in that folder with probability 1/256, the chance that this has happened is about 68% for 600 objects, about 82% for 800 objects, and about 90% for 1000 objects.
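For anyone who wants to check such probabilities themselves, here is a minimal sketch. It assumes that every loose object falls into folder 17 independently with probability 1/256, so the number of objects in that folder follows a binomial distribution. The function name prob_gc_triggers and its parameters are made up for this illustration.

```python
from math import comb


def prob_gc_triggers(gc_auto, n_loose, p=1 / 256):
    """Probability that the sample folder holds more than ceil(gc_auto / 256)
    objects when the repository has n_loose loose objects in total."""
    threshold = (gc_auto + 255) // 256
    # P(X <= threshold) for X ~ Binomial(n_loose, p)
    p_at_most = sum(
        comb(n_loose, k) * p**k * (1 - p) ** (n_loose - k)
        for k in range(threshold + 1)
    )
    return 1.0 - p_at_most


if __name__ == "__main__":
    # Print the trigger probability for gc.auto = 250 at several object counts.
    for n in (250, 600, 800, 1000):
        print(n, round(prob_gc_triggers(250, n), 3))
```

The same function can be called with other gc.auto values and object counts, for example prob_gc_triggers(6700, 6700), to see how sharp the estimate becomes for larger settings.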

Update: Couldn’t help it – had to do the math :). I was wondering how well that estimation system would work. Here’s a plot of the results. For any given gc.auto, how high is the probability that gc will start when there are gc.auto (red) / gc.auto * 1.1 (green) / gc.auto * 1.2 (orange) / gc.auto * 1.5 (blue) / gc.auto * 2 (purple) loose objects in the repo?

Plot of the results

Note that gc --auto is more robust as of Git 2.12.2 (released March 2017).

See commit a831c06 (10 Feb 2017) by David Turner (csusbdt).
Helped-by: Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit d30ec1b, 21 Mar 2017)

gc: ignore old gc.log files

A server can end up in a state where there are lots of unreferenced loose objects (say, because many users are doing a bunch of rebasing and pushing their rebased branches).
Running "git gc --auto" in this state would cause a gc.log file to be created, preventing future auto gcs, causing pack files to pile up.
Since many git operations are O(n) in the number of pack files, this would lead to poor performance.

Git should never get itself into a state where it refuses to do any maintenance, just because at some point some piece of the maintenance didn't make progress.

Teach Git to ignore gc.log files which are older than (by default) one day old, which can be tweaked via the gc.logExpiry configuration variable.
That way, these pack files will get cleaned up, if necessary, at least once per day. And operators who find a need for more-frequent gcs can adjust gc.logExpiry to meet their needs.
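To make the idea in that commit message concrete, here is a small sketch of the age check it describes. It is not Git's implementation: the function name, its parameters, and the use of the file's modification time to measure age are assumptions made for illustration, with one day standing in for the default gc.logExpiry.

```python
import os
import time

# Assumed stand-in for the default gc.logExpiry of one day.
ONE_DAY_SECONDS = 24 * 60 * 60


def gc_log_blocks_auto_gc(git_dir, expiry_seconds=ONE_DAY_SECONDS, now=None):
    """Return True if a gc.log exists and is recent enough to block auto gc."""
    log_path = os.path.join(git_dir, "gc.log")
    try:
        st = os.stat(log_path)
    except FileNotFoundError:
        return False  # no log from an earlier run, nothing blocks gc

    if now is None:
        now = time.time()
    # A log older than the expiry is ignored, so gc gets another chance.
    return (now - st.st_mtime) < expiry_seconds
```

With this rule a failed run can hold back further automatic runs for at most the expiry period, which matches the "at least once per day" guarantee in the message.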

Note: since Git 2.17 (Q2 2018), git gc --auto will run on each git commit too.
See "List of all commands that cause git gc --auto".

There is also a pre-auto-gc hook, which git gc --auto invokes before it does any work.
