What is the git clone --filter option's syntax?

The format for filter-spec is defined in the options section of git rev-list --help. You can also see it on GitHub. Here is what it currently says:

--filter=<filter-spec>

Only useful with one of the --objects*; omits objects (usually blobs) from the list of printed objects. The <filter-spec> may be one of the following:

The form --filter=blob:none omits all blobs.

The form --filter=blob:limit=<n>[kmg] omits blobs larger than n bytes or units. n may be zero. The suffixes k, m, and g can be used to name units in KiB, MiB, or GiB. For example, blob:limit=1k is the same as blob:limit=1024.

The form --filter=sparse:oid=<blob-ish> uses a sparse-checkout specification contained in the blob (or blob-expression) <blob-ish> to omit blobs that would not be required for a sparse checkout on the requested refs.

The form --filter=sparse:path=<path> similarly uses a sparse-checkout specification contained in <path>.
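As a rough illustration of the first two forms, the sketch below builds a throwaway repository and counts what git rev-list --objects prints with and without a filter. The temporary directory, the file names, and the demo identity are assumptions made up for this example; only --filter=blob:none and --filter=blob:limit=<n>[kmg] come from the documentation quoted above.

```shell
# Throwaway repository with one small and one 4 KiB blob (names are illustrative).
repo=$(mktemp -d)
git -C "$repo" init -q
printf 'hello\n' > "$repo/small.txt"
head -c 4096 /dev/zero > "$repo/big.bin"
git -C "$repo" add small.txt big.bin
git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m 'demo commit'

# Count the lines printed: one commit, one root tree, plus the blobs that survive the filter.
count() { git -C "$repo" rev-list --objects "$@" HEAD | wc -l | tr -d ' '; }

all=$(count)
no_blobs=$(count --filter=blob:none)
small_only=$(count --filter=blob:limit=1k)

echo "unfiltered=$all blob:none=$no_blobs blob:limit=1k=$small_only"
```

The unfiltered count includes both blobs, blob:none drops them all, and blob:limit=1k keeps only the small file.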


What is the git clone --filter option's syntax?

This is at least clearer with Git 2.27 (Q2 2020).

Before that, here is a quick TL;DR example of that command, combined with a (cone) sparse-checkout:

#fastest clone possible:
git clone --filter=blob:none --no-checkout https://github.com/git/git
cd git
git sparse-checkout init --cone
git read-tree -mu HEAD

That checks out only the files in the top-level folder; subfolders are excluded by default.
The initial clone stays fast because git clone --filter=blob:none --no-checkout transfers no blobs and checks out no files.


Now, onto that git clone --filter option's syntax:

See commit 4a46544 (22 Mar 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit fa0c1eb, 22 Apr 2020)

clone: document --filter options

Signed-off-by: Derrick Stolee

It turns out that the "--filter=<filter-spec>" option is not documented anywhere in the "git clone" page, and instead is detailed carefully in "git rev-list" where it serves a different purpose.

Add a small bit about this option in the documentation. It would be worth some time to create a subsection in the "git clone" documentation about partial clone as a concept and how it can be a surprising experience. For example, "git checkout" will likely trigger a pack download.

The git clone documentation now includes:

--filter=<filter-spec>:

Use the partial clone feature and request that the server sends a subset of reachable objects according to a given object filter.

When using --filter, the supplied <filter-spec> is used for the partial clone filter.

For example, --filter=blob:none will filter out all blobs (file contents) until needed by Git.
Also, --filter=blob:limit=<size> will filter out all blobs of size at least <size>.
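To see what "until needed by Git" means without touching a remote host, a partial clone can be tried against a local repository served over file://. This is only a sketch: the paths and file names are invented, and uploadpack.allowFilter is enabled on the source repository on the assumption that the serving side must advertise the filter capability before a client can use it.

```shell
# Source repository standing in for the server (paths and contents are illustrative).
src=$(mktemp -d)
dst=$(mktemp -d)/partial
git -C "$src" init -q
printf 'one\n' > "$src/a.txt"
printf 'two\n' > "$src/b.txt"
git -C "$src" add a.txt b.txt
git -C "$src" -c user.name=demo -c user.email=demo@example.com commit -q -m 'demo commit'
git -C "$src" config uploadpack.allowFilter true

# Partial clone: commits and trees are transferred, blobs are left behind.
git clone -q --no-checkout --filter=blob:none "file://$src" "$dst"

# --missing=print lists referenced objects that are absent locally, prefixed with "?".
missing=$(git -C "$dst" rev-list --objects --missing=print HEAD | grep -c '^?' || true)
echo "missing blobs: $missing"
```

Each file in the demo commit shows up as a missing object, because its blob was filtered out of the clone.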

For more details on filter specifications, see the --filter option in git rev-list.


That option is less useful than I had hoped. (It can't be used to combine clone and filter-branch).

And yet this filtering mechanism extends the one associated with clone, which implements partial cloning (or narrow clone), introduced in Dec. 2017 with Git 2.16.

But your Git repo hosting server must support protocol v2, which for now (Oct. 2018) only GitLab supports.

Meaning you can use --filter with git clone, as a recent Git 2.20 patch illustrates (see below).

That filter was then added to git fetch in this patch series.
It is part of a new pack-protocol capability "filter", added to the fetch-pack and upload-pack negotiation.
See "filter" in Documentation/technical/pack-protocol, which refers to the rev-list options.

With Git 2.20 (Q4 2018), a partial clone that is configured to lazily fetch missing objects will on-demand issue a "git fetch" request to the originating repository to fill not-yet-obtained objects.
The request has been optimized for requesting a tree object (and not the leaf blob objects contained in it) by telling the originating repository that no blobs are needed.
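The on-demand fetch itself can be observed locally. The sketch below is an assumption-laden demo, not part of the patch: the repository paths and file name are invented, and both uploadpack.allowFilter and uploadpack.allowAnySHA1InWant are turned on so that the local "server" accepts the filter and a request for a single object.

```shell
# Local "server" repository; both uploadpack settings below are assumptions for this demo.
src=$(mktemp -d)
dst=$(mktemp -d)/partial
git -C "$src" init -q
printf 'lazy content\n' > "$src/file.txt"
git -C "$src" add file.txt
git -C "$src" -c user.name=demo -c user.email=demo@example.com commit -q -m 'demo commit'
git -C "$src" config uploadpack.allowFilter true
git -C "$src" config uploadpack.allowAnySHA1InWant true

git clone -q --no-checkout --filter=blob:none "file://$src" "$dst"

# Number of objects referenced by HEAD that are not present locally.
missing_count() {
    git -C "$dst" rev-list --objects --missing=print HEAD | grep -c '^?' || true
}

before=$(missing_count)
# Reading the blob makes Git fetch it from the promisor remote on demand.
content=$(git -C "$dst" cat-file -p HEAD:file.txt)
after=$(missing_count)

echo "missing before=$before after=$after content=$content"
```

The blob is missing right after the clone and present once cat-file has asked for it.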

See commit 4c7f956, commit 12f19a9 (03 Oct 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit fa54ccc, 19 Oct 2018)

fetch-pack: exclude blobs when lazy-fetching trees

A partial clone with missing trees can be obtained using "git clone --filter=tree:none <repo>".
In such a repository, when a tree needs to be lazily fetched, any tree or blob it directly or indirectly references is fetched as well, regardless of whether the original command required those objects, or if the local repository already had some of them.

This is because the fetch protocol, which the lazy fetch uses, does not allow clients to request that only the wanted objects be sent, which would be the ideal solution. This patch implements a partial solution: specify the "blob:none" filter, somewhat reducing the fetch payload.

This change has no effect when lazily fetching blobs (due to how filters work). And if lazily fetching a commit (such repositories are difficult to construct and is not a use case we support very well, but it is possible), referenced commits and trees are still fetched - only the blobs are not fetched.

The necessary code change is done in fetch_pack() instead of somewhere closer to where the "filter" instruction is written to the wire so that only one part of the code needs to be changed in order for users of all protocol versions to benefit from this optimization.

You can see further optimization with:

See commit e70a303, commit 6ab4055, commit 0177565, commit 99bcb88 (27 Sep 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 0527fba, 19 Oct 2018)

transport: allow skipping of ref listing

The get_refs_via_connect() function both performs the handshake (including determining the protocol version) and obtaining the list of remote refs.

However, the fetch protocol v2 supports fetching objects without the listing of refs, so make it possible for the user to skip the listing by creating a new handshake() function.


Note that the syntax evolved with Git 2.21 (Q1 2019), which updated the protocol message specification to allow only limited use of scaled quantities. This is to ensure that potential compatibility issues do not get out of hand.

See commit 87c2d9d (08 Jan 2019) by Josh Steadmon (steadmon).
See commit 8272f26, commit c813a7c (09 Jan 2019) by Matthew DeVore (matvore).
(Merged by Junio C Hamano -- gitster -- in commit 073312b, 05 Feb 2019)

filter-options: expand scaled numbers

When communicating with a remote server or a subprocess, use expanded numbers rather than numbers with scaling suffix in the object filter spec (e.g. "limit:blob=1k" becomes "limit:blob=1024").

Update the protocol docs to note that clients should always perform this expansion, to allow for more compatibility between server implementations.
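The expansion described here can be sketched as a small shell function. This is not Git's implementation, only an illustration under stated assumptions: the function name expand_filter_spec is hypothetical, it handles just the blob:limit=<n>[kmg] form from the rev-list documentation quoted earlier, and it passes any other spec through unchanged.

```shell
# Hypothetical helper: expand a k/m/g suffix in a blob:limit filter spec to plain bytes.
expand_filter_spec() {
    spec=$1
    case $spec in
        blob:limit=*)
            value=${spec#blob:limit=}      # e.g. "1k"
            number=${value%[kmg]}          # strip one trailing unit letter, if any
            case $value in
                *k) factor=1024 ;;
                *m) factor=$((1024 * 1024)) ;;
                *g) factor=$((1024 * 1024 * 1024)) ;;
                *)  factor=1 ;;
            esac
            printf 'blob:limit=%s\n' "$((number * factor))"
            ;;
        *)
            # Other filter specs carry no scaled number in this sketch.
            printf '%s\n' "$spec"
            ;;
    esac
}

expand_filter_spec blob:limit=1k
expand_filter_spec blob:limit=2m
expand_filter_spec blob:none
```

A client that always sends the expanded form means the server never has to interpret unit suffixes.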


As an aside, Git 2.23 (Q3 2019) considers the "invalid filter-spec" message user-facing and not a BUG, so it makes the message localizable.

See commit 5c03bc8 (31 May 2019) by Matthew DeVore (matvore).
(Merged by Junio C Hamano -- gitster -- in commit ca02d36, 21 Jun 2019)

list-objects-filter-options: error is localizeable

The "invalid filter-spec" message is user-facing and not a BUG, so make it localizeable.

For reference, the message appears in this context:

$ git rev-list --filter=blob:nonse --objects HEAD
fatal: invalid filter-spec 'blob:nonse'

With Git 2.24 (Q4 2019), the http transport, which lacked some optimization the native transports learned to avoid unnecessary ref advertisement, has been fixed.

See commit fddf2eb, commit ac3fda8 (21 Aug 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit f67bf53, 18 Sep 2019)

transport-helper: skip ls-refs if unnecessary

Commit e70a303 ("fetch: do not list refs if fetching only hashes", 2018-10-07, Git v2.20.0-rc0) and its ancestors taught Git, as an optimization, to skip the ls-refs step when it is not necessary during a protocol v2 fetch (for example, when lazy fetching a missing object in a partial clone, or when running "git fetch --no-tags <remote> <SHA-1>").
But that was only done for natively supported protocols; in particular, HTTP was not supported.

Teach Git to skip ls-refs when using remote helpers that support connect or stateless-connect.
To do this, fetch() is made an acceptable entry point.
Because fetch() can now be the first function in the vtable called, "get_helper(transport);" has to be added to the beginning of that function to set the transport up (if not yet set up) before process_connect() is invoked.


Another optimization comes with Git 2.24 (Q4 2019):

See commit d8bc1a5 (08 Oct 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit c7d2ced, 15 Oct 2019)

send-pack: never fetch when checking exclusions

Signed-off-by: Jonathan Tan

When building the packfile to be sent, send_pack() is given a list of remote refs to be used as exclusions.
For each ref, it first checks if the ref exists locally, and if it does, passes it with a "^" prefix to pack-objects.
However, in a partial clone, the check may trigger a lazy fetch.

The additional commit ancestry information obtained during such fetches may show that certain objects that would have been sent are already known to the server, resulting in a smaller pack being sent.
But this is at the cost of fetching from many possibly unrelated refs, and the lazy fetches do not help at all in the typical case where the client is up-to-date with the upstream of the branch being pushed.

Ensure that these lazy fetches do not occur.


Finally, Git 2.24 (Q4 2019) includes a last-minute work-around for a lazy fetch glitch, which illustrates one usage of the filter syntax.

See commit c7aadcc (23 Oct 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit c32ca69, 04 Nov 2019)

fetch: delay fetch_if_missing=0 until after config

Signed-off-by: Jonathan Tan

Suppose, from a repository that has ".gitmodules", we clone with --filter=blob:none:

git clone --filter=blob:none --no-checkout \
https://kernel.googlesource.com/pub/scm/git/git

Then we fetch:

git -C git fetch

This will cause a "unable to load config blob object", because the fetch_config_from_gitmodules() invocation in cmd_fetch() will attempt to load ".gitmodules" (which Git knows to exist because the client has the tree of HEAD) while fetch_if_missing is set to 0.

fetch_if_missing is set to 0 too early - ".gitmodules" here should be lazily fetched.

Git must set fetch_if_missing to 0 before the fetch because as part of the fetch, packfile negotiation happens (and we do not want to fetch any missing objects when checking existence of objects), but we do not need to set it so early.
Move the setting of fetch_if_missing to the earliest possible point in cmd_fetch(), right before any fetching happens.


With Git 2.25 (Q1 2020), debugging support for lazy cloning has been a bit improved.
git fetch v2 now makes good use of promisor files.

See commit 5374a29 (15 Oct 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 026587c, 10 Nov 2019)

fetch-pack: write fetched refs to .promisor

Signed-off-by: Jonathan Tan
Acked-by: Josh Steadmon

The specification of promisor packfiles (in partial-clone.txt) states that the .promisor files that accompany packfiles do not matter (just like .keep files), so whenever a packfile is fetched from the promisor remote, Git has been writing empty .promisor files.
But these files could contain more useful information.

So instead of writing empty files, write the refs fetched to these files.

This makes it easier to debug issues with partial clones, as we can identify what refs (and their associated hashes) were fetched at the time the packfile was downloaded, and if necessary, compare those hashes against what the promisor remote reports now.

This is implemented by teaching fetch-pack to write its own non-empty .promisor file whenever it knows the name of the pack's lockfile.
This covers the case wherein the user runs "git fetch" with an internal protocol or HTTP protocol v2 (fetch_refs_via_pack() in transport.c sets lock_pack) and with HTTP protocol v0/v1 (fetch_git() in remote-curl.c passes "--lock-pack" to "fetch-pack").
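For debugging, the .promisor files can simply be listed and printed. The sketch below assumes a local partial clone over file:// (paths and file name are invented, and uploadpack.allowFilter is enabled on the source for the demo). With a Git that includes this patch the files should list the fetched refs; older versions leave them empty.

```shell
# Local source repository and partial clone (all paths are illustrative).
src=$(mktemp -d)
dst=$(mktemp -d)/partial
git -C "$src" init -q
printf 'data\n' > "$src/file.txt"
git -C "$src" add file.txt
git -C "$src" -c user.name=demo -c user.email=demo@example.com commit -q -m 'demo commit'
git -C "$src" config uploadpack.allowFilter true
git clone -q --no-checkout --filter=blob:none "file://$src" "$dst"

# Each packfile fetched from the promisor remote has a matching .promisor file.
for f in "$dst"/.git/objects/pack/*.promisor; do
    echo "== $f"
    cat "$f"
done

promisor_files=$(ls "$dst"/.git/objects/pack/*.promisor | wc -l | tr -d ' ')
echo "promisor files: $promisor_files"
```

Comparing the hashes recorded there with the output of git ls-remote on the promisor remote is one way to apply the debugging approach the commit message describes.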


Before Git 2.29 (Q4 2020), fetching from a lazily cloned repository resulted at the server side in attempts to lazy fetch objects that the client side has, many of which will not be available from the third-party anyway.

See commit 77aa094 (16 Jul 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 37f382a, 30 Jul 2020)

upload-pack: do not lazy-fetch "have" objects

Signed-off-by: Jonathan Tan

When upload-pack receives a request containing "have" hashes, it (among other things) checks if the served repository has the corresponding objects. However, it does not do so with the OBJECT_INFO_SKIP_FETCH_OBJECT flag, so if serving a partial clone, a lazy fetch will be triggered first.

This was discovered at $DAYJOB when a user fetched from a partial clone (into another partial clone - although this would also happen if the repo to be fetched into is not a partial clone).

Therefore, whenever "have" hashes are checked for existence, pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag.
Also add the OBJECT_INFO_QUICK flag to improve performance, as it is typical that such objects do not exist in the serving repo, and the consequences of a false negative are minor (usually, a slightly larger pack sent).


With Git 2.29 (Q4 2020), the component that responds to "git fetch" requests is made more configurable, to selectively allow or reject the object filtering specifications used for partial cloning.

See commit 6cc275e (05 Aug 2020) by Jeff King (peff).
See commit 5b01a4e, commit 6dd3456 (03 Aug 2020), and commit b9ea214 (31 Jul 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 73a9255, 11 Aug 2020)

upload-pack.c: allow banning certain object filter(s)

Helped-by: Jeff King
Signed-off-by: Taylor Blau

Git clients may ask the server for a partial set of objects, where the set of objects being requested is refined by one or more object filters. Server administrators can configure 'git upload-pack' to allow or ban these filters by setting the 'uploadpack.allowFilter' variable to 'true' or 'false', respectively.

However, administrators using bitmaps may wish to allow certain kinds of object filters, but ban others. Specifically, they may wish to allow object filters that can be optimized by the use of bitmaps, while rejecting other object filters which aren't and represent a perceived performance degradation (as well as an increased load factor on the server).

Allow configuring 'git upload-pack' to support object filters on a case-by-case basis by introducing two new configuration variables:

  • 'uploadpackfilter.allow'
  • 'uploadpackfilter.<kind>.allow'

where '<kind>' may be one of 'blobNone', 'blobLimit', 'tree', and so on.

Setting the second configuration variable for any valid value of '<kind>' explicitly allows or disallows restricting that kind of object filter.

If a client requests the object filter <kind> and the respective configuration value is not set, 'git upload-pack' will default to the value of 'uploadpackfilter.allow', which itself defaults to 'true' to maintain backwards compatibility.
Note that this differs from 'uploadpack.allowfilter', which controls whether or not the 'filter' capability is advertised.

git config now includes in its man page:

uploadpackfilter.allow

Provides a default value for unspecified object filters (see: the below configuration variable).
Defaults to true.

uploadpackfilter.<filter>.allow

Explicitly allow or ban the object filter corresponding to <filter>, where <filter> may be one of: blob:none, blob:limit, tree, sparse:oid, or combine.
If using combined filters, both combine and all of the nested filter kinds must be allowed.
Defaults to uploadpackfilter.allow.
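A server-side setup along these lines could look like the sketch below. It is an illustration under assumptions: the repository is a throwaway local one reached over file://, the helper try_clone is hypothetical, and the policy (ban everything except blob:none) is just one possible choice. On a Git older than 2.29 the uploadpackfilter.* settings are not known, so both clones would be accepted there.

```shell
# Throwaway "server" repository and a directory for the test clones (illustrative paths).
server=$(mktemp -d)
clones=$(mktemp -d)
git -C "$server" init -q
printf 'data\n' > "$server/file.txt"
git -C "$server" add file.txt
git -C "$server" -c user.name=demo -c user.email=demo@example.com commit -q -m 'demo commit'

# Advertise the filter capability, ban every filter kind by default,
# then explicitly allow blob:none only.
git -C "$server" config uploadpack.allowFilter true
git -C "$server" config uploadpackfilter.allow false
git -C "$server" config uploadpackfilter.blob:none.allow true

# Hypothetical helper: $1 = filter spec, $2 = target directory name.
try_clone() {
    if git clone -q --no-checkout --filter="$1" "file://$server" "$clones/$2" 2>/dev/null; then
        echo accepted
    else
        echo rejected
    fi
}

blob_none_result=$(try_clone blob:none allowed)
tree_result=$(try_clone tree:0 banned)
echo "blob:none: $blob_none_result, tree:0: $tree_result"
```

With these settings a blob:none clone should go through, while a tree:0 clone should be refused by a server that honors the per-filter configuration.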


With Git 2.29 (Q4 2020), this lazy/partial clone --filter works even with a submodule, when transfer.fsckobjects is set.

See commit 1b03df5 (20 Aug 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 63728e4, 31 Aug 2020)

fetch-pack: in partial clone, pass --promisor

Signed-off-by: Jonathan Tan

When fetching a pack from a promisor remote, the corresponding .promisor file needs to be created.
"fetch-pack" originally did this by passing "--promisor" to "index-pack", but in 5374a290aa ("fetch-pack: write fetched refs to .promisor", 2019-10-16, Git v2.25.0-rc0 -- merge listed in batch #1), "fetch-pack" was taught to do this itself instead, because it needed to store ref information in the .promisor file.

This causes a problem with superprojects when transfer.fsckobjects is set, because in the current implementation, it is "index-pack" that calls fsck_finish() to check the objects; before 5374a290aa, fsck_finish() would see that .gitmodules is a promisor object and tolerate it being missing, but after, there is no .promisor file (at the time of the invocation of fsck_finish() by "index-pack") to tell it that .gitmodules is a promisor object, so it returns an error.

Therefore, teach "fetch-pack" to pass "--promisor" to index pack once again.
"fetch-pack" will subsequently overwrite this file with the ref information.

An alternative is to instead move object checking to "fetch-pack", and let "index-pack" only index the files.
However, since "index-pack" has to inflate objects in order to index them, it seems reasonable to also let it check the objects (which also require inflated files).


Git 2.30 (Q1 2021) fixes potential server-side resource deallocation issues when responding to a partial clone request.

See commit 8d133f5, commit aab179d (03 Dec 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 21127fa, 17 Dec 2020)

upload-pack.c: don't free allowed_filters util pointers

Signed-off-by: Taylor Blau

To keep track of which object filters are allowed or not, 'git upload-pack' stores the name of each filter in a string_list, and sets its ->util pointer to be either 0 or 1, indicating whether it is banned or allowed.

Later on, we attempt to clear that list, but we incorrectly ask for the util pointers to be free()'d, too. This behavior (introduced back in 6dd3456a8c ("upload-pack.c: allow banning certain object filter(s)", 2020-08-03, Git v2.29.0-rc0 -- merge listed in batch #6)) leads to an invalid free, and causes us to crash.

In order to trigger this, one needs to fetch from a server that
(a) has at least one object filter allowed, and
(b) issue a fetch that contains a subset of the allowed filters (i.e., we cannot ask for a banned filter, since this causes us to die() before we hit the bogus string_list_clear()).

In that case, whatever banned filters exist will cause a noop free() (since those ->util pointers are set to 0), but the first allowed filter we try to free will crash us.

We never noticed this in the tests because we didn't have an example of setting 'uploadPackFilter' configuration variables and then following up with a valid fetch. The first new 'git clone' prevents further regression here. For good measure on top, add a test which checks the same behavior at a tree depth greater than 0.
