How to use git sparse-checkout in 2.27+

I believe I found the reason for this. Commit f56f31af0301 to Git changed the implementation of sparse-checkout so that, when you have an uninitialized working tree (as you would right after running git clone --no-checkout), running git sparse-checkout init will not check out any files into your working tree. In previous versions, the command would actually check out files, which could have unexpected effects given that you wouldn't have an active branch at that point.

The relevant commit, f56f31af0301, was included in Git 2.27, but not in 2.25. That accounts for why the behavior you see is not the behavior shown on the web page you're trying to follow. Basically, the behavior on the web page was a bug that nobody realized was a bug at the time, and Git 2.27 fixed it.
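As a rough sketch of what this means in practice on Git 2.27+: after a --no-checkout clone, the sparse-checkout commands only record the patterns, and an explicit checkout is what populates the working tree. The throwaway repository below (README, wanted/, other/, branch main) is a made-up layout used purely for illustration.

```shell
set -eu
work=$(mktemp -d)
src="$work/src"
dst="$work/clone"

# Build a small throwaway repository to clone from (hypothetical layout).
git init -q "$src"
git -C "$src" symbolic-ref HEAD refs/heads/main
mkdir -p "$src/wanted" "$src/other"
echo root > "$src/README"
echo a > "$src/wanted/a.txt"
echo b > "$src/other/b.txt"
git -C "$src" add .
git -C "$src" -c user.name=demo -c user.email=demo@example.invalid commit -q -m init

# Clone without checking anything out, then describe the sparse cone.
git clone -q --no-checkout "$src" "$dst"
cd "$dst"
git sparse-checkout init --cone
git sparse-checkout set wanted

# On Git 2.27+ the working tree is still empty here;
# this explicit checkout is the step that populates it.
git checkout -q main
ls
```

After the final checkout, only the root-level files and wanted/ should be present; other/ stays out of the working tree.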

This is explained very well, I think, in the message for commit b5bfc08a972d:

So...that brings us to the special case: a git clone performed with --no-checkout. As per the meaning of the flag, --no-checkout does not check out any branch, with the implication that you aren't on one and need to switch to one after the clone. Implementationally, HEAD is still set (so in some sense you are partially on a branch), but

  • the index is "unborn" (non-existent)
  • there are no files in the working tree (other than .git/)
  • the next time git switch (or git checkout) is run it will run unpack_trees with initial_checkout flag set to true.

It is not until you run, e.g. git switch <somebranch> that the index will be written and files in the working tree populated.

With this special --no-checkout case, the traditional read-tree -mu HEAD behavior would have done the equivalent of acting like checkout -- switch to the default branch (HEAD), write out an index that matches HEAD, and update the working tree to match. This special case slipped through the avoid-making-changes checks in the original sparse-checkout command and thus continued there.

After update_sparsity() was introduced and used (see commit f56f31a ("sparse-checkout: use new update_sparsity() function", 2020-03-27)), the behavior for the --no-checkout case changed: Due to git's auto-vivification of an empty in-memory index (see do_read_index() and note that must_exist is false), and due to sparse-checkout's update_working_directory() code to always write out the index after it was done, we got a new bug. That made it so that sparse-checkout would switch the repository from a clone with an "unborn" index (i.e. still needing an initial_checkout), to one that had a recorded index with no entries. Thus, instead of all the files appearing deleted in git status being known to git as a special artifact of not yet being on a branch, our recording of an empty index made it suddenly look to git as though it was definitely on a branch with ALL files staged for deletion! A subsequent checkout or switch then had to contend with the fact that it wasn't on an initial_checkout but had a bunch of staged deletions.
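The "unborn" index described in that commit message can be observed directly. This is only a sketch under assumptions: it uses a made-up throwaway repository (README, wanted/, branch main) and simply checks whether .git/index exists before and after the first checkout.

```shell
set -eu
work=$(mktemp -d)
src="$work/src"
dst="$work/clone"

# Build a small throwaway repository to clone from (hypothetical layout).
git init -q "$src"
git -C "$src" symbolic-ref HEAD refs/heads/main
mkdir -p "$src/wanted"
echo root > "$src/README"
echo a > "$src/wanted/a.txt"
git -C "$src" add .
git -C "$src" -c user.name=demo -c user.email=demo@example.invalid commit -q -m init

git clone -q --no-checkout "$src" "$dst"
cd "$dst"

# Right after a --no-checkout clone there is no index file yet.
if [ -e .git/index ]; then index_before=present; else index_before=absent; fi

# The first checkout writes the index and populates the working tree.
git checkout -q main
if [ -e .git/index ]; then index_after=present; else index_after=absent; fi

echo "index before checkout: $index_before"
echo "index after checkout:  $index_after"
```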


With Git 2.35 (Q1 2022), the "init" and "set" subcommands in "git sparse-checkout"(man) have been unified for a better user experience and performance.

See commit dfac9b6 (23 Dec 2021), and commit d359541, commit d30e2bb, commit ba2f3f5, commit 4e25673, commit f2e3a21, commit be61fd1, commit f85751a, commit 45c5e47, commit 0b624e0, commit 1530ff3 (14 Dec 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 2dc94da, 03 Jan 2022)

sparse-checkout: enable set to initialize sparse-checkout mode

Reviewed-by: Derrick Stolee
Reviewed-by: Victoria Dye
Signed-off-by: Elijah Newren

The previously suggested workflow:

  git sparse-checkout init ...
  git sparse-checkout set ...

Suffered from three problems:

  1. It would delete nearly all files in the first step, then restore them in the second.
    That was poor performance and forced unnecessary rebuilds.
  2. The two-step process resulted in two progress bars, which was suboptimal from a UI point of view for wrappers that invoked both of these commands but only exposed a single command to their end users.
  3. With cone mode, the first step would delete nearly all ignored files everywhere, because everything was considered to be outside of the specified sparsity paths.
    (The user was not allowed to specify any sparsity paths in the init step.)

Avoid these problems by teaching set to understand the extra parameters that init takes and performing any necessary initialization if not already in a sparse checkout.
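A minimal sketch of the single-step form, assuming Git 2.35 or later (older versions do not treat --cone as an option of "set"). The repository layout (README, wanted/, other/, branch main) is invented for the example.

```shell
set -eu
work=$(mktemp -d)
src="$work/src"
dst="$work/clone"

# Build a small throwaway repository to clone from (hypothetical layout).
git init -q "$src"
git -C "$src" symbolic-ref HEAD refs/heads/main
mkdir -p "$src/wanted" "$src/other"
echo root > "$src/README"
echo a > "$src/wanted/a.txt"
echo b > "$src/other/b.txt"
git -C "$src" add .
git -C "$src" -c user.name=demo -c user.email=demo@example.invalid commit -q -m init

# Ordinary clone: everything is checked out at first.
git clone -q "$src" "$dst"
cd "$dst"

# One command enables sparse-checkout and sets the cone (Git 2.35+),
# instead of "init --cone" followed by "set wanted".
git sparse-checkout set --cone wanted

git sparse-checkout list
ls
```

Because no separate init step runs first, the working tree is reduced to the requested cone in one pass rather than being emptied and then repopulated.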


I mentioned before in "Why do excluded files keep reappearing in my git sparse checkout?" how any skip-worktree file should not be modified or even looked at during a sparse checkout anymore with Git 2.27+.

But with the new sparse-index feature added in Git 2.32 (Q2 2021), that changes again.

See "Make your monorepo feel small with Git’s sparse index" from Derrick Stolee.

sparse-index

See commit 4589bca, commit 71f82d0, commit 5f11669 (12 Apr 2021), commit f5fed74, commit dc26b23, commit 0c18c05, commit 465a04a, commit f7ef64b, commit 3450a30, commit d425f65, commit 2508df0, commit a029120, commit e43e2a1, commit 299e2c4, commit 42f44e8, commit 46eb6e3, commit 2227ea1, commit 48b3c7d, commit cb8388d, commit 0f6d3ba, commit 1b850d3, commit 54beed2, commit 118a2e8, commit 95e0321, commit 847a9e5, commit 839a663 (01 Apr 2021), and commit c9e40ae, commit 9ad2d5e, commit 2de37c5, commit dcc5fd5, commit 122ba1f, commit 58300f4, commit 0938e6f, commit 13e1331, commit f442313, commit 6e77352, commit cd42415, commit 836e25c, commit 6863df3, commit 2782db3, commit e2df6c3, commit ecfc47c, commit 4300f84, commit 3964fc2, commit 4b3f765, commit 0b5fcb0, commit 0ad6090 (30 Mar 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 8e97852, 30 Apr 2021)

sparse-index: design doc and format update

Signed-off-by: Derrick Stolee

This begins a long effort to update the index format to allow sparse directory entries.
This should result in a significant improvement to Git commands when HEAD contains millions of files, but the user has selected many fewer files to keep in their sparse-checkout definition.

Currently, the index format is only updated in the presence of extensions.sparseIndex instead of increasing a file format version number.
This is temporary, and index v5 is part of the plan for future work in this area.

The design document details many of the reasons for embarking on this work, and also the plan for completing it safely.

technical/index-format now includes in its man page:

An index entry typically represents a file. However, if sparse-checkout is enabled in cone mode (core.sparseCheckoutCone is enabled) and the extensions.sparseIndex extension is enabled, then the index may contain entries for directories outside of the sparse-checkout definition. These entries have mode 040000, include the SKIP_WORKTREE bit, and the path ends in a directory separator.

technical/sparse-index now includes in its man page:

Git Sparse-Index Design Document

The sparse-checkout feature allows users to focus a working directory on a subset of the files at HEAD. The cone mode patterns, enabled by core.sparseCheckoutCone, allow for very fast pattern matching to discover which files at HEAD belong in the sparse-checkout cone.

Three important scale dimensions for a Git working directory are:

  • HEAD: How many files are present at HEAD?

  • Populated: How many files are within the sparse-checkout cone.

  • Modified: How many files has the user modified in the working directory?

We will use big-O notation -- O(X) -- to denote how expensive certain operations are in terms of these dimensions.

These dimensions are ordered by their magnitude: users (typically) modify fewer files than are populated, and we can only populate files at HEAD.

Problems occur if there is an extreme imbalance in these dimensions. For example, if HEAD contains millions of paths but the populated set has only tens of thousands, then commands like git status and git add can be dominated by operations that require O(HEAD) operations instead of O(Populated). Primarily, the cost is in parsing and rewriting the index, which is filled primarily with files at HEAD that are marked with the SKIP_WORKTREE bit.

The sparse-index intends to take these commands that read and modify the index from O(HEAD) to O(Populated).

To do this, we need to modify the index format in a significant way: add "sparse directory" entries.

With cone mode patterns, it is possible to detect when an entire directory will have its contents outside of the sparse-checkout definition. Instead of listing all of the files it contains as individual entries, a sparse-index contains an entry with the directory name, referencing the object ID of the tree at HEAD and marked with the SKIP_WORKTREE bit. If we need to discover the details for paths within that directory, we can parse trees to find that list.

So you have a new option for git sparse-checkout init: --[no-]sparse-index

sparse-checkout: toggle sparse index from builtin

Signed-off-by: Derrick Stolee

The sparse index extension is used to signal that index writes should be in sparse mode.
This was only updated using GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init'(man) that specifies if the sparse index should be used.
It also updates the index to use the correct format, either way.
Add a warning in the documentation that the use of a repository extension might reduce compatibility with third-party tools.
'git sparse-checkout init' already sets extension.worktreeConfig, which places most sparse-checkout users outside of the scope of most third-party tools.

git sparse-checkout now includes in its man page:

Use the --[no-]sparse-index option to toggle the use of the sparse index format.

This reduces the size of the index to be more closely aligned with your sparse-checkout definition.

This can have significant performance advantages for commands such as git status or git add. This feature is still experimental. Some commands might be slower with a sparse index until they are properly integrated with the feature.

WARNING: Using a sparse index requires modifying the index in a way that is not completely understood by external tools. If you have trouble with this compatibility, then run git sparse-checkout init --no-sparse-index to rewrite your index to not be sparse.

Older versions of Git will not understand the sparse directory entries index extension and may fail to interact with your repository until it is disabled.


With Git 2.33 (Q3 2021), "git status"(man) codepath learned to work with sparsely populated index without hydrating it fully.

See commit e5ca291, commit f8fe49e, commit fe0d576, commit d76723e, commit bf48e5a, commit 9eb00af, commit 69bdbdb, commit 523506d, commit bd6a3fd, commit cd807a5, commit 17a1bb5, commit bf26c06, commit e669ffb, commit 3d814b5, commit 4741077, commit fc6609d (14 Jul 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit b271a30, 28 Jul 2021)

status: skip sparse-checkout percentage with sparse-index

Reviewed-by: Elijah Newren
Signed-off-by: Derrick Stolee

'git status'(man) began reporting a percentage of populated paths when sparse-checkout is enabled in 051df3c ("wt-status: show sparse checkout status as well", 2020-07-18, Git v2.28.0-rc0 -- merge listed in batch #7).
This percentage is incorrect when the index has sparse directories.
It would also be expensive to calculate as we would need to parse trees to count the total number of possible paths.

Avoid the expensive computation by simplifying the output to only report that a sparse checkout exists, without the percentage.

This change is the reason we use 'git status' --porcelain=v2 in t1092-sparse-checkout-compatibility.sh.
We don't want to ensure that this message is equal across both modes, but instead just the important information about staged, modified, and untracked files are compared.


Warning: Recent sparse-index work broke safety against attempts to add paths with trailing slashes to the index, which has been corrected with Git 2.34 (Q4 2021).

See commit c8ad9d0, commit 2a1ae64, commit fc5e90b (07 Oct 2021) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit a86ed75, 18 Oct 2021)

read-cache: let verify_path() reject trailing dir separators again

Signed-off-by: René Scharfe

6e77352 ("sparse-index: convert from full to sparse", 2021-03-30, Git v2.32.0-rc0 -- merge listed in batch #13) made verify_path() accept trailing directory separators for directories, which is necessary for sparse directory entries.
This clemency causes "git stash"(man) to stumble over sub-repositories, though, and there may be more unintended side-effects.

Avoid them by restoring the old verify_path() behavior and accepting trailing directory separators only in places that are supposed to handle sparse directory entries.


With Git 2.35 (Q1 2022), ensure that the sparseness of the in-core index matches the index.sparse configuration specified by the repository immediately after the on-disk index file is read.

See commit 7ca4fc8, commit b93fea0, commit 13f69f3, commit 336d82e (23 Nov 2021) by Victoria Dye (vdye).
(Merged by Junio C Hamano -- gitster -- in commit 5396d7b, 10 Dec 2021)

sparse-index: update do_read_index to ensure correct sparsity

Helped-by: Junio C Hamano
Co-authored-by: Derrick Stolee
Signed-off-by: Victoria Dye
Reviewed-by: Elijah Newren

Unless command_requires_full_index forces index expansion, ensure in-core index sparsity matches config settings on read by calling ensure_correct_sparsity.
This makes the behavior of the in-core index more consistent between different methods of updating sparsity: manually changing the index.sparse config setting vs.
executing git sparse-checkout --[no-]sparse-index init(man)

Although index sparsity is normally updated with git sparse-checkout init, ensuring correct sparsity after a manual index.sparse change has some practical benefits:

  1. It allows for command-by-command sparsity toggling with -c index.sparse=<true|false>, e.g. when troubleshooting issues with the sparse index.
  2. It prevents users from experiencing abnormal slowness after setting index.sparse to true due to use of a full index in all commands until the on-disk index is updated.

Warning: before Git 2.35 (Q1 2022), the sparse-index/sparse-checkout feature had a bug in its use of the matching code to determine which path is in or outside the sparse checkout patterns.

See commit 8c5de0d, commit 1b38efc (06 Dec 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit e1d9288, 15 Dec 2021)

unpack-trees: use traverse_path instead of name

Reported-by: Gustave Granroth
Reported-by: Mike Marcelais
Signed-off-by: Derrick Stolee

The sparse_dir_matches_path() method compares a cache entry that is a sparse directory entry against a 'struct traverse_info *info' and a 'struct name_entry *p' to see if the cache entry has exactly the right name for those other inputs.

This method was introduced in 523506d ("unpack-trees: unpack sparse directory entries", 2021-07-14, Git v2.33.0-rc0 -- merge listed in batch #7), but included a significant mistake.
The path comparisons used 'info->name' instead of 'info->traverse_path'.
Since 'info->name' only stores a single tree entry name while 'info->traverse_path' stores the full path from root, this method does not work when 'info' is in a subdirectory of a directory.
Replacing the right strings and their corresponding lengths make the method work properly.

The previous change included a failing test that exposes this issue.
That test now passes.
The critical detail is that as we go deep into unpack_trees(), the logic for merging a sparse directory entry with a tree entry during 'git checkout'(man) relies on this sparse_dir_matches_path() in order to avoid calling traverse_trees_recursive() during unpack_callback() in this hunk:

if (!is_sparse_directory_entry(src[0], names, info) &&
    traverse_trees_recursive(n, dirmask, mask & ~dirmask,
                  names, info) < 0) {
  return -1;
}

For deep paths, the short-circuit never occurred and traverse_trees_recursive() was being called incorrectly and that was causing other strange issues.
Specifically, the error message from the now-passing test previously included this:

error: Your local changes to the following files would be overwritten by checkout:
        deep/deeper1/deepest2/a
        deep/deeper1/deepest3/a
Please commit your changes or stash them before you switch branches.
Aborting

These messages occurred because the 'current' cache entry in twoway_merge() was showing as NULL because the index did not contain entries for the paths contained within the sparse directory entries.
We instead had 'oldtree' given as the entry at HEAD and 'newtree' as the entry in the target tree.
This led to reject_merge() listing these paths.


With Git 2.35 (Q1 2022), teach diff and blame to work well with sparse index.

See commit add4c86, commit 51ba65b, commit 338e2a9, commit 44c7e62, commit 27a443b, commit 0803f9c, commit e5b17bd (06 Dec 2021) by Lessley Dennington (ldennington).
See commit ea6ae41 (29 Nov 2021) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 8d2c373, 21 Dec 2021)

blame: enable and test the sparse index

Signed-off-by: Lessley Dennington
Reviewed-by: Elijah Newren

Enable the sparse index for the 'git blame'(man) command.
The index was already not expanded with this command, so the most interesting thing to do is to add tests that verify that 'git blame' behaves correctly when the sparse index is enabled and that its performance improves.
More specifically, these cases are:

  1. The index is not expanded for 'blame' when given paths in the sparse checkout cone at multiple levels.

  2. Performance measurably improves for 'blame' with sparse index when given paths in the sparse checkout cone at multiple levels.

We do not include paths outside the sparse checkout cone because blame does not support blaming files that are not present in the working directory.
This is true in both sparse and full checkouts.

And:

diff: enable and test the sparse index

Co-authored-by: Derrick Stolee
Signed-off-by: Lessley Dennington
Reviewed-by: Elijah Newren

Enable the sparse index within the 'git diff'(man) command.
Its implementation already safely integrates with the sparse index because it shares code with the 'git status'(man) and 'git checkout'(man) commands that were already integrated.
For more details see:

d76723e ("status: use sparse-index throughout", 2021-07-14, Git v2.33.0-rc0 -- merge listed in batch #7)
1ba5f45 ("checkout: stop expanding sparse indexes", 2021-06-29, Git v2.33.0-rc1 -- merge)

The most interesting thing to do is to add tests that verify that 'git diff' behaves correctly when the sparse index is enabled.
These cases are:

  1. The index is not expanded for 'diff' and 'diff --staged'
  2. 'diff' and 'diff --staged' behave the same in full checkout, sparse checkout, and sparse index repositories in the following partially-staged scenarios (i.e. the index, HEAD, and working directory differ at a given path):
      1. Path is within sparse-checkout cone
      2. Path is outside sparse-checkout cone
      3. A merge conflict exists for paths outside sparse-checkout cone

Here is a solution that will populate only files in the root folder:

$ git clone --filter=blob:none --sparse https://github.com/derrickstolee/sparse-checkout-example

Then subsequent sparse-checkout calls work like a charm.
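For anyone who wants to try this without network access, here is a local sketch of the same idea. It clones a made-up throwaway repository instead of the GitHub example and leaves out --filter=blob:none, since that flag only matters against a remote that supports partial clone.

```shell
set -eu
work=$(mktemp -d)
src="$work/src"
dst="$work/clone"

# Build a small throwaway repository to clone from (hypothetical layout).
git init -q "$src"
git -C "$src" symbolic-ref HEAD refs/heads/main
mkdir -p "$src/wanted" "$src/other"
echo root > "$src/README"
echo a > "$src/wanted/a.txt"
echo b > "$src/other/b.txt"
git -C "$src" add .
git -C "$src" -c user.name=demo -c user.email=demo@example.invalid commit -q -m init

# --sparse checks out only the files in the root folder.
git clone -q --sparse "$src" "$dst"
cd "$dst"
root_only=$(ls)
echo "after clone --sparse: $root_only"

# A later sparse-checkout call then adds the directory you need.
git sparse-checkout set wanted
ls
```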

Still no idea why the tutorial is broken.
