What does "git ls-files" do exactly and how do we remove a file from it?

Summary

You need to wrap your head around the idea that Git stores at least three, and sometimes up to five active copies of each file: one in the current commit, one (or two or three!) in the index, and one—the only one you can see and work with—in your work-tree. The git ls-files command looks at these copies, then tells you something about some of them, depending on the flags you supply to git ls-files.

Without this idea of three-to-five copies of each file, lots of things in Git will never make any sense. (Well, some things are still tricky even with it, but that's another problem entirely. 😀)

Long

I think there are two issues here. One requires some terminology and then the other should fall into place:

Does [git ls-files] show files from the local repository,

Sort of, but:

the staging repository,

Git does not have a staging repository. Each repository has something that is called, in different Git documentation, either the index or the staging area. (There's an obsoleted third name, cache, that also appears in the Git glossary.)

the remote repository

Definitely not: there need not be any remote repositories—i.e., other Gits with their own repositories—at all, and if there are, only git fetch and git push have your Git call up their Git and exchange data with them. (Well, git ls-remote does the first little bit of git fetch, and git pull runs git fetch, so these two also exchange data with a remote. But git ls-files doesn't.)

or from somewhere else?

Yes, sort of. That gets us back to the first part. So let's take these three bits of terminology as defined in the Git glossary. Italic (including bold italic) text below is quoted directly from the linked documentation:

  • repository

    A collection of refs together with an object database containing all objects which are reachable from the refs, possibly accompanied by meta data from one or more porcelains. A repository can share an object database with other repositories via alternates mechanism. (all links theirs)

    This of course is full of yet more terminology. To attempt to de-mystify it a bit, what they're saying here is that the repository proper doesn't include the index and work-tree: it's mostly made up of the commits (and their contents). Of course, that requires that we define "index" and "work-tree", so let's move on to:

  • index

    A collection of files with stat information, whose contents are stored as objects. The index is a stored version of your working tree. Truth be told, it can also contain a second, and even a third version of a working tree, which are used when merging.

  • working tree (I usually call this work-tree):

    The tree of actual checked out files. The working tree normally contains the contents of the HEAD commit’s tree, plus any local changes that you have made but not yet committed.

Commits are frozen forever

When you run git commit, Git makes a snapshot of all of your files—well, all of your tracked files, anyway—and stores that, plus some metadata like your name and email address, in a commit. This commit is mostly permanent—you can get rid of commits, usually with a fair bit of difficulty, but just think of them as permanent for convenience—and is totally, completely, 100% read-only. It's read-only like this on purpose, because that allows other commits to share identical copies of files, so that if you commit the same file once, ten times, or even a million times, there's really only one copy of that file in the repository. It's only when you change the file to a new version that Git has to commit a new, separate copy.
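One way to see this sharing for yourself is to compare the blob hash IDs that two successive commits record for a file that did not change. This is only a minimal sketch: the throwaway repository, the file names and the demo identity are all assumptions made up for illustration.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)                          # throwaway repository for the demo
cd "$repo"
git init -q .
git config user.name "Demo User"           # hypothetical identity
git config user.email "demo@example.com"

echo "never changes" > stable.txt
echo "first draft" > notes.txt
git add stable.txt notes.txt
git commit -q -m "first commit"

echo "second draft, now longer" > notes.txt
git add notes.txt
git commit -q -m "second commit"

# Ask each commit which blob object it records for each file.
old_stable=$(git rev-parse HEAD~1:stable.txt)
new_stable=$(git rev-parse HEAD:stable.txt)
old_notes=$(git rev-parse HEAD~1:notes.txt)
new_notes=$(git rev-parse HEAD:notes.txt)

echo "stable.txt: $old_stable -> $new_stable"   # same ID in both commits
echo "notes.txt:  $old_notes -> $new_notes"     # different IDs
```

Both commits name the same blob for stable.txt, so the repository holds only one stored copy of it; only notes.txt, which changed, needed a new object.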

The commits are numbered, but not by a nice easy sequential numbering system. That is, we might draw them as a series of simple numbered or lettered things:

... <-C4 <-C5 <-C6 ...

where each later commit points back to its immediate predecessor. But their actual names are big ugly hash IDs. Each one is guaranteed to be unique, which is why they have to be so big and ugly and random-looking. Each hash ID is actually a cryptographic checksum, calculated over the commit's contents, so that every Git everywhere in the universe will agree that that commit, and only that commit, gets that checksum. That's the other reason you—and even Git—can't change it: if you take a commit out of the repository database, tinker with it, and change even one single bit and then put it back into the database, what you get is a new commit with a new and different hash ID.

So commits are totally frozen, forever. The files inside them are frozen forever as well, and compressed, and in a special Git-only format. I like to call these files "freeze-dried". What this means is that, hey, they're great for archiving, but they are utterly useless for getting any new work done ... and that means that Git must provide some way of taking these freeze-dried files and rehydrating them into a useful form.

The work-tree provides the useful-form copies

Things don't really get much simpler than this: the work-tree has the useful-form, rehydrated copies of your files. Because they're just ordinary everyday files on your computer, you can see them, use them, change them around however you like, and otherwise work with them. They're technically not in the repository at all—they are more just right next to it. In a typical setup, the repository itself is in the .git directory/folder of the top level of your work-tree.
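If you want to check where the repository proper lives relative to a work-tree, git rev-parse can tell you. A small sketch, assuming a freshly created throwaway directory:

```shell
#!/bin/sh
set -e
worktree=$(mktemp -d)        # throwaway work-tree for the demo
cd "$worktree"
git init -q .

# Run from the top level of the work-tree.
git_dir=$(git rev-parse --git-dir)               # location of the repository
inside=$(git rev-parse --is-inside-work-tree)    # are we in a work-tree?

echo "repository directory: $git_dir"
echo "inside work-tree:     $inside"
```

In this typical setup the repository is reported as the .git directory at the top of the work-tree.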

Obviously, if there's a commit you've extracted to make the work-tree, there must now be two copies of each file: the freeze-dried committed one, plus the regular working one. Git could stop here. Mercurial does stop here: if you use Mercurial instead of Git, you don't need to concern yourself with a third copy, because there is no third copy. But Git goes on to store yet more copies of the files.

The index / staging-area sits between the commit and the work-tree

What Git does here is to interpose a third copy of each file, between the freeze-dried committed copy and the work-tree copy. This third copy is in the committed-file format (i.e., pre-dehydrated), but because it is not in a commit, it's not actually totally frozen: it can be replaced at any time. That's what git add does: git add takes the ordinary copy of the file from the work-tree, compresses it down into the freeze-dried format, and replaces the copy that's in the index. Or, if the file wasn't in the index at all, it puts a copy into the index.

This is why you have to git add files all the time. In Mercurial, you only hg add a file once. After that, you just run hg commit, and Mercurial looks at all the files it knows about, and freezes them into a new commit. This can take a long time, in a big repository. Git, by contrast, already has all the files it's supposed to know about, and already dehydrated, in the index, so git commit can just package up those dehydrated files into a new frozen commit. The cost of this speed is git add, but if you get into playing clever tricks with the index copies—e.g., using git add -p—you get more benefits than just the speedup.

As the Git glossary mentioned in its description of the index, the index takes on an expanded role during a conflicted merge. When you do a merge operation—whether that's from git merge, or from git revert or git cherry-pick or any other Git command that uses the merge engine—and it doesn't go smoothly, Git winds up putting all three inputs for each file into the index, so that instead of just one copy of file.ext, you get three. But as long as you're not in the middle of a merge, there's only one copy in the index.
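To watch the index hold three copies of one file, you can provoke a conflict on purpose and then list the index. This is a rough sketch under made-up assumptions: a throwaway repository, a branch I call side, and a one-line file.ext that both branches change.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)                          # throwaway repository for the demo
cd "$repo"
git init -q .
git config user.name "Demo User"           # hypothetical identity
git config user.email "demo@example.com"

echo "base text" > file.ext
git add file.ext
git commit -q -m "base"
start_branch=$(git symbolic-ref --short HEAD)   # whatever the default branch is

git checkout -q -b side
echo "text from the side branch" > file.ext
git commit -q -am "side change"

git checkout -q "$start_branch"
echo "text from the starting branch, edited" > file.ext
git commit -q -am "main change"

# Both branches changed the same line, so this merge stops with a conflict.
git merge side >/dev/null 2>&1 || true

git ls-files --stage file.ext                   # one line per staging slot
slots=$(git ls-files --stage file.ext | awk '{print $3}' | tr '\n' ' ')
unmerged_count=$(git ls-files --unmerged | wc -l)
echo "slots in use: $slots"
```

Slot 1 holds the merge-base version, slot 2 the version from the current branch, and slot 3 the version from the branch being merged. Outside a conflict there is just the one entry in slot 0.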

Usually the index copy matches the HEAD frozen copy, or matches the work-tree copy, or both. For instance, after a fresh git checkout, all three copies match. Then you modify file.ext in the work-tree: now the commit and the index match, but they're not the same as the work-tree copy. Then you git add file.ext, and now the index and work-tree match, but they're different from the frozen copy. Then you git commit to make a new commit, which becomes the current commit, and all three copies match again.

Note that you can modify the work-tree copy:

vim file.ext

then copy the updated one into the index:

git add file.ext

then edit it again:

vim file.ext

and that way, you can make all three copies different. If you do that, git status will say that you have changes staged for commit, because the index copy is different from the current-commit copy, and say that you have changes not staged for commit, because the work-tree copy is different from the index copy.
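The same sequence can be scripted, with echo standing in for the editor, so you can print all three copies side by side. This is a sketch in a throwaway repository; the contents and the demo identity are invented for illustration.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)                          # throwaway repository for the demo
cd "$repo"
git init -q .
git config user.name "Demo User"           # hypothetical identity
git config user.email "demo@example.com"

echo "first" > file.ext
git add file.ext
git commit -q -m "initial"                 # HEAD copy: "first"

echo "second version" > file.ext
git add file.ext                           # index copy: "second version"
echo "third version, unstaged" > file.ext  # work-tree copy only

head_copy=$(git show HEAD:file.ext)        # frozen copy in the current commit
index_copy=$(git show :file.ext)           # copy in the index
worktree_copy=$(cat file.ext)              # ordinary file on disk
status=$(git status --porcelain)

echo "HEAD:      $head_copy"
echo "index:     $index_copy"
echo "work-tree: $worktree_copy"
echo "status:    $status"
```

The short status line has two letters: the first compares the index to the current commit, the second compares the work-tree to the index. Here both are M.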

The work-tree can contain files that aren't in the index at all

The index is initially just a copy of the current commit. Git then also copies those files to the work-tree, so that you can use them. But you can create files in the work-tree and not run git add on them. Those files aren't in the index now, and if you run git commit, they won't be in the new commit either, because Git builds the new commit from the index.

You can also remove files from the index, without removing them from the work-tree:

git rm --cached file.ext

removes the index copy. It can't touch the current commit frozen copy, of course, but if you now make a new commit, the new commit won't have file.ext in it at all. (The previous commit still does, of course.)

Any file that is in your work-tree right now, and is not in your index right now, is an untracked file. Its untracked-ness comes from the fact that it's not in your index. Put that file into your index and it's tracked, no matter how you got it into your index. Remove it from your index and it's untracked, no matter how you got it out of your index. So that's the last role of the index: to determine which files are tracked, and will therefore be in the next commit.
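Here is a short sketch of that tracked / untracked flip, again in a throwaway repository with invented names. It removes file.ext from the index only and then asks git ls-files what it sees.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)                          # throwaway repository for the demo
cd "$repo"
git init -q .
git config user.name "Demo User"           # hypothetical identity
git config user.email "demo@example.com"

echo "data" > file.ext
git add file.ext
git commit -q -m "initial"

tracked_before=$(git ls-files)             # index contents before
git rm -q --cached file.ext                # drop the index copy only
tracked_after=$(git ls-files)              # index contents after
untracked=$(git ls-files --others)         # in work-tree, not in index
in_head=$(git ls-tree -r --name-only HEAD) # the existing commit is untouched

echo "tracked before: $tracked_before"
echo "tracked after:  $tracked_after"
echo "untracked now:  $untracked"
echo "still in HEAD:  $in_head"
```

The file never left the work-tree and is still in the existing commit; the only thing that changed is the index, and that alone is what made it untracked.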

Now we can see clearly what git ls-files does

What git ls-files does is to read the index and, depending on the options, the work-tree (it does not normally consult the HEAD commit). Depending on what arguments you give to git ls-files, it then prints the names of some or all files that are in the index and/or in the work-tree:

git ls-files --stage

lists the files that are in the index / staging-area, along with their staging slot numbers. (It says nothing about the copies in the HEAD commit and the work-tree.) Or:

git ls-files --others

lists the (names of the) files that are in the work-tree, but not in the index. (It says nothing about the copies in the HEAD commit.) Or:

git ls-files --modified

lists the (names of the) files that are in the index and whose work-tree copies differ from the index copies (a file deleted from the work-tree counts as modified too). (It says nothing about the copies in the HEAD commit.) With no options:

git ls-files

lists the (names of the) files that are in the index, with no regard for what files are in the HEAD commit or the work-tree.
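To compare these options on one small example, here is a sketch that sets up a throwaway repository with one tracked file that has an unstaged edit and one untracked file. The file names are made up for the demo.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)                          # throwaway repository for the demo
cd "$repo"
git init -q .
git config user.name "Demo User"           # hypothetical identity
git config user.email "demo@example.com"

echo "one" > tracked.txt
git add tracked.txt
git commit -q -m "initial"

echo "one plus an unstaged edit" > tracked.txt   # work-tree differs from index
echo "brand new" > untracked.txt                 # never added to the index

plain=$(git ls-files)                      # names in the index
others=$(git ls-files --others)            # in work-tree, not in index
modified=$(git ls-files --modified)        # work-tree copy differs from index copy
stage_slot=$(git ls-files --stage | awk '{print $3}')   # slot number column

echo "plain:      $plain"
echo "--others:   $others"
echo "--modified: $modified"
git ls-files --stage                       # mode, blob hash ID, slot, name
```

Note that --modified picks up tracked.txt even though nothing was staged after the commit: the comparison is between the work-tree and the index.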


git ls-files works correctly in your case. Your git status output shows that the X file is deleted from the working directory, which means the file still exists in the index. That's why git ls-files shows X: the command lists the contents of the index.

To remove that file from the index, run:

git rm --cached <pathToXFile>
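A quick way to convince yourself is to reproduce the situation in a scratch repository. This is a sketch with an invented file literally named X and a demo identity; it is not taken from your repository.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)                          # throwaway repository for the demo
cd "$repo"
git init -q .
git config user.name "Demo User"           # hypothetical identity
git config user.email "demo@example.com"

echo "contents" > X
git add X
git commit -q -m "add X"

rm X                                       # delete the work-tree copy only

status=$(git status --porcelain)           # unstaged deletion
listed_before=$(git ls-files)              # X is still in the index
deleted=$(git ls-files --deleted)          # in index, missing from work-tree

git rm -q --cached X                       # now remove the index copy too
listed_after=$(git ls-files)               # index no longer has X

echo "status:        $status"
echo "before rm:     $listed_before"
echo "--deleted:     $deleted"
echo "after rm:      $listed_after"
```

After git rm --cached the deletion is staged, so the next commit will no longer contain X.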

With Git 2.35 (Q1 2022), "git ls-files" learns the "--sparse" option to help debugging.

It is used with a sparse index, after a git sparse-checkout command.

See commit 408c51f, commit c2a2940, commit 3a9a6ac, commit 7808709, commit 5a4e054 (22 Dec 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 3c0e417, 10 Jan 2022)

ls-files: add --sparse option

Signed-off-by: Derrick Stolee

Existing callers to 'git ls-files' are expecting file names, not directories. It is best to expand a sparse index to show all of the contained files in this case.

However, expert users may want to inspect the contents of the index itself including which directories are sparse.
Add a --sparse option to allow users to request this information.

During testing, I noticed that options such as --modified did not affect the output when the files in question were outside the sparse-checkout definition.

git ls-files now includes in its man page:

--sparse

If the index is sparse, show the sparse directories without expanding to the contained files.
Sparse directories will be shown with a trailing slash, such as "x/" for a sparse directory "x".


Just wanted to share:

Referring to the accepted answer https://stackoverflow.com/a/56242906/2623045 and the discussion with https://stackoverflow.com/users/1256452/torek:

If the question was, how do I find out what files/objects should be there if I checked out a special commit, another answer might be something like:

git ls-tree -r -l HEAD

Torek also mentioned "(it's possible for HEAD to be a symbolic reference to a nonexistent branch name)" but I don't understand that for now.

So, more generally:

git ls-tree -r -l commit-hash

This also works in repositories cloned with the -n switch (no checkout)
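As a small check of that claim, the sketch below clones a local throwaway repository with -n and compares git ls-files against git ls-tree. The directory names src and copy are my own inventions; a local clone is used so that no network access is needed.

```shell
#!/bin/sh
set -e
base=$(mktemp -d)                          # scratch area for both repositories
git init -q "$base/src"
cd "$base/src"
git config user.name "Demo User"           # hypothetical identity
git config user.email "demo@example.com"
echo "hello" > file.ext
git add file.ext
git commit -q -m "initial"

cd "$base"
git clone -q -n src copy                   # -n: do not check anything out
cd copy

in_index=$(git ls-files)                       # reads the index
in_commit=$(git ls-tree -r --name-only HEAD)   # reads the commit itself

echo "ls-files: [$in_index]"
echo "ls-tree:  [$in_commit]"
```

With nothing checked out there is no index content for git ls-files to list, while git ls-tree reads the commit directly and still finds file.ext.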

Just wondering where the magic of the output is documented

extract from a repo cloned with: git clone -n https://github.com/nvie/gitflow.git

100755 blob fd16d5168d671b8f9a8a8a6a140d3f7b5dacdccd    git-flow
100644 blob 55198ad82cbfe7249951aa75f1373a476997d33a    git-flow-feature
100644 blob ba485f6fe4b7d9c35bc01d2a6bd4ae201bccc9bd    git-flow-hotfix
100644 blob 5b4e7e807423279d5983c28b16307e40dfdb51d7    git-flow-init
100644 blob cb95bd486deb7089939362705d78b2197893f578    git-flow-release
100644 blob cdbfc717c0f1eb9e653a4d10d7c4df261ed40eab    git-flow-support
100644 blob 8c314996c0ac31f1396c48af5c6511124002dab7    git-flow-version
100644 blob 33274053347f4eec2f27dd8bceca967b89ae02d5    gitflow-common
120000 blob 7b736c183c7f6400b20ea613183d74a55ead78b5    gitflow-shFlags
160000 commit 2fb06af13de884e9680f14a00c82e52a67c867f1  shFlags

My interpretation:

The hashes seem to be blob checksums (not commit hashes, except for the 160000 entry, whose type is commit). The same checksum can appear more than once if more than one file in a commit has identical content. The last three digits of e.g. 100644 look like Linux file access permissions (rw-r--r--). The first three digits are not 100 if the object is not a regular file. In real life gitflow-shFlags is a symlink and shFlags a submodule directory.

EDIT: Just stumbled over https://github.com/git/git/blob/master/Documentation/technical/index-format.txt (GOOGLE: git --index-info, STACKOVERFLOW: What does the git index contain EXACTLY?)

32-bit mode, split into (high to low bits)

  4-bit object type
  valid values in binary are 1000 (regular file), 1010 (symbolic link)
  and 1110 (gitlink)

  3-bit unused

  9-bit unix permission. Only 0755 and 0644 are valid for regular files.
  Symbolic links and gitlinks have value 0 in this field.

So if you interpret the digits as octal values

100644: 1'000' 000'110'100'100 --> object type is regular file

120000: 1'010' 000'000'000'000 --> object type is symbolic link

160000: 1'110' 000'000'000'000 --> object type is gitlink
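The same decoding can be done with plain shell arithmetic. This is just a sketch of the bit-twiddling described above; the function name decode_mode is my own invention, and it only knows the three object types quoted from index-format.txt.

```shell
#!/bin/sh
# decode_mode MODE: split an octal mode string such as 100644 into the
# 4-bit object type and the 9-bit permission field.
decode_mode() {
    value=$((0$1))                         # leading 0 makes the shell read octal
    type_bits=$(( (value >> 12) & 15 ))    # top 4 bits: object type
    perm_bits=$(( value & 0777 ))          # low 9 bits: unix permissions
    case $type_bits in
        8)  kind="regular file" ;;         # binary 1000
        10) kind="symbolic link" ;;        # binary 1010
        14) kind="gitlink" ;;              # binary 1110
        *)  kind="unknown" ;;
    esac
    printf '%s: type=%s perms=%o\n' "$1" "$kind" "$perm_bits"
}

decode_mode 100755
decode_mode 100644
decode_mode 120000
decode_mode 160000
```

For the listing above this reports a regular file with permissions 755 or 644, a symbolic link, and a gitlink, with permissions 0 for the last two.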

OMG: Why is it so hard extracting such information from the git man pages directly?

Next questions: What is 'gitlink'? Is it only associated with git submodules?
