Why can't Git handle large files and large repos?

Basically, it comes down to tradeoffs.

One of your questions has an example from Linus himself:

...CVS, ie it really ends up being pretty much oriented to a "one file at a time" model.

Which is nice in that you can have a million files, and then only check out a few of them - you'll never even see the impact of the other 999,995 files.

Git fundamentally never really looks at less than the whole repo...So git scales really badly if you force it to look at everything as one huge repository...

And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.

Just as you won't find a data structure with both O(1) indexed access and O(1) insertion, you won't find a content tracker that does everything fantastically.

Git has deliberately chosen to be better at some things, to the detriment of others.


Disk usage

Since Git is a DVCS (distributed version control system), everyone has a copy of the entire repo (unless you use the relatively recent shallow clone).

This has some really nice advantages, which is why DVCSs like Git have become insanely popular.

However, a 4 TB repo on a central server is manageable with SVN or CVS, whereas with Git, not everyone will be thrilled about carrying that around.
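If the full history is what hurts, the shallow clone mentioned above helps: you only download the recent commits. As a rough sketch, with a placeholder URL:

    # shallow clone: only the latest commit, not the whole history
    git clone --depth 1 https://example.com/huge-repo.git

    # partial clone (newer Git, needs server support): full history,
    # but file contents are downloaded lazily as you check them out
    git clone --filter=blob:none https://example.com/huge-repo.git

That helps the initial clone, but plenty of everyday operations still want the missing data, so it only goes so far.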

Git has nifty mechanisms for minimizing the size of your repo by creating delta chains ("diffs") between files. Git isn't constrained by paths or commit order in creating these, and they really work quite well... kind of like gzipping the entire repo.

Git puts all these little diffs into packfiles. Delta chains and packfiles make retrieving objects take a little longer, but they are very effective at minimizing disk usage. (There are those tradeoffs again.)
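If you're curious how well that packing works on your own repo, Git will give you rough numbers; for example:

    git gc                    # repack loose objects into packfiles
    git count-objects -v      # "size-pack" is the packed size on disk, in KiB
    git verify-pack -v .git/objects/pack/pack-*.idx | head    # packed objects, with delta depth for deltas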

That mechanism doesn't work as well for binary files, as they tend to differ quite a bit, even after a "small" change.
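If you already know certain binaries will never delta well, you can tell Git not to waste time trying, via the delta attribute in .gitattributes (the patterns here are only examples):

    # .gitattributes
    *.png -delta
    *.zip -delta

That doesn't shrink anything; it just skips work that wasn't going to pay off.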


History

When you check in a file, you have it forever and ever. Your grandchildren's grandchildren's grandchildren will download your cat gif every time they clone your repo.

Git's content-addressed design (each object ID is a SHA hash of its content, and objects refer to one another by those IDs) makes permanently removing those files difficult, invasive, and destructive to history. In contrast, I can delete a crufty binary from an artifact repo, or an S3 bucket, without affecting the rest of my content.
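To give a sense of how invasive "invasive" is: purging that cat gif means rewriting every commit that ever contained it (and everything after it), force-pushing, and having every collaborator re-clone. With a separate tool like git-filter-repo it looks roughly like this (the path is made up):

    # run on a fresh clone; rewrites history to drop the file everywhere
    git filter-repo --invert-paths --path assets/cat.gif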


Difficulty

Working with really large files requires a lot of careful work: you have to minimize your operations and never load the whole thing into memory. That is extremely difficult to do reliably in a program with as complex a feature set as Git's.


Conclusion

Ultimately, developers who say "don't put large files in Git" are a bit like those who say "don't put large files in databases". They don't like it, but the alternatives have their own disadvantages (you lose Git integration in the one case, ACID compliance and foreign keys in the other). In reality, it usually works okay, especially if you have enough memory.

It wasn't designed for that, so it won't excel.


It's not true that git "can't handle" large files. It's just that you probably don't want to use git to manage a repository of large binary files, because a git repo contains the complete history of every file, and delta compression is much less effective on most kinds of binary files than it is on text files. The result is a very large repo that takes a long time to clone, uses a lot of disk space, and might be unacceptably slow for other operations because of the sheer amount of data it has to go through.

Alternatives and add-ons like git-annex store the revisions of large binary files separately, in a way that breaks git's usual assumption of having every previous state of the repository available offline at any time, but avoids having to ship such large amounts of data.
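With git-annex, for instance, the repo tracks only a small pointer (a symlink) while the big file's content lives in an annex you fetch or drop on demand; the file name below is just for illustration:

    git annex init
    git annex add videos/big-take.mov    # content moves under .git/annex, a symlink is staged
    git commit -m "Add raw footage via annex"
    git annex get videos/big-take.mov    # pull the real content from a remote when you need it
    git annex drop videos/big-take.mov   # free local space once enough copies exist elsewhere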
