How to use git annex on an existing repository?

If you just remove the files from the most recent commit and start using git-annex now, it will work, but your existing git repository will not get any smaller. This is because your history still contains all the big files checked into Git.
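To see this for yourself, here is a small sketch that builds a throwaway repository, commits a file, deletes it again, and then checks whether the blob is still reachable from history. The repository location, the file name big.bin, and the demo identity are all made up for the illustration; nothing here touches a real repo.

```shell
#!/bin/sh
# Throwaway demo repository in a temp directory (hypothetical names throughout).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

# Commit a stand-in for a big file, then delete it in a second commit.
dd if=/dev/urandom of=big.bin bs=1048576 count=1 2>/dev/null
git add big.bin
git commit -q -m "add big file"
git rm -q big.bin
git commit -q -m "remove big file"

# Files tracked in the latest commit (big.bin should no longer be among them).
in_head=$(git ls-tree -r --name-only HEAD)
# Number of objects in the full history that are still recorded under that name.
in_history=$(git rev-list --objects --all | grep -c ' big.bin$')

echo "tracked in HEAD: '${in_head}'"
echo "objects named big.bin in history: ${in_history}"
```

If the second number is not zero, the blob is still stored in the repository even though the latest commit no longer contains the file, which is why rewriting history is needed to actually shrink it.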

You might be able to use git filter-branch to rewrite your commits so that the big files are removed and annexed, as if they had been in the annex all along. The command would probably look something like the following. I haven't tested this myself since I don't have git-annex installed, so clone your repo and test it there first!

git filter-branch --tree-filter 'find . -size +5M -type f -not -ipath \*.git/\* -print0 | xargs -0 -r git rm --cached; find . -size +5M -type f -not -ipath \*.git/\* -print0 | xargs -0 -r git annex add' HEAD

Step by step, what that hopefully does is:

  1. git filter-branch --tree-filter '<commands>' HEAD

    Rewrite the tree of every commit reachable from HEAD.

  2. find . -size +5M -type f -not -ipath \*.git/\* -print0 | xargs -0 -r git rm --cached;

    For each commit, find all files larger than 5MB in the repo (minus the .git directory) and remove them from the index. The -r flag (GNU xargs) keeps xargs from running git rm with no arguments when a commit has no matching files.

  3. find . -size +5M -type f -not -ipath \*.git/\* -print0 | xargs -0 -r git annex add

    Find all files larger than 5MB in the repo and add them to the annex. Here -r matters even more, since git annex add without any paths would act on the whole current directory.
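Before rewriting any history, it may help to preview what the find expression actually selects. The sketch below wraps the same expression in a helper, list_big_files (a name invented here), and runs it against a small made-up directory tree instead of a real repository, so it needs neither git nor git-annex.

```shell
#!/bin/sh
# list_big_files DIR SIZE: print files under DIR larger than SIZE,
# skipping anything below a .git directory. Hypothetical helper name.
list_big_files() {
    find "$1" -size "+$2" -type f -not -ipath '*.git/*' -print | sort
}

# Made-up demo tree: one big file, one big file inside .git, one small file.
demo=$(mktemp -d)
mkdir -p "$demo/.git/objects" "$demo/media"
dd if=/dev/zero of="$demo/media/big.bin" bs=1048576 count=6 2>/dev/null
dd if=/dev/zero of="$demo/.git/objects/pack.bin" bs=1048576 count=6 2>/dev/null
printf 'hello\n' > "$demo/notes.txt"

# Run from inside the tree, as the tree filter would.
big_files=$(cd "$demo" && list_big_files . 5M)
printf '%s\n' "$big_files"
```

Running the same helper in a scratch clone of your own repository gives a quick idea of which paths the filter would annex in the current checkout.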


This has also been discussed on the git-annex forum: http://git-annex.branchable.com/forum/migrate_existing_git_repository_to_git-annex/

My experience was less complicated: I did not need to edit .gitattributes, so I did not need to do a bunch of rebases up front. I also had only one branch.

git filter-branch  --tag-name-filter cat --tree-filter 'mkdir -p .git-annex; cp ${MYWORKDIR}/.tmp/* .git-annex/; find . -size +5M -type f -not -ipath \*.git\* -not -ipath \*.temp\* -print0 | parallel -0 -j1 ~/bin/gax; git reset HEAD .git-rewrite; :' -- master

The script that GNU parallel calls, ~/bin/gax, looks like this:

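#!/bin/bash
# Annex a single file, then rewrite the resulting symlink.
f=$1
git annex add "$f"
# git annex add leaves a symlink in place of the file; read its target.
annexdest=$(readlink "$f")
# Strip the leading ../../ from the target and re-create the link.
# Quoting keeps file names with spaces intact.
ln -sf "${annexdest#../../}" "$f"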

The script could be made faster by passing all the files at once (git annex ignores adds for non-existent files), but you would then have to loop over the symlinks to fix them all.
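A rough sketch of what that loop might look like is below. The function name fix_annex_links is invented, the batched git annex add call is left as a comment so the loop can be tried without git-annex installed, and the symlinks in the demo are made by hand with made-up key names.

```shell
#!/bin/bash
# fix_annex_links FILE...: rewrite each symlink the same way gax does.
# Hypothetical helper; in real use the files would be annexed first.
fix_annex_links() {
    local f target
    # git annex add "$@"    # assumed batched call, commented out for the demo
    for f in "$@"; do
        [ -L "$f" ] || continue            # only touch symlinks
        target=$(readlink "$f")
        ln -sf "${target#../../}" "$f"     # drop the leading ../../
    done
}

# Demo with hand-made (dangling) symlinks standing in for annexed files.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
ln -s ../../.git/annex/objects/aa/KEY1 "$demo/a.bin"
ln -s ../../../.git/annex/objects/bb/KEY2 "$demo/sub/b.bin"
printf 'plain file\n' > "$demo/c.txt"

fix_annex_links "$demo/a.bin" "$demo/sub/b.bin" "$demo/c.txt"

link_a=$(readlink "$demo/a.bin")
link_b=$(readlink "$demo/sub/b.bin")
printf '%s\n%s\n' "$link_a" "$link_b"
```

Regular files are skipped, so passing a path that was not annexed does no harm in this sketch.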

The filter-branch command could also be made faster by first generating the list of files using find, and using that list instead of running find on the working tree every time.
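As a sketch of that idea, the two helpers below (build_list and listed_files_present, both names invented here) run find once, save a NUL-delimited list, and then only check which listed paths exist in a given tree. The demo uses a plain temporary directory rather than a repository. One caveat worth keeping in mind: a list built from the current checkout would not include big files that exist only in older commits.

```shell
#!/bin/bash
# build_list DIR SIZE LISTFILE: run find once and save the result,
# NUL-delimited so unusual file names survive.
build_list() {
    (cd "$1" && find . -size "+$2" -type f -not -ipath '*.git/*' -print0) > "$3"
}

# listed_files_present DIR LISTFILE: print the listed paths that exist in DIR.
listed_files_present() {
    local f
    while IFS= read -r -d '' f; do
        if [ -f "$1/$f" ]; then
            printf '%s\n' "$f"
        fi
    done < "$2"
}

# Made-up demo tree with two big files.
demo=$(mktemp -d)
list=$(mktemp)
dd if=/dev/zero of="$demo/a.bin" bs=1048576 count=6 2>/dev/null
dd if=/dev/zero of="$demo/b.bin" bs=1048576 count=6 2>/dev/null
build_list "$demo" 5M "$list"

rm "$demo/b.bin"     # pretend the commit being rewritten lacks b.bin
present=$(listed_files_present "$demo" "$list")
printf '%s\n' "$present"
```

Inside a tree filter, the per-commit step would then annex only the paths this check reports, instead of walking the whole working tree with find each time.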
