Linux backup utility for incremental backups

Although tar does have an incremental mode, there are a couple of more comprehensive tools for the job:

  • Duplicity
  • Duplicati

They not only support incremental backups but also make it easy to configure a schedule on which a full backup is taken. For example, duplicity --full-if-older-than 1M performs a full backup instead of an incremental one whenever the most recent full backup is more than a month old. They also support going back in time to a specific file; with plain tar you have to go through the incremental archives until you find the one that contains the right version of the file.

Additionally, they support encryption and uploading to a variety of backends (SFTP, blob storage, etc.). Obviously, if you encrypt, don't forget to keep a good backup of your keys in a second, separate location!

Another important aspect is that you can verify the integrity of your backups to make sure you can restore from them, e.g. using duplicity verify.

I would advise against a git-based backup strategy: large restores take significant time.


I tried rsync, but it doesn't seem to be able to do what I want, or more likely, I don't know how to make it do that.

I know I could probably create a script that runs a diff and then selects the files to backup based on the result (or more efficiently, just get a checksum and compare), but I want to know if there's any utility that can do this a tad easier :)

rsync is precisely that program that copies based on a diff. By default, it copies only when there is a difference in last-modified time or size, but it can even compare by checksum with -c.

The trouble here is that you're tar'ing the backups. This becomes easier if you don't do that. I don't even know why you're doing it. It might make sense if you're compressing them, but you're not even doing that.

The Wikipedia article for Incremental Backups has an example rsync command that goes roughly:

rsync -va \
  --link-dest="$dst/2020-02-16--05-10-45--testdir/" \
  "$src/testdir/" \
  "$dst/2020-02-17--03-24-16--testdir/"

What it does is to hardlink files from the previous backup when they are unchanged from the source. There's also --copy-dest if you want it to copy instead (it's still faster when $dst is a remote or on a faster drive).
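
For scheduled runs, the --link-dest directory has to be picked anew each time. Here is a minimal sketch of one way to find it, assuming the YYYY-MM-DD--HH-MM-SS--name naming used in the example above. The function name latest_backup and the scratch directory are made up for illustration and are not part of rsync:

```shell
#!/bin/sh
# latest_backup DST NAME: print the newest backup directory for NAME under DST.
# Hypothetical helper; it relies on the timestamp prefix sorting chronologically
# when the shell expands the glob in sorted order.
latest_backup() {
    lb_last=
    for lb_dir in "$1"/*--"$2"; do
        if [ -d "$lb_dir" ]; then
            lb_last=$lb_dir
        fi
    done
    if [ -n "$lb_last" ]; then
        printf '%s\n' "$lb_last"
    fi
}

# Demonstration on a scratch directory holding two empty "backups".
dst=$(mktemp -d)
mkdir "$dst/2020-02-16--05-10-45--testdir" "$dst/2020-02-17--03-24-16--testdir"
prev=$(latest_backup "$dst" testdir)
echo "previous backup: $prev"
```

The result could then be passed as --link-dest="$prev/", with the new directory named from something like date +%Y-%m-%d--%H-%M-%S. When no earlier backup exists the function prints nothing, and the first run would simply omit --link-dest.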

If you use a filesystem with subvolumes like btrfs, you can also just snapshot from the previous backup before rsync'ing. Snapshots are instantaneous and don't take additional space[1].

btrfs subvolume snapshot \
  "$dst/2020-02-16--05-10-45--testdir" \
  "$dst/2020-02-17--03-24-16--testdir"

Or if you're using a filesystem that supports reflinks, like XFS or btrfs (ext4 does not), then you can also do that. A reflink copy gets a new inode that refers to the same blocks as the source file, and the blocks are only duplicated when one of the copies is modified (copy-on-write). It's still faster than a regular copy because it doesn't read and write the data, and it also doesn't take additional space[1].

cp --reflink -av \
  "$dst/2020-02-16--05-10-45--testdir" \
  "$dst/2020-02-17--03-24-16--testdir"

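A bare --reflink means --reflink=always, so on a filesystem without reflink support the cp above fails instead of quietly making a full copy. GNU cp also accepts --reflink=auto, which falls back to a regular copy. A small sketch on scratch directories (the paths are placeholders, not real backups):

```shell
#!/bin/sh
scratch=$(mktemp -d)
mkdir "$scratch/2020-02-16--05-10-45--testdir"
echo "This is file b" > "$scratch/2020-02-16--05-10-45--testdir/btext"

# --reflink=auto shares blocks where the filesystem can, and copies normally otherwise.
cp --reflink=auto -a \
  "$scratch/2020-02-16--05-10-45--testdir" \
  "$scratch/2020-02-17--03-24-16--testdir"

# Either way the new directory has its own inodes, unlike a hardlinked backup.
old_inode=$(stat -c %i "$scratch/2020-02-16--05-10-45--testdir/btext")
new_inode=$(stat -c %i "$scratch/2020-02-17--03-24-16--testdir/btext")
echo "inodes: $old_inode (old) $new_inode (new)"
```

With the fallback, the copy takes the full amount of space on filesystems that cannot share blocks, so it only saves space where reflinks actually work.
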
Anyway, once you have done something like that, you can just do a regular rsync to copy the differences:

rsync -va \
  "$src/testdir/" \
  "$dst/2020-02-17--03-24-16--testdir/"

Though, you might want to add --delete, which would cause rsync to delete files from the destination that are no longer present in the source.

Another useful option is -i or --itemize-changes. It produces succinct, machine readable output that describes what changes rsync is doing. I normally add that option and pipe like:

rsync -Pai --delete \
  "$src/testdir/" \
  "$dst/2020-02-17--03-24-16--testdir/" \
|& tee -a "$dst/2020-02-17--03-24-16--testdir.log"

to keep a record of the changes in easily greppable files. The |& is bash shorthand for 2>&1 |, so it pipes both stdout and stderr.

The -P is short for --partial and --progress. --partial keeps partially transferred files, but more importantly --progress reports per-file progress.

How this compares to archiving changes with tar

The above solutions result in directories that seem to hold everything. Even though that's the case, in total for any amount/frequency of backups, they would occupy around the same amount of space as having plain tar archives with only changes. That's because of how hardlinks, reflinks, and snapshots work. The use of bandwidth when creating the backups would also be the same.
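
The space claim for hardlinks can be checked with du, which counts a hardlinked file only once per invocation. A rough sketch on a scratch directory, with made-up directory names day1 and day2 standing in for two backups:

```shell
#!/bin/sh
demo=$(mktemp -d)
mkdir "$demo/day1" "$demo/day2"

# One 1 MiB file in the first "backup".
dd if=/dev/urandom of="$demo/day1/big" bs=1024 count=1024 2>/dev/null

# The second "backup" hardlinks it, as --link-dest does for unchanged files.
ln "$demo/day1/big" "$demo/day2/big"

one=$(du -sk "$demo/day1" | cut -f1)
both=$(du -sk "$demo" | cut -f1)
links=$(stat -c %h "$demo/day1/big")
echo "day1 alone: ${one}K, day1 and day2 together: ${both}K, link count: $links"
```

Both directories list the full file, yet the total is only slightly larger than one backup; the difference is directory metadata.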

The advantages are:

  • backups are easy to restore with rsync and faster, since rsync would only transfer the differences from the backup.
  • they're simpler to browse and modify if needed.
  • file deletions can be encoded naturally as the file's absence in new backups. When using tar archives, one would have to resort to hacks, like to delete a file foo, mark it foo.DELETED or do something complicated. I've never used duplicity for example, but looking at its documentation, it seems it encodes deletions by adding an empty file of the same name in the new tar and holding the original signature of the file in a separate .sigtar file. I imagine it compares the original signature with that of an empty file to differentiate between a file deletion and a change to an actual empty file.

If one still wants to set up each backup as only holding the files that are different (added or modified), then one can use the --link-dest solution described above and then delete the hardlinks using something like the following:

find "$new_backup" -type f ! -links 1 -delete
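
To see what that find command keeps and removes, here is a small sketch on a scratch directory. The names old, new, unchanged and added are made up, and ln stands in for what --link-dest would do:

```shell
#!/bin/sh
demo=$(mktemp -d)
mkdir "$demo/old" "$demo/new"

echo same > "$demo/old/unchanged"
ln "$demo/old/unchanged" "$demo/new/unchanged"   # unchanged file: hardlinked
echo fresh > "$demo/new/added"                    # new file: link count 1

new_backup=$demo/new
# Remove every regular file whose link count is not 1.
find "$new_backup" -type f ! -links 1 -delete

ls "$new_backup"
```

Only added is left in the new backup, and the old backup still has its copy of unchanged. Note that the test is purely on link count, so a file that has more than one link for some other reason would be removed as well.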

[1] Strictly speaking, they do use additional space in the form of duplicate metadata, like the filename and such. However, I think anyone would consider that insignificant.


And why are you not considering git itself?

The strategy you describe, after one full and two incremental backups, has its complications when you continue. It is easy to make mistakes, and it can get very inefficient, depending on the changes. There would have to be a kind of rotation, i.e. from time to time you make a new full backup, and then do you want to keep the old one or not?


Given a working dir "testdir" containing some project (files, and subdirs), git makes by default a hidden .git subdir for the data. That would be for the local, additional version control features. For backup, you can archive/copy it away to a medium or clone it via network.

The revision control you get (without asking for) is a side effect of git's differential storage.

You can leave out all the forking/branching and so on. This means you have one branch called "master".

Before you can commit (actually write to the git archive/repo), you have to configure a minimal user for the config file. Then you should first learn and test in a subdir (maybe tmpfs). Git is just as tricky as tar, sometimes.
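
A minimal sketch of that first setup, done in a scratch directory so nothing real is touched. The name and e-mail are placeholders; they are written to this repository's own config only:

```shell
#!/bin/sh
# Scratch working directory standing in for "testdir".
work=$(mktemp -d)
git -C "$work" init -q

# Minimal identity, stored in $work/.git/config rather than globally.
git -C "$work" config user.name "Your Name"
git -C "$work" config user.email "you@example.com"

echo "This is file a" > "$work/atext"
git -C "$work" add .
git -C "$work" commit -q -m "added files with 'add .'"

commits=$(git -C "$work" rev-list --count HEAD)
echo "commits so far: $commits"
```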

Anyway, as a comment says: backing up is easy, hard part is the restoring.


Disadvantages of git would be just the small overhead/overkill.

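Advantages are: git tracks content and file names. Unchanged files are not stored again, and once the repository is packed (git gc), changed files are delta-compressed against their earlier versions, which works best for text files.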


Example

I have 3 files in a dir. After git init, git add . and git commit I have a 260K .git dir.

Then I cp -r .git /tmp/abpic.git (a good place to save a backup:). I rm the 154K jpg, and also change one text file. I also rm -r .git.

  ]# ls
    atext  btext

  ]# git --git-dir=/tmp/abpic.git/ ls-files
    atext
    btext
    pic154k.jpg

Before restoring the files I can get the precise differences:

]# git --git-dir=/tmp/abpic.git/ status
On branch master
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   atext
        deleted:    pic154k.jpg

no changes added to commit (use "git add" and/or "git commit -a")

Here I want to follow the git restore hint.

After git --git-dir=/tmp/abpic.git/ restore \*:

]# ls -st
total 164
  4 atext  156 pic154k.jpg    4 btext

The jpeg is back, and text file btext has not been updated (keeps timestamp). The modifications in atext are overwritten.

To reunite the repo and the (working) dir you can just copy it back.

]# cp -r /tmp/abpic.git/ .git
]# git status
On branch master
nothing to commit, working tree clean

The files in the current dir are identical to the .git archive (after the restore). New changes will be displayed and can be added and committed, without any planning. You only have to store it to another medium, for backup purposes.
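
Instead of copying .git by hand, the copy on the other medium can be a bare clone that later receives pushes. This is a sketch under assumptions: both paths are scratch directories standing in for the working dir and the backup medium, and the identity is a placeholder.

```shell
#!/bin/sh
work=$(mktemp -d)
backup=$(mktemp -d)/testdir.git

# A small repository to back up.
git -C "$work" init -q
git -C "$work" config user.name "Your Name"
git -C "$work" config user.email "you@example.com"
echo "This is file b" > "$work/btext"
git -C "$work" add .
git -C "$work" commit -q -m "first backup"

# Initial backup: a bare clone holds only the repository data, no working files.
git clone -q --bare "$work" "$backup"

# Later: commit a change and send just the new objects to the backup.
echo more >> "$work/btext"
git -C "$work" commit -q -a -m "btext: more"
git -C "$work" push -q "$backup" HEAD

git --git-dir="$backup" log --oneline
```

A restore would then be a git clone of the backup path into an empty directory.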


After a file is modified, you can use status or diff:

]# echo more >>btext 

]# git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   btext

no changes added to commit (use "git add" and/or "git commit -a")

]# git diff
diff --git a/btext b/btext
index 96b5d76..a4a6c5b 100644
--- a/btext
+++ b/btext
@@ -1,2 +1,3 @@
 This is file b
 second line
+more
]#

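And just like git knows about "+more" in file 'btext', it will not store the unchanged files again. The commit adds a new blob for btext only, and that blob is delta-compressed against the old version once the repository is packed.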

After git add . (or git add btext) the status command switches from red to green and the commit gives you the info.

]# git add .
]# git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   btext

]# git commit -m 'btext: more'
[master fad0453] btext: more
 1 file changed, 1 insertion(+)

And you can really get at the contents, somehow:

]# git ls-tree @
100644 blob 321e55a5dc61e25fe34e7c79f388101bd1ae4bbf    atext
100644 blob a4a6c5bd3359d84705e5fd01884caa8abd1736d0    btext
100644 blob 2d550ffe96aa4347e465109831ac52b7897b9f0d    pic154k.jpg

The first four hex digits of a blob's hash are enough to print its contents:

]# git cat-file blob a4a6
This is file b
second line
more

To travel back in time by one commit it is:

]# git ls-tree @^
100644 blob 321e55a5dc61e25fe34e7c79f388101bd1ae4bbf    atext
100644 blob 96b5d76c5ee3ccb7e02be421e21c4fb8b96ca2f0    btext
100644 blob 2d550ffe96aa4347e465109831ac52b7897b9f0d    pic154k.jpg

]# git cat-file blob 96b5
This is file b
second line

btext's blob has a different hash before the last commit, the others have the same.

An overview would be:

]# git log
commit fad04538f7f8ddae1f630b648d1fe85c1fafa1b4 (HEAD -> master)
Author: Your Name <[email protected]>
Date:   Sun Feb 16 10:51:51 2020 +0000

    btext: more

commit 0bfc1837e20988f1b80f8b7070c5cdd2de346dc7
Author: Your Name <[email protected]>
Date:   Sun Feb 16 08:45:16 2020 +0000

    added 3 files with 'add .'

Instead of manually timestamped tar files you have commits with a message and date (and an author). Logically attached to these commits are the file lists and contents.

Simple git is 20% more complicated than tar, but you get decisive 50% more functionality from it.


I wanted to make OP's third change: change a file plus two new 'picture' files. I did, but now I have:

]# git log
commit deca7be7de8571a222d9fb9c0d1287e1d4d3160c (HEAD -> master)
Author: Your Name <[email protected]>
Date:   Sun Feb 16 17:56:18 2020 +0000

    didn't add the pics before :(

commit b0355a07476c8d8103ce937ddc372575f0fb8ebf
Author: Your Name <[email protected]>
Date:   Sun Feb 16 17:54:03 2020 +0000

    Two new picture files
    Had to change btext...

commit fad04538f7f8ddae1f630b648d1fe85c1fafa1b4
Author: Your Name <[email protected]>
Date:   Sun Feb 16 10:51:51 2020 +0000

    btext: more

commit 0bfc1837e20988f1b80f8b7070c5cdd2de346dc7
Author: Your Name <[email protected]>
Date:   Sun Feb 16 08:45:16 2020 +0000

    added 3 files with 'add .'
]# 

So what did that Your Name Guy do exactly, in his two commits, shortly before 6 pm?

The last commit's details are:

]# git show
commit deca7be7de8571a222d9fb9c0d1287e1d4d3160c (HEAD -> master)
Author: Your Name <[email protected]>
Date:   Sun Feb 16 17:56:18 2020 +0000

    didn't add the pics before :(

diff --git a/picture2 b/picture2
new file mode 100644
index 0000000..d00491f
--- /dev/null
+++ b/picture2
@@ -0,0 +1 @@
+1
diff --git a/picture3 b/picture3
new file mode 100644
index 0000000..0cfbf08
--- /dev/null
+++ b/picture3
@@ -0,0 +1 @@
+2
]# 

And to check the second-to-last commit, whose message announces two pictures:

]# git show @^
commit b0355a07476c8d8103ce937ddc372575f0fb8ebf
Author: Your Name <[email protected]>
Date:   Sun Feb 16 17:54:03 2020 +0000

    Two new picture files
    Had to change btext...

diff --git a/btext b/btext
index a4a6c5b..de7291e 100644
--- a/btext
+++ b/btext
@@ -1,3 +1 @@
-This is file b
-second line
-more
+Completely changed file b
]# 

This happened because I used git commit -a as a shortcut for git add . followed by git commit, but commit -a only stages files that are already tracked, and the two picture files were new (untracked). They did show up in red in git status; as I said, git is no less tricky than tar, or Unix.
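
The difference can be reproduced in a scratch repository. This sketch uses placeholder identity values and the file names from above:

```shell
#!/bin/sh
work=$(mktemp -d)
git -C "$work" init -q
git -C "$work" config user.name "Your Name"
git -C "$work" config user.email "you@example.com"
echo "This is file b" > "$work/btext"
git -C "$work" add .
git -C "$work" commit -q -m "first commit"

# Change a tracked file and create a new, untracked one.
echo "Completely changed file b" > "$work/btext"
echo 1 > "$work/picture2"

# commit -a picks up the change to btext but leaves picture2 untracked.
git -C "$work" commit -q -a -m "Had to change btext..."
after_commit_a=$(git -C "$work" status --porcelain)
echo "after commit -a: $after_commit_a"

# add -A (or "add .") stages new files too.
git -C "$work" add -A
git -C "$work" commit -q -m "add the picture"
after_add=$(git -C "$work" status --porcelain)
```

For a backup script, running git add -A before git commit avoids silently leaving new files out.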


"Your debutante just knows what you need, but I know what you want" (or the other way round. Point is it's not always the same)