Binary diff/patch for large files on Linux?

You should probably take a look at the rsync-related tools: rdiff and rdiff-backup. The rdiff command lets you produce a patch file and apply it to some other file.

The rdiff-backup command uses this approach to deal with entire directories, but I'm guessing you're working with single-file disk images, so rdiff will be the one to use.


xdelta can do everything you want. Fair warning, though: if your images aren't very similar, you can end up with a very large patch, because xdelta uses only half of the configured memory buffer for finding differences. More information is available at the TuningMemoryBudget wiki page. Increasing the buffer size may help considerably.

bsdiff is another option, but it's very RAM hungry and completely inappropriate for anything the size of a disk image.

bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1) bytes of memory, where n is the size of the old file and m is the size of the new file. bspatch requires n+m+O(1) bytes.
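
To put that formula into numbers, here is a minimal sketch that just evaluates it (ignoring the O(1) term). The function names and the 8 GiB image size are hypothetical, chosen only for illustration; they are not part of bsdiff.

```python
def bsdiff_memory_bytes(n: int, m: int) -> int:
    """Approximate bsdiff memory use: max(17*n, 9*n + m), O(1) term ignored.

    n is the size of the old file and m the size of the new file, in bytes.
    """
    return max(17 * n, 9 * n + m)


def bspatch_memory_bytes(n: int, m: int) -> int:
    """Approximate bspatch memory use: n + m, O(1) term ignored."""
    return n + m


GIB = 1024 ** 3

# Hypothetical example: old and new disk images of 8 GiB each.
old_size = 8 * GIB
new_size = 8 * GIB

print(bsdiff_memory_bytes(old_size, new_size) / GIB)   # 136.0 (GiB)
print(bspatch_memory_bytes(old_size, new_size) / GIB)  # 16.0 (GiB)
```

So for an image of that size, creating the patch would need on the order of 136 GiB of memory, which is why bsdiff is a poor fit for disk images.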


Canonical Answer

Regarding the rdiff post: the librsync 2.0.1 documentation is a good read for clarifying how the command works, so I've quoted it below to preserve the content in this answer.

It's important to understand rdiff's three steps for updating a file (signature, delta, and patch), as described on the rdiff man page. I've also found a helpful rdiff example script on GitHub, which I reference and quote below.

Essentially...

  1. Start with a base file [file1] and create a signature file from it
    • This is usually much smaller than the base file itself
  2. Compare the signature file against another file [file2], one that is similar to your base file but different (e.g. recently updated), and create a delta file containing just the differences between the two files
  3. Apply the delta file to your base file [file1] to generate a new file that matches the other file [file2]

Quick Commands (per rdiff-example.sh)

rdiff signature file1 signature-file            ## create a signature of base file1
rdiff delta signature-file file2 delta-file     ## create a delta from the signature and file2
rdiff patch file1 delta-file gen-file           ## apply the delta to file1 to recreate file2 as gen-file

rdiff-example.sh

# $ rdiff --help
# Usage: rdiff [OPTIONS] signature [BASIS [SIGNATURE]]
#              [OPTIONS] delta SIGNATURE [NEWFILE [DELTA]]
#              [OPTIONS] patch BASIS [DELTA [NEWFILE]]

# Options:
#   -v, --verbose             Trace internal processing
#   -V, --version             Show program version
#   -?, --help                Show this help message
#   -s, --statistics          Show performance statistics
# Delta-encoding options:
#   -b, --block-size=BYTES    Signature block size
#   -S, --sum-size=BYTES      Set signature strength
#       --paranoia            Verify all rolling checksums
# IO options:
#   -I, --input-size=BYTES    Input buffer size
#   -O, --output-size=BYTES   Output buffer size

# create signature for old file
rdiff signature old-file signature-file
# create delta using signature file and new file
rdiff delta signature-file new-file delta-file
# generate new file using old file and delta
rdiff patch old-file delta-file gen-file
# test
diff -s gen-file new-file
# Files gen-file and new-file are identical

Introduction

rdiff is a program to compute and apply network deltas. An rdiff delta is a delta between binary files, describing how a basis (or old) file can be automatically edited to produce a result (or new) file.

Unlike most diff programs, librsync does not require access to both of the files when the diff is computed. Computing a delta requires just a short "signature" of the old file and the complete contents of the new file. The signature contains checksums for blocks of the old file. Using these checksums, rdiff finds matching blocks in the new file, and then computes the delta.
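
To make the signature/delta/patch idea concrete, here is a toy sketch in Python. It is not how librsync is implemented: it assumes a tiny fixed block size, uses a plain MD5 hash per block instead of a weak rolling checksum paired with a strong hash, and the names `signature`, `delta`, and `patch` are hypothetical helpers that only mirror the three rdiff steps.

```python
import hashlib

BLOCK = 4  # toy block size (assumption); real tools use much larger blocks


def signature(basis: bytes, block: int = BLOCK) -> dict:
    """Map the checksum of each fixed-size basis block to its block index."""
    sig = {}
    for i in range(0, len(basis), block):
        digest = hashlib.md5(basis[i:i + block]).digest()
        sig.setdefault(digest, i // block)  # keep the first block with this content
    return sig


def delta(sig: dict, new: bytes, block: int = BLOCK) -> list:
    """Build copy/literal instructions from the signature and the new file only."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + block]
        idx = sig.get(hashlib.md5(window).digest())
        if idx is not None:
            # The window matches a basis block: flush pending literal bytes,
            # then emit a reference to that block instead of its contents.
            if literal:
                ops.append(("literal", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", idx))
            pos += len(window)
        else:
            # No match at this offset: keep one byte as literal data and slide on.
            literal.append(new[pos])
            pos += 1
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops


def patch(basis: bytes, ops: list, block: int = BLOCK) -> bytes:
    """Rebuild the new file from the basis file plus the delta instructions."""
    out = bytearray()
    for kind, value in ops:
        if kind == "copy":
            out += basis[value * block:(value + 1) * block]
        else:
            out += value
    return bytes(out)


old = b"The quick brown fox jumps over the lazy dog."
new = b"The quick brown cat jumps over the lazy dog!"

sig = signature(old)                   # needs only the old file
ops = delta(sig, new)                  # needs only the signature and the new file
print(patch(old, ops) == new)          # True
```

Note that `delta` never touches the old file itself, only its signature, which is the property the paragraph above describes.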

rdiff deltas are usually less compact and also slower to produce than xdeltas or regular text diffs. If it is possible to have both the old and new files present when computing the delta, xdelta will generally produce a much smaller file. If the files being compared are plain text, then GNU diff is usually a better choice, as the diffs can be viewed by humans and applied as inexact matches.

rdiff comes into its own when it is not convenient to have both files present at the same time. One example of this is that the two files are on separate machines, and you want to transfer only the differences. Another example is when one of the files has been moved to archive or backup media, leaving only its signature.

Symbolically

signature(basis-file) -> sig-file

delta(sig-file, new-file) -> delta-file

patch(basis-file, delta-file) -> recreated-file

Use patterns

A typical application of the rsync algorithm is to transfer a file A2 from a machine A to a machine B which has a similar file A1. This can be done as follows:

  1. B generates the rdiff signature of A1. Call this S1. B sends the signature to A. (The signature is usually much smaller than the file it describes.)
  2. A computes the rdiff delta between S1 and A2. Call this delta D. A sends the delta to B.
  3. B applies the delta to recreate A2. In cases where A1 and A2 contain runs of identical bytes, rdiff should give a significant space saving.

source
