Why doesn't Gzip compression eliminate duplicate chunks of data?

Nicole Hamilton correctly notes that gzip won't find distant duplicate data due to its small dictionary size.

bzip2 has the same problem: its block size is capped at 900 KB, so it can't match data that is further apart than that.

Instead, try:

LZMA/LZMA2 algorithm (xz, 7z)

The LZMA algorithm is in the same family as Deflate, but uses a much larger dictionary (customizable; the xz -9 preset uses 64 MiB, and it can be raised much further). The xz utility, which should be installed by default on most recent Linux distros, is similar to gzip and uses LZMA.

As LZMA detects longer-range redundancy, it will be able to deduplicate your data here. However, it is slower than Gzip.
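
For instance, a minimal sketch (the directory name and the 192 MiB dictionary are illustrative; pick a dictionary at least as large as the distance between your duplicate chunks, within available RAM):

# Pack the tree and compress it with LZMA2 using an enlarged dictionary,
# so matches can reach much further back than gzip's 32 KB window.
tar -cf - mydata/ | xz --lzma2=preset=9e,dict=192MiB > mydata.tar.xz

Keep in mind that decompression also needs roughly the dictionary size in memory.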

Another option is 7-zip (7z, in the p7zip package), an archiver (rather than a single-stream compressor) written by the author of LZMA; it uses LZMA by default. When archiving to its .7z format, the 7-zip archiver runs its own deduplication at the file level (looking at files with the same extension). This means that if you're willing to replace tar with 7z, you get identical files deduplicated. However, 7z does not preserve nanosecond timestamps, permissions, or xattrs, so it may not suit your needs.
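
A minimal sketch (archive and directory names are illustrative; -mx=9 just selects the highest preset, LZMA/LZMA2 is already the default codec):

# Archive the directory with 7z instead of tar + gzip;
# identical files are effectively deduplicated inside the solid .7z archive
7z a -t7z -mx=9 backup.7z mydata/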

lrzip

lrzip is a compressor that preprocesses the data to remove long-distance redundancy before feeding it to a conventional algorithm like Gzip/Deflate, bzip2, lzop, or LZMA. For the sample data you give here, it's not necessary; it's useful for when the input data is larger than what can fit in memory.

For this sort of data (duplicated incompressible chunks), you should use lzop compression (very fast) with lrzip, because there's no benefit to trying harder to compress completely random data once it's been deduplicated.
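
A sketch of that combination (file and directory names are illustrative; lrzip's -l switch selects its fast LZO backend, the same algorithm lzop uses):

tar -cf mydata.tar mydata/
lrzip -l mydata.tar    # long-range dedup first, then fast LZO; writes mydata.tar.lrz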

Bup and Obnam

Since you tagged the question backup, if your goal here is backing up data, consider using a deduplicating backup program like Bup or Obnam.
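
For example, a minimal bup run (the repository defaults to ~/.bup unless BUP_DIR is set; the path and backup name here are placeholders):

bup init                        # create the repository
bup index /srv/mydata           # scan the files to back up
bup save -n mydata /srv/mydata  # store them; repeated chunks are written only once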


gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. It's a lossless data compression algorithm that works by transforming the input stream into compressed symbols, using a dictionary built on the fly and watching for duplicates. But it can't find duplicates separated by more than 32 KB, so expecting it to spot duplicates 1 MB apart isn't realistic.
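
You can see that 32 KB limit with a quick experiment (file names are arbitrary): concatenate two identical 1 MB random chunks and gzip the result; because the second copy starts 1 MB after the first, gzip cannot reference it and the output stays at roughly 2 MB.

dd if=/dev/urandom of=chunk.bin bs=1M count=1    # 1 MB of incompressible data
cat chunk.bin chunk.bin > double.bin             # two identical copies, 1 MB apart
gzip -c double.bin > double.bin.gz               # -c keeps the original file
ls -l double.bin double.bin.gz                   # .gz is still about 2 MB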


gzip won't find the duplicates, and even xz with a huge dictionary size won't. What you can do is use mksquashfs, which will indeed save the space taken up by duplicates.

Some quick test results with xz and mksquashfs, using three 64 MB random binary files of which two are identical:

Setup:

mkdir test
cd test
dd if=/dev/urandom of=test1.bin count=64k bs=1k    # 64 MB of random data
dd if=/dev/urandom of=test2.bin count=64k bs=1k    # another, different 64 MB
cp test{2,3}.bin                                   # test3.bin is an exact copy of test2.bin
cd ..

Squashfs:

mksquashfs test/ test.squash
> test.squash - 129M

xz:

XZ_OPT='-v --memlimit-compress=6G --memlimit-decompress=512M --lzma2=preset=9e,dict=512M --extreme -T4 ' tar -cJvf test.tar.xz test/
> test.tar.xz - 193M
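
If you go the squashfs route, the resulting image is easy to read back with the same squashfs-tools package (paths are illustrative):

unsquashfs -d restored/ test.squash       # extract everything to ./restored
sudo mount -o loop,ro test.squash /mnt    # or browse the image read-only in place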