Why are tar archives switching to xz compression to replace bzip2, and what about gzip?

For distributing archives over the Internet, the following things are generally a priority:

  1. Compression ratio (i.e., how small the compressor makes the data);
  2. Decompression time (CPU requirements);
  3. Decompression memory requirements; and
  4. Compatibility (how widespread the decompression program is).

Compression memory & CPU requirements aren't very important, because you can use a large, fast machine for that, and you only have to do it once.

Compared to bzip2, xz has a better compression ratio and a lower (better) decompression time. However, at the compression settings typically used, it requires more memory to decompress[1] and is somewhat less widespread. Gzip uses less memory than either.
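
If you want to check the ratio claim on your own data, here is a minimal sketch using Python's standard-library bindings for the same algorithms (the input path is just a placeholder; results vary a lot with the data):

    import bz2, gzip, lzma

    # Placeholder: any large, representative file
    data = open("linux.tar", "rb").read()

    print("gzip -9 :", len(gzip.compress(data, compresslevel=9)))
    print("bzip2 -9:", len(bz2.compress(data, compresslevel=9)))
    for preset in (0, 3, 6, 9):
        print(f"xz -{preset}   :", len(lzma.compress(data, preset=preset)))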

So, both gzip- and xz-compressed archives are posted, allowing you to pick (see the sketch after this list):

  • Need to decompress on a machine with very limited memory (<32 MB): gzip. Granted, not very likely when talking about kernel sources.
  • Need to decompress with only minimal tools available: gzip
  • Want to save download time and/or bandwidth: xz
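
If you script the extraction, the format barely changes the code: Python's tarfile, for example, detects the compression automatically. A minimal sketch, with a placeholder filename:

    import tarfile

    # "r:*" auto-detects gzip, bzip2, xz, or no compression at all
    with tarfile.open("linux-6.6.tar.xz", "r:*") as tar:
        tar.extractall("linux-6.6-src")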

There isn't really a realistic combination of factors that'd get you to pick bzip2, so it's being phased out.

I looked at compression comparisons in a blog post. I didn't attempt to replicate the results, and I suspect some of them have changed (mostly, I expect xz has improved, as it's the newest).

(There are some specific scenarios where a good bzip2 implementation may be preferable to xz: bzip2 can compress a file with lots of zeros, or genome DNA sequences, better than xz. Newer versions of xz now have an (optional) block mode which allows data recovery after the point of corruption, as well as parallel compression and [in theory] parallel decompression. Previously, only bzip2 offered these.[2] However, none of these is relevant for kernel distribution.)


1: In archive size, xz -3 is around bzip2 -9; at that setting, xz uses less memory to decompress. But xz -9 (as used, e.g., for Linux kernel tarballs) uses much more than bzip2 -9. (And even xz -0 needs more than gzip -9.)

2: F21 System Wide Change: lbzip2 as default bzip2 implementation


First of all, this question is not directly related to tar. Tar just creates an uncompressed archive; the compression is applied afterwards.
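
A minimal sketch of that separation in Python (directory and file names are just examples): the archive is built first, uncompressed, and the compressor then only sees a single byte stream:

    import io, lzma, tarfile

    # Step 1 (archiving): tar packs the files into one uncompressed stream.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add("src")                         # example directory

    # Step 2 (compression): applied to the tar stream as a whole.
    with open("src.tar.xz", "wb") as out:
        out.write(lzma.compress(buf.getvalue(), preset=6))

In practice, tools do both steps in one go (e.g. GNU tar's -z, -j and -J options, or tarfile's "w:gz", "w:bz2" and "w:xz" modes), but the two stages stay conceptually independent.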

Gzip is known to be relatively fast when compared to LZMA2 and bzip2. If speed matters, gzip (especially the multithreaded implementation pigz) is often a good compromise between compression speed and compression ratio, although there are alternatives if speed is the main concern (e.g. LZ4).

However, if a high compression ratio is desired, LZMA2 beats bzip2 in almost every aspect. Compression is often slower, but it decompresses much faster and provides a much better compression ratio, at the cost of higher memory usage.
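
A small timing sketch to see that speed trade-off for yourself (placeholder input path; absolute numbers depend heavily on the machine and the data):

    import bz2, gzip, lzma, time

    def timed(func, *args):
        start = time.perf_counter()
        result = func(*args)
        return result, time.perf_counter() - start

    data = open("some-large-file", "rb").read()   # placeholder input

    for name, mod in [("gzip", gzip), ("bzip2", bz2), ("xz", lzma)]:
        blob, t_comp = timed(mod.compress, data)
        _, t_decomp = timed(mod.decompress, blob)
        print(f"{name:5s} compress={t_comp:.2f}s "
              f"decompress={t_decomp:.2f}s ratio={len(data)/len(blob):.2f}")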

There is not much reason to use bzip2 any more, except for backwards compatibility. Furthermore, LZMA2 was designed with multithreading in mind, and many implementations make use of multicore CPUs by default (unfortunately, xz on Linux does not do this yet). This makes sense, since clock speeds won't increase any more but the number of cores will.

There are multithreaded bzip2 implementations (e.g. pbzip2), but they are often not installed by default. Also note that multithreaded bzip2 only really pays off while compressing: in contrast to LZMA2, decompression uses a single thread if the file was compressed with a single-threaded bzip2. Parallel bzip2 variants can only leverage multicore CPUs for decompression if the file was compressed with a parallel bzip2 version, which is often not the case.
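
To illustrate the mechanism (a rough sketch of the idea, not how pbzip2 is actually implemented): parallel bzip2 tools split the input into chunks, compress each chunk as an independent bzip2 stream in its own worker, and concatenate the results; concatenated bzip2 streams still form a valid .bz2 file. The chunk size, worker count and input path below are arbitrary choices:

    import bz2
    from concurrent.futures import ProcessPoolExecutor

    CHUNK = 900_000   # roughly one bzip2 block per chunk (arbitrary)

    def compress_chunk(chunk: bytes) -> bytes:
        # Each chunk becomes an independent bzip2 stream.
        return bz2.compress(chunk, compresslevel=9)

    def parallel_bzip2(data: bytes, workers: int = 4) -> bytes:
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            # Concatenating the streams yields a valid multi-stream .bz2 file.
            return b"".join(pool.map(compress_chunk, chunks))

    if __name__ == "__main__":
        data = open("some-large-file", "rb").read()   # placeholder input
        compressed = parallel_bzip2(data)
        assert bz2.decompress(compressed) == data     # handles multiple streams

The same chunking is what lets a parallel decompressor hand independent streams to separate workers, which is why parallel decompression generally needs a file that was itself produced by a parallel (multi-stream) bzip2.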


Short answer: xz is more efficient in terms of compression ratio, so it saves disk space and speeds up transfers over the network.
See this Quick Benchmark to compare the formats with practical tests.