Print archive file list instantly (without decompressing entire archive)

It's important to understand there's a trade-off here.

tar means tape archiver. On a tape, you mostly do sequential reading and writing. Tapes are rarely used nowadays, but tar is still used for its ability to read and write its data as a stream.

You can do:

tar cf - files | gzip | ssh host 'cd dest && gunzip | tar xf -'

You can't do that with zip or the like.

You can't even list the content of a zip archive without storing it locally in a seekable file first. Things like:

curl -s https://github.com/dwp-forge/columns/archive/v.2016-02-27.zip | unzip -l /dev/stdin

won't work.
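
By contrast, since tar is a stream format, the equivalent works fine on the fly (assuming the same GitHub repository; GitHub serves a .tar.gz at the corresponding archive URL):

curl -sL https://github.com/dwp-forge/columns/archive/v.2016-02-27.tar.gz |
  tar tzvf -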

To achieve that quick listing of the contents, zip and the like need to build an index. That index can be stored at the beginning of the file (in which case the archive can only be written to regular files, not streams), or at the end, which means the archiver has to remember all the archive members before writing the index out at the end, and also means a truncated archive may not be recoverable.
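
You can see both consequences of that end-of-file index with a quick test (a sketch; file.zip and file.tar are placeholders for any existing archives):

head -c 102400 file.zip > truncated.zip
unzip -l truncated.zip    # fails: the central directory at the end was cut off
head -c 102400 file.tar > truncated.tar
tar tvf truncated.tar     # lists what it can reach, then complains about EOF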

That also means archive members have to be compressed individually, which gives a much lower compression ratio, especially if there are a lot of small files.
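
You can see the effect with many small, similar files (a sketch; /etc/hosts is just a stand-in for any small file):

mkdir demo && for i in $(seq 1000); do cp /etc/hosts "demo/f$i"; done
tar cf - demo | gzip | wc -c               # solid: redundancy across members is exploited
zip -qr demo.zip demo && wc -c < demo.zip  # per-member: each copy compressed on its own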

Another drawback with formats like zip is that the archiving is tied to the compressing: you can't choose the compression algorithm. See how tar archives used to be compressed with compress (tar.Z), then with gzip, then bzip2, then xz as newer, more performant compression algorithms were devised. The same goes for encryption. Who would trust zip's encryption nowadays?
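
That decoupling means you can swap in any stream compressor without changing the archive format; for instance (using zstd purely as an example of a newer compressor):

tar cf - dir | xz > dir.tar.xz
tar cf - dir | zstd > dir.tar.zst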

Now, the problem with tar.gz archives is not so much that you need to uncompress them (uncompressing is often faster than reading off a disk; you'll probably find that listing the content of a large tgz archive is quicker than listing the same one uncompressed when not cached in memory), but that you need to read the whole archive.

Not being able to read the index quickly is not really a problem. If you do foresee needing to read the table of contents of an archive often, you can just store that list in a separate file. For instance, at creation time, you can do:

tar cvvf - dir 2> file.tar.xz.list | xz > file.tar.xz
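
Consulting that stored listing later is then a plain (and instant) file read ('some/member' below is a placeholder name):

cat file.tar.xz.list
grep 'some/member' file.tar.xz.list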

A bigger problem IMO is that, because of the sequential nature of the archive, you can't extract individual files without reading the whole beginning section of the archive that leads up to them. IOW, you can't do random reads within the archive.

Now, for seekable files, it doesn't have to be that way.

If you compress your tar archive with gzip, the archive is compressed as a whole: the compression algorithm uses data seen at the beginning to compress what follows, so you have to start from the beginning to uncompress.

But the xz format can be configured to compress data in separate individual chunks (large enough that the compression stays efficient), which means that, as long as you keep an index of those compressed chunks at the end, for seekable files you can access the uncompressed data randomly (in chunks at least).
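
Plain xz can produce such multi-block streams too, for instance via its --block-size option (a sketch; the 16MiB figure is arbitrary):

xz -T0 --block-size=16MiB < file.tar > file.tar.xz
xz -lvv file.tar.xz    # the double-verbose listing shows the individual blocks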

pixz (parallel xz) uses that capability when compressing tar archives to also add an index of the start of each member of the archive at the end of the xz file.

So, for seekable files that have been compressed with pixz, not only can you get a list of the content of the tar archive instantly (without metadata though):

pixz -l file.tar.xz

But you can also extract individual elements without having to read the whole archive:

pixz -x archive/member.txt < file.tar.xz | tar xpf -

Now, as to why things like 7z or zip are rarely used on Unix: it's mostly because they can't archive Unix files. They've been designed for other operating systems, and you can't do a faithful backup of data using them. They can't store metadata like owner (id and name) or permissions; they can't store symlinks, devices, fifos...; they can't store information about hard links, or other metadata like extended attributes or ACLs.
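
A quick check of the symlink point (assuming Info-ZIP's zip; it has a nonstandard -y extension to store the link itself, but the default, portable behaviour is to follow it):

ln -s /etc/hosts thelink
zip -q test.zip thelink    # stores a copy of /etc/hosts under the name thelink
tar cf test.tar thelink
tar tvf test.tar           # records the symlink itself: thelink -> /etc/hosts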

Some of them can't even store members with arbitrary names (some will choke on backslash, newline, colon, or non-ASCII filenames) (some tar formats have limitations too).

Never uncompress a tgz/tar.xz file to disk!

In case it is not obvious, one doesn't use a tgz or tar.bz2, tar.xz... archive as:

unxz file.tar.xz    # writes a huge, uncompressed file.tar to disk
tar tvf file.tar
xz file.tar         # then recompresses it, rewriting the whole thing again

If you've got an uncompressed .tar file lying about on your file system, it means you've done something wrong.

The whole point of xz/bzip2/gzip being stream compressors is that they can be used on the fly, in pipelines, as in:

unxz < file.tar.xz | tar tvf -

Though modern tar implementations know how to invoke unxz/gunzip/bzip2 by themselves, so:

tar tvf file.tar.xz

would generally also work (and, again, uncompress the data on the fly rather than storing an uncompressed version of the archive on disk).
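
With GNU tar you can also name the decompression filter explicitly rather than relying on suffix detection:

tar tvJf file.tar.xz                              # -J explicitly selects xz
tar --use-compress-program=xz -tvf file.tar.xz    # tar runs "xz -d" when reading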

Example

Here's a Linux kernel source tree compressed with various formats.

$ ls --block-size=1 -sS1
666210304 linux-4.6.tar
173592576 linux-4.6.zip
 97038336 linux-4.6.7z
 89468928 linux-4.6.tar.xz

First, as noted above, the 7z and zip ones are slightly different because they can't store the few symlinks in there and are missing most of the metadata.

Now a few timings to list the content after having flushed the system caches:

$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
$ time tar tvf linux-4.6.tar > /dev/null
tar tvf linux-4.6.tar > /dev/null  0.56s user 0.47s system 13% cpu 7.428 total
$ time tar tvf linux-4.6.tar.xz > /dev/null
tar tvf linux-4.6.tar.xz > /dev/null  8.10s user 0.52s system 118% cpu 7.297 total
$ time unzip -v linux-4.6.zip > /dev/null
unzip -v linux-4.6.zip > /dev/null  0.16s user 0.08s system 86% cpu 0.282 total
$ time 7z l linux-4.6.7z > /dev/null
7z l linux-4.6.7z > /dev/null  0.51s user 0.15s system 89% cpu 0.739 total

You'll notice that listing the tar.xz file is quicker than the .tar one, even on this 7-year-old PC, as reading those extra megabytes from the disk takes longer than reading and decompressing the smaller file.

Then, OK, listing the archives with 7z or zip is quicker, but that's a non-problem: as I said, it's easily worked around by storing the file list alongside the archive:

$ tar tvf linux-4.6.tar.xz | xz > linux-4.6.tar.xz.list.xz
$ ls --block-size=1 -sS1 linux-4.6.tar.xz.list.xz
434176 linux-4.6.tar.xz.list.xz
$ time xzcat linux-4.6.tar.xz.list.xz > /dev/null
xzcat linux-4.6.tar.xz.list.xz > /dev/null  0.05s user 0.00s system 99% cpu 0.051 total

Even faster than 7z or zip, even after dropping caches. You'll also notice that the cumulative size of the archive and its index is still smaller than the zip or 7z archives.

Or use the pixz indexed format:

$ xzcat linux-4.6.tar.xz | pixz -9  > linux-4.6.tar.pixz
$ ls --block-size=1 -sS1 linux-4.6.tar.pixz
89841664 linux-4.6.tar.pixz
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
$ time pixz -l linux-4.6.tar.pixz > /dev/null
pixz -l linux-4.6.tar.pixz > /dev/null  0.04s user 0.01s system 57% cpu 0.087 total

Now, to extract individual elements of the archive, the worst case scenario for a tar archive is when accessing the last element:

$ xzcat linux-4.6.tar.xz.list.xz|tail -1
-rw-rw-r-- root/root      5976 2016-05-15 23:43 linux-4.6/virt/lib/irqbypass.c
$ time tar xOf linux-4.6.tar.xz linux-4.6/virt/lib/irqbypass.c | wc
    257     638    5976
tar xOf linux-4.6.tar.xz linux-4.6/virt/lib/irqbypass.c  7.27s user 1.13s system 115% cpu 7.279 total
wc  0.00s user 0.00s system 0% cpu 7.279 total

That's pretty bad as it needs to read (and uncompress) the whole archive. Compare with:

$ time unzip -p linux-4.6.zip linux-4.6/virt/lib/irqbypass.c | wc
    257     638    5976
unzip -p linux-4.6.zip linux-4.6/virt/lib/irqbypass.c  0.02s user 0.01s system 19% cpu 0.119 total
wc  0.00s user 0.00s system 1% cpu 0.119 total

My version of 7z doesn't seem to be able to do random access, so it ends up even worse than tar.xz:

$ time 7z e -so linux-4.6.7z linux-4.6/virt/lib/irqbypass.c 2> /dev/null | wc
    257     638    5976
7z e -so linux-4.6.7z linux-4.6/virt/lib/irqbypass.c 2> /dev/null  7.28s user 0.12s system 89% cpu 8.300 total
wc  0.00s user 0.00s system 0% cpu 8.299 total

Now, since we have our pixz-generated one from earlier:

$ time pixz < linux-4.6.tar.pixz -x linux-4.6/virt/lib/irqbypass.c  | tar xOf - | wc
    257     638    5976
pixz -x linux-4.6/virt/lib/irqbypass.c < linux-4.6.tar.pixz  1.37s user 0.06s system 84% cpu 1.687 total
tar xOf -  0.00s user 0.01s system 0% cpu 1.693 total
wc  0.00s user 0.00s system 0% cpu 1.688 total

It's faster but still relatively slow, because the archive is made up of a few large blocks:

$ pixz -tl linux-4.6.tar.pixz
 17648865 / 134217728
 15407945 / 134217728
 18275381 / 134217728
 19674475 / 134217728
 18493914 / 129333248
   336945 /   2958887

So pixz still needs to read and uncompress a chunk of data up to ~19MB large.

We can make random access faster by building archives with smaller blocks, sacrificing a bit of disk space (with pixz, the -f option sets the block size as a fraction of the LZMA dictionary size, so -f0.25 below produces much smaller blocks than the default):

$ pixz -f0.25 -9 < linux-4.6.tar > linux-4.6.tar.pixz2
$ ls --block-size=1 -sS1 linux-4.6.tar.pixz2
93745152 linux-4.6.tar.pixz2
$ time pixz < linux-4.6.tar.pixz2 -x linux-4.6/virt/lib/irqbypass.c  | tar xOf - | wc
    257     638    5976
pixz -x linux-4.6/virt/lib/irqbypass.c < linux-4.6.tar.pixz2  0.17s user 0.02s system 98% cpu 0.189 total
tar xOf -  0.00s user 0.00s system 1% cpu 0.188 total
wc  0.00s user 0.00s system 0% cpu 0.187 total

  1. Why do people use it so much despite this drawback?

Corporate and academic admins often get noticed more when things break than appreciated when things run efficiently. Such environments breed fear of experimentation and scorn for novelty.

  2. What choice (I mean, what other software/tool) do I have if I want the "instant content listing" capability?

dar (Disk Archiver) offers a raft of tar-like features, plus enhancements such as speedy random access for compressed archives, AKA cataloguing, AKA indexing, AKA "instant content listing"...
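
A minimal sketch of what that looks like (assuming a reasonably recent dar; the names are placeholders):

dar -c mybackup -R /some/dir -z    # creates a gzip-compressed mybackup.1.dar
dar -l mybackup                    # instant listing via the built-in catalogue
dar -x mybackup -g some/file       # random-access extraction of a single member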

See also: Compression formats with good support for random access within archives?