Efficiently remove file(s) from large .tgz

With GNU tar, you can do:

pigz -d < file.tgz |
  tar --delete --wildcards -f - '*/prefix*.jpg' |
  pigz > newfile.tgz

With bsdtar:

pigz -d < file.tgz |
  bsdtar -cf - --exclude='*/prefix*.jpg' @- |
  pigz > newfile.tgz

(pigz is a multi-threaded implementation of gzip; plain gzip works as a drop-in replacement, only slower.)

You could overwrite the file in place like this:

{ pigz -d < file.tgz |
    tar --delete --wildcards -f - '*/prefix*.jpg' |
    pigz &&
    perl -e 'truncate STDOUT, tell STDOUT'
} 1<> file.tgz

But that's quite risky, especially if the result compresses less well than the original (in which case the second pigz may overwrite areas of the file that the first one has not read yet). An interruption midway also leaves you with a corrupted archive and no copy of the original.
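A less risky way to get the same end result is to write the filtered archive to a temporary file next to the original and rename it over the original only once it is complete. Below is a minimal sketch of that idea in Python; `rewrite_tgz_without` is a hypothetical helper name, and it assumes there is enough free disk space for a second copy of the archive. Python's gzip is single-threaded, so expect it to be slower than the pigz pipelines.

```python
import fnmatch
import os
import shutil
import tarfile
import tempfile


def rewrite_tgz_without(path, pattern):
    """Rewrite the gzipped tar at `path` without members matching `pattern`.

    Hypothetical helper: the result goes to a temporary file in the same
    directory and replaces the original only after it was written in full.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp, \
                tarfile.open(path, mode="r|gz") as source, \
                tarfile.open(fileobj=tmp, mode="w|gz") as dest:
            for member in source:
                # Note: '*' in fnmatch patterns also matches '/'.
                if fnmatch.fnmatchcase(member.name, pattern):
                    continue  # drop the unwanted member
                # Only regular files carry data in the stream.
                data = source.extractfile(member) if member.isreg() else None
                dest.addfile(member, data)
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)       # atomic on POSIX, same filesystem
    except BaseException:
        os.unlink(tmp_path)              # never leave a partial file behind
        raise
```

For example, `rewrite_tgz_without('file.tgz', '*/prefix*.jpg')` would leave file.tgz untouched if anything fails before the final rename.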


Don't discount the easy way: it may be fast enough for your purpose. With avfs to access the archive as a directory:

cd ~/.avfs/path/to/original.tar.gz\#
pax -w -s '/^.*\.jpg$//' | gzip >/path/to/filtered.tar.gz        # POSIX
tar -czf /path/to/filtered.tar.gz -s '/^.*\.jpg$//' .            # BSD
tar -czf /path/to/filtered.tar.gz --exclude='*.jpg' .            # GNU

With more primitive tools, first extract the files excluding the .jpg files, then create a new archive.

mkdir tmpdir && cd tmpdir
<../original.tar.gz gzip -d | pax -r -pe -s '/^.*\.jpg$//'
pax -w . | gzip >../filtered.tar.gz
cd .. && rm -rf tmpdir

If your tar has --exclude:

mkdir tmpdir && cd tmpdir
tar -xzf ../original.tar.gz --exclude='*.jpg'
tar -czf ../filtered.tar.gz .
cd .. && rm -rf tmpdir

This may, however, mangle file ownership and modes if you don't run it as root. For best results, use a temporary directory on a fast filesystem, such as tmpfs if you have one that's large enough.
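If you want to check whether an extract-and-repack round trip changed ownership or modes, you can compare the metadata recorded in the two archives. The following is only a sketch: `member_metadata` and `changed_metadata` are hypothetical helper names, and it assumes member names differ at most by a leading `./` (which repacking with `.` as the argument typically adds).

```python
import posixpath
import tarfile


def member_metadata(path):
    """Map normalized member name -> (mode, uid, gid, uname, gname)."""
    with tarfile.open(path, mode="r:*") as archive:
        return {
            # normpath() turns './a/b' into 'a/b' so both archives line up
            posixpath.normpath(m.name): (m.mode, m.uid, m.gid, m.uname, m.gname)
            for m in archive
        }


def changed_metadata(original, filtered):
    """Names present in both archives whose mode or ownership differs."""
    before = member_metadata(original)
    after = member_metadata(filtered)
    return sorted(n for n in after if n in before and after[n] != before[n])
```

An empty list from `changed_metadata('original.tar.gz', 'filtered.tar.gz')` suggests the surviving members kept their modes and owners.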

Support for archivers acting as a pass-through (i.e., reading an archive and writing an archive) tends to be limited. GNU tar can delete members from an archive with the --delete operation (“The --delete option has been reported to work properly when tar acts as a filter from stdin to stdout.”), and that's probably your best option.

You can make powerful archive filters in a few lines of Python. Its tarfile library can read and write from non-seekable streams, and you can use arbitrary code in Python to filter, rename, modify…

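#!/usr/bin/env python3
import re, sys, tarfile
# Use the binary stdin/stdout; 'r|*' and 'w|gz' are stream modes that never seek.
source = tarfile.open(fileobj=sys.stdin.buffer, mode='r|*')
dest = tarfile.open(fileobj=sys.stdout.buffer, mode='w|gz')
for member in source:
    if not (member.isreg() and re.match(r'.*\.jpg\Z', member.name)):
        sys.stderr.write(member.name + '\n')  # log each member that is kept
        # Only regular files carry data; extractfile() rejects links in stream mode.
        data = source.extractfile(member) if member.isreg() else None
        dest.addfile(member, data)
dest.close()
source.close()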
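To illustrate the "rename" part, here is a sketch of the same stream filter wrapped in a function that also passes each kept name through a caller-supplied function. `filter_tar` and its `skip_re` and `rename` parameters are hypothetical names chosen for this example. It assumes that a pax `path` record on a member, if present, would otherwise take priority over the new name when the header is written, so it drops that record when renaming.

```python
import re
import tarfile


def filter_tar(infile, outfile, skip_re=r'.*\.jpg\Z', rename=lambda name: name):
    """Copy a tar stream, dropping regular files that match skip_re.

    infile and outfile are binary file objects and need not be seekable.
    rename() maps each kept member name to the name stored in the output.
    """
    skip = re.compile(skip_re)
    with tarfile.open(fileobj=infile, mode='r|*') as source, \
            tarfile.open(fileobj=outfile, mode='w|gz') as dest:
        for member in source:
            if member.isreg() and skip.match(member.name):
                continue
            # Fetch the data handle before touching the header fields.
            data = source.extractfile(member) if member.isreg() else None
            new_name = rename(member.name)
            if new_name != member.name:
                member.name = new_name
                # Assumption: a stale pax 'path' record would win over name.
                member.pax_headers.pop('path', None)
            dest.addfile(member, data)
```

Called as `filter_tar(sys.stdin.buffer, sys.stdout.buffer, rename=lambda n: 'backup/' + n)`, it could sit in a pipeline the same way as the script above.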

With the tar that ships with macOS (bsdtar), you could do this:

tar -czf b.tgz --exclude '*.jpg' @a.tgz &&
  mv b.tgz a.tgz
