How can I filter the contents of a tar file, producing another tar file in the pipe?

bsdtar (based on libarchive) can filter tar archives (and some other archive formats) from stdin to stdout. It can, for example, pass through only the entries whose names match a pattern, and it can rename entries with s/old/new/ substitutions. It's already packaged for most distros, for example as bsdtar in Ubuntu.

sudo apt-get install bsdtar   # or aptitude, if you have it; on newer Debian/Ubuntu releases the package is libarchive-tools

# example from the man page:
bsdtar -c -f new.tar --include='*foo*' @old.tgz
# creates new.tar containing only those entries from old.tgz whose names contain the string 'foo'
bsdtar -czf - --include='*foo*' @-  # filter stdin to stdout, with gzip compression of output.

Note that bsdtar supports a wide choice of compression formats for input and output, so you don't have to manually pipe through gunzip / lz4 yourself. You can use - for stdin with the @tarfile syntax, and/or - for stdout as usual.
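For comparison, here is a rough sketch of the same idea (keep only matching entries, optionally rename them) using Python's tarfile module. It is not how bsdtar works internally, and the glob matching via fnmatch only approximates bsdtar's --include patterns; the function name filter_tar and its parameters are made up for illustration.

```python
import fnmatch
import re
import tarfile


def filter_tar(src, dst, include="*", rename=None):
    """Copy members whose names match `include` from src to dst.

    src and dst are binary file objects (hypothetical helper, not a bsdtar API).
    rename is an optional (regex, replacement) pair applied once per name.
    """
    # Stream modes ("r|*", "w|gz") never seek, so pipes work as well as files.
    with tarfile.open(fileobj=src, mode="r|*") as source, \
         tarfile.open(fileobj=dst, mode="w|gz") as dest:
        for info in source:
            if not fnmatch.fnmatchcase(info.name, include):
                continue
            # Only regular files carry data; other members are header-only.
            data = source.extractfile(info) if info.isreg() else None
            if rename is not None:
                info.name = re.sub(rename[0], rename[1], info.name, count=1)
                # A pax "path" record read from the source would otherwise
                # take priority over the new name when the header is written.
                info.pax_headers.pop("path", None)
            dest.addfile(info, data)
```

To use it in a pipe, one would pass sys.stdin.buffer and sys.stdout.buffer as src and dst.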


My searching also found this streaming tar modification tool, which appears to want you to define the archive changes in JavaScript. (I think the whole thing is written in JS.)

https://github.com/mafintosh/tar-stream


The easiest way would be to copy the whole archive; I presume you don't want to do that because it's too large.

The usual command line tools (tar, pax) don't support copying members of an archive to another archive.

If you didn't need to preserve ownership, I'd suggest using FUSE filesystems. You can use archivemount to mount an archive as a filesystem; do this for the source archive, and run tar on the mounted filesystem.

archivemount some.tar.gz mnt
cd mnt
tar -cz subdir | ssh example.com tar -xz
fusermount -u mnt

Alternatively, you can use AVFS:

mountavfs
cd ~/.avfs$PWD/some.tar.gz\#
tar -cz subdir | ssh example.com tar -xz

Alternatively, you can run tar on the original archive and extract to the remote machine over SSHFS.

sshfs example.com: mnt
cd mnt
tar -xf /path/to/some.tar.gz subdir
fusermount -u mnt

However, all of these methods are cumbersome if you need to preserve ownership. They all involve extracting to a file on the local machine, so this file's ownership has to be the intended remote ownership. This requires running as root, and it may not give the intended result if the files are owned by accounts whose names or IDs differ between the local machine and the remote host.

Python's tarfile library provides a fairly easy way to manipulate tar members, so you can shuffle them from one tar file to another. It supports POSIX standard formats (ustar, pax) as well as some GNU extensions. Here's an untested Python script that reads a tar file (possibly compressed with gzip or bzip2) on its standard input and writes a tar file compressed with bzip2 on its standard output. A member of the source archive is copied if its name starts with the argument passed to the script.

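#!/usr/bin/env python3
import sys, tarfile
# The 'r|*' and 'w|bz2' stream modes never seek, so they work on pipes.
source = tarfile.open(fileobj=sys.stdin.buffer, mode='r|*')
destination = tarfile.open(fileobj=sys.stdout.buffer, mode='w|bz2')
for info in source:
    if info.name.startswith(sys.argv[1]):
        # addfile() writes only the header unless it is also given the data;
        # only regular files have data to copy.
        data = source.extractfile(info) if info.isreg() else None
        destination.addfile(info, data)
destination.close()
source.close()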

To be invoked as

tar_filter <some.tar.gz subdir/ | ssh example.com tar -xj
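The appeal of copying members directly is that ownership travels in the tar headers and never touches the local filesystem. The following in-memory sketch illustrates that under some assumptions: the helper names copy_members and make_sample, and the IDs and account names, are invented for the example and stand in for a real archive.

```python
import io
import tarfile


def copy_members(src_bytes, prefix):
    """Return a bzip2 tar (as bytes) holding the members whose names start with prefix."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(src_bytes), mode="r|*") as source, \
         tarfile.open(fileobj=out, mode="w|bz2") as destination:
        for info in source:
            if info.name.startswith(prefix):
                # Only regular files carry data in the stream.
                data = source.extractfile(info) if info.isreg() else None
                destination.addfile(info, data)
    return out.getvalue()


def make_sample():
    """Build a tiny archive whose member belongs to a made-up remote account."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo("subdir/file.txt")
        info.size = 5
        info.uid, info.gid = 4242, 4243                        # hypothetical IDs
        info.uname, info.gname = "remoteuser", "remotegroup"   # hypothetical names
        tar.addfile(info, io.BytesIO(b"hello"))
    return buf.getvalue()


if __name__ == "__main__":
    filtered = copy_members(make_sample(), "subdir/")
    with tarfile.open(fileobj=io.BytesIO(filtered), mode="r:bz2") as tar:
        for member in tar:
            # Ownership fields come straight from the source headers.
            print(member.name, member.uid, member.gid, member.uname, member.gname)
```

No local account named remoteuser has to exist for this to work, and nothing needs to run as root, because the numeric IDs and names are only ever copied from one header to another.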
