Is there a smarter tar or cpio out there for efficiently retrieving a file stored in the archive?

Solution 1:

tar (and cpio and afio and pax and similar programs) are stream-oriented formats - they are intended to be streamed directly to a tape or piped into another process. While, in theory, it would be possible to add an index at the end of the file/stream, I don't know of any version that does (it would be a useful enhancement, though).
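To make the cost concrete (archive and member names here are just examples): even when you ask tar for a single member, it has to read through the archive from the beginning until it happens upon that member, because there is no index to seek to.

    # extracts only one member, but still scans the archive sequentially
    # from the start, since classic tar has no index to jump to
    tar -xf backup.tar home/user/wanted-file.txt

On a multi-gigabyte archive, that sequential scan is what dominates the extraction time.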

It won't help with your existing tar or cpio archives, but there is another tool, dar ("disk archive"), that does create archive files containing such an index and can give you fast direct access to individual files within the archive.

If dar isn't included with your Unix/Linux distribution, you can find it at:

http://dar.linux.free.fr/
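A rough sketch of what that looks like in practice (the archive name and paths are made up, and dar's option syntax differs from tar's, so check its man page):

    # archive /home/user; dar writes backup.1.dar with a built-in catalogue
    dar -c backup -R /home/user
    # restore a single path; dar uses the catalogue to seek to it directly
    dar -x backup -R /tmp/restore -g docs/report.odt

The catalogue is what lets dar skip straight to the requested entry instead of scanning the whole archive.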

Solution 2:

You could use SquashFS for such archives. It is

  • designed to be accessed using a FUSE driver (although a traditional kernel mount interface exists)
  • compressed (the larger the block size, the more efficient)
  • included in the Linux kernel
  • stores UIDs/GIDs and creation time
  • endianness-aware, therefore quite portable

The only drawback I know of is that it is read-only.

http://squashfs.sourceforge.net/
http://www.tldp.org/HOWTO/SquashFS-HOWTO/whatis.html
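As a sketch (directory and image names are examples; mksquashfs and squashfuse come from the squashfs-tools and squashfuse packages):

    # pack a directory tree into a compressed, indexed image
    mksquashfs /data/project project.sqsh
    # mount it read-only; individual files are then accessed directly
    sudo mount -t squashfs -o loop project.sqsh /mnt/project
    # or, without root, via the FUSE driver
    squashfuse project.sqsh /mnt/project

Once mounted, pulling out one file is an ordinary filesystem read rather than a scan of the whole archive.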


Solution 3:

While it doesn't store an index, star is purported to be faster than tar. Plus it supports longer filenames and has better support for file attributes.

As I'm sure you're aware, decompressing the file takes time and would likely be a factor in extraction speed even if there were an index.

Edit: You might also want to take a look at xar. It has an XML header that contains information about the files in the archive.

From the referenced page:

Xar's XML header allows it to contain arbitrary metadata about files contained within the archive. In addition to the standard unix file metadata such as the size of the file and its modification and creation times, xar can store information such as ext2fs and hfs file bits, unix flags, references to extended attributes, Mac OS X Finder information, Mac OS X resource forks, and hashes of the file data.
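xar's command line is tar-like; a quick sketch (archive and file names are examples):

    # create an archive; the XML table of contents sits at the front
    xar -cf project.xar project/
    # listing reads just the table of contents, not the whole archive
    xar -tf project.xar
    # extract a single entry
    xar -xf project.xar project/docs/report.txt

Because the table of contents is stored up front, listing and locating members does not require reading the file data that follows it.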


Solution 4:

Thorbjørn Ravn Andersen is right. GNU tar creates "seekable" archives by default, but it does not use that information when reading them unless the -n option is given. With the -n option I just extracted a 7 GB file from a 300 GB archive in the time required to read/write 7 GB; without -n it took more than an hour and produced no result.

I'm not sure how compression affects this. My archive was not compressed. Compressed archives are not "seekable" because current (1.26) GNU tar offloads compression to an external program.
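For reference, the relevant invocation (GNU tar; archive and member names are examples):

    # -n / --seek tells GNU tar the archive is seekable, so it can skip
    # over members it doesn't need instead of reading them in full;
    # this only helps with uncompressed archives on seekable media
    tar -n -xf huge.tar path/to/large-file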


Solution 5:

The only archive format I know of that stores an index is ZIP - I know because I've had to reconstruct corrupted indexes more than once.
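That index is the central directory at the end of the file, which is why a single member can be pulled out without reading the rest of the archive (names are examples):

    # list members by reading only the central directory
    unzip -l big.zip
    # extract one member; unzip seeks straight to its entry
    unzip big.zip docs/report.txt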