What method does unzip use to find a single file in an archive?

When searching for a single file in a large archive, it uses method 1, which you can see using strace:

open("dataset.zip", O_RDONLY)           = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920)    = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive:  dataset.zip\n", 22) = 22
lseek(3, 943718400, SEEK_SET)           = 943718400
read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET)           = 943722880
read(3, "\3\f\225P\\ux\v\0\1\4\350\3\0\0\4\350\3\0\0", 20) = 20
lseek(3, 943718400, SEEK_SET)           = 943718400
read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET)           = 849346560
read(3, "D\262nv\210\343\240C\24\227\344\367q\300\223\231\306\330\275\266\213\276M\7I'&35\2\234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790)    = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt        "..., 37) = 37
read(3, "\275\3279Y\206\223\217}\355W%:\220YNT\0\257\260z^\361T\242\2\370\21\336\372+\306\310"..., 8192) = 8192

unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.
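The offset is not guessed: each member's position is recorded in the central directory. As a rough illustration, here is a sketch that uses Python's standard zipfile module rather than unzip's own code (the helper name member_offset is made up for this example). The lseek targets in the trace are multiples of 8192, so the number it reports should land inside the block that unzip seeks to rather than necessarily match the lseek argument exactly.

```python
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"  # signature that starts every local file header


def member_offset(archive_path, member_name):
    """Return the byte offset of member_name's local file header.

    Hypothetical helper for illustration: zipfile parses the central
    directory and exposes the recorded offset as ZipInfo.header_offset.
    Raises KeyError if the member is not in the archive.
    """
    with zipfile.ZipFile(archive_path) as zf:
        offset = zf.getinfo(member_name).header_offset

    # Sanity check: the local header signature should sit at that offset.
    with open(archive_path, "rb") as f:
        f.seek(offset)
        if f.read(4) != LOCAL_HEADER_SIG:
            raise ValueError("no local file header at the recorded offset")
    return offset
```

For the archive in the question, the call would be member_offset("dataset.zip", "rand-28.txt").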

The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:

/*---------------------------------------------------------------------------
    Find and process the end-of-central-directory header.  UnZip need only
    check last 65557 bytes of zipfile:  comment may be up to 65535, end-of-
    central-directory record is 18 bytes, and signature itself is 4 bytes;
    add some to allow for appended garbage.  Since ZipInfo is often used as
    a debugging tool, search the whole zipfile if zipinfo_mode is true.
  ---------------------------------------------------------------------------*/
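The scan described in that comment can be sketched in a few lines of Python. This is an illustration of the idea, not a port of unzip's code: the function name find_eocd and the returned dictionary are my own, Zip64 archives are not handled, and a comment that happens to contain the signature bytes would fool it.

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory signature
EOCD_FIXED_SIZE = 22       # 4-byte signature + 18 bytes of fixed fields
MAX_TAIL = 65557           # figure quoted in the unzip comment above


def find_eocd(path):
    """Locate the end-of-central-directory record by scanning the file tail.

    Hypothetical helper: returns a dict with the record's own offset plus
    the central directory's offset, size and entry count.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)                      # seek to end to learn the size
        size = f.tell()
        start = max(0, size - MAX_TAIL)   # only the tail is ever read
        f.seek(start)
        tail = f.read()

    pos = tail.rfind(EOCD_SIG)            # last occurrence wins
    if pos < 0 or pos + EOCD_FIXED_SIZE > len(tail):
        raise ValueError("end-of-central-directory record not found")

    (_sig, _disk, _cd_disk, _entries_on_disk, total_entries,
     cd_size, cd_offset, comment_len) = struct.unpack(
        "<4sHHHHIIH", tail[pos:pos + EOCD_FIXED_SIZE])

    return {
        "eocd_offset": start + pos,
        "cd_offset": cd_offset,
        "cd_size": cd_size,
        "total_entries": total_entries,
        "comment_len": comment_len,
    }
```

However large the archive is, this reads at most 65557 bytes, which is consistent with the trace showing only a short read near the end of dataset.zip before the jump to the member.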

Actually, it's a mixture: unzip reads some data from a known location, then reads data blocks related to (but not identical to) the target entry in the zip file.

The design of zip/unzip is explained in comments in the source files. Here's the pertinent one from extract.c:

/*--------------------------------------------------------------------------- 
    The basic idea of this function is as follows.  Since the central di- 
    rectory lies at the end of the zipfile and the member files lie at the 
    beginning or middle or wherever, it is not very desirable to simply 
    read a central directory entry, jump to the member and extract it, and 
    then jump back to the central directory.  In the case of a large zipfile 
    this would lead to a whole lot of disk-grinding, especially if each mem- 
    ber file is small.  Instead, we read from the central directory the per- 
    tinent information for a block of files, then go extract/test the whole 
    block.  Thus this routine contains two small(er) loops within a very 
    large outer loop:  the first of the small ones reads a block of files 
    from the central directory; the second extracts or tests each file; and 
    the outer one loops over blocks.  There's some file-pointer positioning 
    stuff in between, but that's about it.  Btw, it's because of this jump- 
    ing around that we can afford to be lenient if an error occurs in one of 
    the member files:  we should still be able to go find the other members, 
    since we know the offset of each from the beginning of the zipfile. 
  ---------------------------------------------------------------------------*/
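The loop structure that comment describes looks roughly like the sketch below. It only mirrors the shape (an outer loop over blocks, one inner step that gathers a block of directory entries, a second that extracts them); it is not how unzip does its I/O, because Python's zipfile has already read the whole central directory by the time infolist() returns. The function name extract_in_blocks and the block_size default are assumptions made for the example.

```python
import zipfile
from itertools import islice


def extract_in_blocks(archive_path, wanted, block_size=64):
    """Yield (name, data) for each wanted member, one directory block at a time.

    Hypothetical sketch of the control flow only; block_size is arbitrary.
    """
    with zipfile.ZipFile(archive_path) as zf:
        entries = iter(zf.infolist())
        while True:                                   # outer loop: over blocks
            # First inner step: collect a block of central-directory entries.
            block = list(islice(entries, block_size))
            if not block:
                break
            # Second inner step: extract the members in this block that match.
            for info in block:
                if info.filename in wanted:
                    yield info.filename, zf.read(info)
```

With a single wanted name, every block before the one holding that entry is gathered and then skipped without touching member data.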

The format itself is mostly derived from PKWARE's implementation, and is summarized in programming-information text files. According to those, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to look up the target file.

Now, if you take the time to read the source code, you'll discover that unzip reads buffers of 8192 bytes (look for INBUFSIZ). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip file you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). (Java was bigger than that 20 years ago, and hasn't shrunk.) Extracting that log from the zip file (chosen since it wouldn't be at the end of an alphabetically sorted index, and likely not in the first block read from the central directory) gave this from strace for the lseek calls:

lseek(3, -2252, SEEK_CUR)               = 1267
lseek(3, 120463360, SEEK_SET)           = 120463360
lseek(3, 120468731, SEEK_SET)           = 120468731
lseek(3, 120135680, SEEK_SET)           = 120135680
lseek(3, 270336, SEEK_SET)              = 270336
lseek(3, 120463360, SEEK_SET)           = 120463360
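To make the buffer-size effect visible, here is a small check of the absolute (SEEK_SET) offsets from that trace against an 8192-byte block size. The helper name split_offset is made up for the example; the relative SEEK_CUR call is left out because its argument is not a file position.

```python
INBUFSIZ = 8192  # buffer size mentioned above

# Absolute (SEEK_SET) offsets copied from the strace output.
offsets = [120463360, 120468731, 120135680, 270336, 120463360]


def split_offset(offset, bufsize=INBUFSIZ):
    """Return (block_index, remainder) of offset with respect to bufsize."""
    return divmod(offset, bufsize)


for off in offsets:
    block, rem = split_offset(off)
    note = "block-aligned" if rem == 0 else f"{rem} bytes into the block"
    print(f"{off:>10}  block {block:>6}  {note}")
```

All but one of the offsets are exact multiples of 8192 (blocks 14705, 14665 and 33); the exception, 120468731, lies 5371 bytes into block 14705.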

As usual with benchmarks, YMMV.
