Random access to gzipped files?

Yes, you can access a gzip file randomly by reading the entire thing sequentially once and building an index. See examples/zran.c in the zlib distribution.
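
Here is a minimal usage sketch of zran's API, assuming the layout of recent zlib distributions, where the example is split into zran.c and zran.h exposing deflate_index_build(), deflate_index_extract(), and deflate_index_free() (older releases kept everything in one file under different names, so check the version you have). Compile it together with zran.c and link with -lz:

```c
/* Build an index over a gzip file, then read from an arbitrary
   uncompressed offset. Sketch only; API names per recent zlib. */
#include <stdio.h>
#include <stdlib.h>
#include "zran.h"

#define SPAN 1048576L           /* target distance between access points */

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s file.gz offset\n", argv[0]);
        return 1;
    }
    FILE *in = fopen(argv[1], "rb");
    if (in == NULL) { perror(argv[1]); return 1; }

    /* One full sequential pass; an access point is saved roughly
       every SPAN bytes of uncompressed data. */
    struct deflate_index *index = NULL;
    int points = deflate_index_build(in, SPAN, &index);
    if (points < 0) { fprintf(stderr, "index build failed\n"); return 1; }
    fprintf(stderr, "built index with %d access points\n", points);

    /* Jump to the requested offset without re-reading what precedes it. */
    unsigned char buf[16384];
    ptrdiff_t got = deflate_index_extract(in, index, atoll(argv[2]),
                                          buf, sizeof(buf));
    if (got < 0) { fprintf(stderr, "extract failed\n"); return 1; }
    fwrite(buf, 1, (size_t)got, stdout);

    deflate_index_free(index);
    fclose(in);
    return 0;
}
```

The one-time sequential pass is the price of admission; after that, any offset is reached by seeking to the nearest saved access point and inflating forward from there.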

If you are in control of creating the gzip file, you can optimize it for this purpose by building in random-access entry points and constructing the index while compressing.

You can also create a gzip file with built-in markers: call zlib's deflate() with Z_SYNC_FLUSH followed by Z_FULL_FLUSH, which inserts two markers and makes the next block independent of all previous data. This reduces compression, but not by much if you don't do it too often; once every megabyte, for example, should have very little impact. You can then find an entry point by searching for the nine-byte marker 00 00 ff ff 00 00 00 ff ff (with a much less probable false positive than bzip2's six-byte marker).
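
A sketch of the writing side under those assumptions: a gzip compressor that emits the two-flush marker once per megabyte of input. The Z_FULL_FLUSH issued right after Z_SYNC_FLUSH, with no intervening input, is what produces the combined nine-byte pattern and discards the dictionary:

```c
#include <stdio.h>
#include <zlib.h>

#define SPAN (1 << 20)          /* one entry point per MiB of input */

int main(void) {
    z_stream strm = {0};
    static unsigned char in[SPAN], out[1 << 16];

    /* windowBits 31 = 15-bit window plus 16 to request a gzip wrapper */
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 31, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    int last = 0;
    do {
        size_t got = fread(in, 1, SPAN, stdin);
        last = feof(stdin);
        strm.next_in = in;
        strm.avail_in = (uInt)got;

        /* Z_SYNC_FLUSH emits 00 00 ff ff; the Z_FULL_FLUSH that follows
           (with no new input) emits 00 00 00 ff ff and resets the
           dictionary, making the next block independent. On the last
           chunk, Z_FINISH writes the gzip trailer instead. */
        int flushes[2] = { last ? Z_FINISH : Z_SYNC_FLUSH, Z_FULL_FLUSH };
        int n = last ? 1 : 2;
        for (int i = 0; i < n; i++)
            do {
                strm.next_out = out;
                strm.avail_out = sizeof(out);
                deflate(&strm, flushes[i]);
                fwrite(out, 1, sizeof(out) - strm.avail_out, stdout);
            } while (strm.avail_out == 0);
    } while (!last);

    deflateEnd(&strm);
    return 0;
}
```

A reader can then scan the output for 00 00 ff ff 00 00 00 ff ff and start a raw inflate (windowBits of -15) at the byte immediately after any such marker, since the data there is byte-aligned and needs no prior history.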


You can't do that with gzip, but you can with bzip2, which is block-based instead of stream-based: this is how Hadoop's DFS splits huge files and parallelizes their reading across different mappers in its MapReduce implementation. It might make sense to re-compress your files as bz2 so you can take advantage of this; it would be easier than some ad-hoc way of chunking up the files.
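
For illustration only (this is not how Hadoop's splitter is actually written), here is a sketch that locates candidate bzip2 block boundaries by scanning for the 48-bit block magic 0x314159265359. bzip2 blocks are bit-aligned rather than byte-aligned, so the scan slides one bit at a time; note the magic can also occur by chance inside compressed data, which is the six-byte false-positive issue mentioned above:

```c
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s file.bz2\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (f == NULL) { perror(argv[1]); return 1; }

    const uint64_t MAGIC = 0x314159265359ULL;  /* BCD digits of pi */
    const uint64_t MASK = (1ULL << 48) - 1;
    uint64_t window = 0, nbits = 0;
    int c;
    while ((c = getc(f)) != EOF)
        for (int b = 7; b >= 0; b--) {
            /* shift one bit into a sliding 48-bit window */
            window = ((window << 1) | ((unsigned)(c >> b) & 1)) & MASK;
            if (++nbits >= 48 && window == MAGIC)
                printf("candidate block at bit offset %llu\n",
                       (unsigned long long)(nbits - 48));
        }
    fclose(f);
    return 0;
}
```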

I found the patches that implement this in Hadoop here: https://issues.apache.org/jira/browse/HADOOP-4012

Here's another post on the topic: "BZip2 file read in Hadoop"

Perhaps browsing the Hadoop source code would give you an idea of how to read bzip2 files by blocks.