Reading partially downloaded gzip with an offset

gzip doesn’t produce block-compressed files (see RFC 1952 for the gory details), so it’s not suitable for random access on its own. You can start reading a stream from the beginning and stop whenever you want, which is why your curl -r 0-2024 example works, but you can’t pick a stream up in the middle unless you have a complementary file to provide the missing data (such as the index files created by gztool).
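
To illustrate (with a hypothetical URL): a range taken from the start of the stream decompresses fine until the data runs out, while a range taken from the middle is rejected at once, because the gzip header and the decompressor state are both missing. The second command typically fails with "gzip: stdin: not in gzip format":

$ curl -s -r 0-2024 https://example.com/db.sql.gz | gunzip | head
$ curl -s -r 1000000-1002024 https://example.com/db.sql.gz | gunzip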

To achieve what you’re trying to do, you need to use block compression of some sort, e.g. bgzip (which produces files that can be decompressed by plain gzip) or bzip2, and do some work on the receiving end to determine where the block boundaries lie. Peter Cock has written a couple of interesting posts on the subject: BGZF - Blocked, Bigger & Better GZIP!, Random access to BZIP2?
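
As a minimal sketch, assuming bgzip (shipped with htslib) is installed, and with an illustrative file name:

$ bgzip -c db.sql > db.sql.gz       # block-compressed output, yet still a valid gzip stream
$ gunzip -t db.sql.gz               # plain gzip tools accept it unchanged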


Just FWIW: gzip files can be accessed randomly, if a suitable index file has been created beforehand...

I've developed a command-line tool that can quickly and (almost) randomly access a gzip file if an index is provided (and if one is not provided, it is created automatically):

https://github.com/circulosmeos/gztool
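
For a local file, creating the index is a one-liner (a sketch; the file name is illustrative):

$ gztool -i db.sql.gz               # writes the index alongside the file, as db.sql.gzi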

gztool can be used to access chunks of the original gzip file, if those chunks are retrieved starting at the specific byte points the index records (minus 1 byte to be safe, because gzip is a stream of bits, not bytes), or after them.

For example, if an index point starts at compressed byte 1508611 of the gzip file (gztool -ll index.gzi provides this data) and we want 1M of compressed bytes after that:

$ curl -r 1508610-2508611 https://example.com/db/backups/db.sql.gz > chunk.gz
  • Note that chunk.gz will occupy only the chunk size on disk!
  • Also note that it is not a valid gzip file on its own, as it is incomplete.
  • Also take into account that we retrieved from the desired index-point position minus 1 byte.

Now the complete index must also be retrieved. It only has to be created once beforehand: for example, gztool -i *.gz creates indexes for all your already-gzipped files, and gztool -c * both compresses files and creates their indexes. Note that indexes are ~0.3% of the gzip size (or much smaller if gztool compresses the data itself).

$ curl https://example.com/db/backups/db.sql.gzi -o chunk.gzi
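
With both files local, the index points can be listed to locate the byte positions used below (the exact output layout may vary between gztool versions):

$ gztool -ll chunk.gzi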

And now the extraction can be done with gztool. The uncompressed byte corresponding to compressed byte 1508610 (or a byte past it) must be known; the index shows this info with gztool -ll. Let's suppose it is byte 9009009. Alternatively, the uncompressed byte we want may simply be past the first index point contained in chunk.gz; let's suppose that byte would also be 9009009 in this case.

$ gztool -n 1508610 -b 9009009 chunk.gz > extracted_chunk.sql

gztool will stop extracting data when the chunk.gz file ends.

Maybe tricky, but it runs without changing the compression method or the already-compressed files. Indexes do need to be created for them, though.
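
Putting the whole flow together, a small helper script could look like the following. This is only a sketch: the URL, the .gzi naming convention, and the positional parameters are assumptions for illustration.

#!/bin/sh
# usage: ./extract_chunk.sh INDEX_POINT UNCOMPRESSED_BYTE COMPRESSED_LENGTH
URL=https://example.com/db/backups/db.sql.gz
START=$(($1 - 1))                        # retrieve from the index point minus 1 byte
END=$(($1 + $3))
curl -s -r "$START-$END" "$URL" > chunk.gz
curl -s "${URL%.gz}.gzi" -o chunk.gzi    # fetch the matching index
gztool -n "$START" -b "$2" chunk.gz      # extraction goes to stdout

Running $ ./extract_chunk.sh 1508611 9009009 1000000 > extracted_chunk.sql reproduces the example above.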


NOTE: Another way to do the extraction without the -n parameter is to pad the gzip file with sparse zeroes. For this example, that is done with a dd command before the first curl that retrieves chunk.gz:

$ dd if=/dev/zero of=chunk.gz seek=1508609 bs=1 count=0
$ curl -r 1508610-2508611 https://example.com/db/backups/db.sql.gz >> chunk.gz
$ curl https://example.com/db/backups/db.sql.gzi -o chunk.gzi

This way, the first 1508609 bytes of the file are zeroes, but they don't occupy space on disk. Without seek in the dd command, the zeroes would all be written to disk, which would also work for gzip, but would waste space. The gztool command then doesn't need the -n parameter. The zeroed data is never actually needed: because the index exists, gztool uses it to jump to the index point just before uncompressed byte position 9009009, so all the preceding data is simply ignored:

$ gztool -b 9009009 chunk.gz > extracted_chunk.sql
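
It is easy to verify that the padding really is sparse by comparing the file's apparent size with the blocks actually allocated on disk:

$ ls -l chunk.gz                    # apparent size: zero padding plus the chunk
$ du -h chunk.gz                    # actual allocation: roughly the chunk alone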
