How can I work with Gzip files which contain extra data?

I had a similar problem in the past. I wrote a new module that works better with streams. You can try that out and see if it works for you.

This is a bug. The quality of the gzip module in Python falls far short of the quality that should be required in the Python standard library.

The problem here is that the gzip module assumes that the file is a stream of gzip-format files. At the end of the compressed data, it starts from scratch, expecting a new gzip header; if it doesn't find one, it raises an exception. This is wrong.

Of course, it is valid to concatenate two gzip files, eg:

echo testing > test.txt
gzip test.txt
cat test.txt.gz test.txt.gz > test2.txt.gz
zcat test2.txt.gz
# testing
# testing

The gzip module's error is that it should not raise an exception if there's no gzip header the second time around; it should simply end the file. It should only raise an exception if there's no header the first time.

There's no clean workaround without modifying the gzip module directly; if you want to do that, look at the bottom of the _read method. It should set another flag, eg. reading_second_block, to tell _read_gzip_header to raise EOFError instead of IOError.

There are other bugs in this module. For example, it seeks unnecessarily, causing it to fail on nonseekable streams, such as network sockets. This gives me very little confidence in this module: a developer who doesn't know that gzip needs to function without seeking is badly unqualified to implement it for the Python standard library.

How can I work with Gzip files which contain extra data?

Tags:

Python

Gzip

Related

Recent Posts