Split equivalent for gzip files in Python

I don't believe that split works the way you think it does. It doesn't split the gzip file into smaller gzip files, i.e. you can't call gunzip on the individual files it creates. It literally breaks the data up into smaller chunks, and if you want to gunzip it, you have to concatenate all the chunks back together first. So, to emulate that behavior in Python, we'd do something like:

infile_name = "file.dat.gz"

chunk = 50*1024*1024 # 50MB

with open(infile_name, 'rb') as infile:
    for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), b'')):
        print(n, len(raw_bytes))  # part number and bytes actually read (the last part is usually smaller)
        with open('{}.part-{}'.format(infile_name[:-3], n), 'wb') as outfile:
            outfile.write(raw_bytes)

In practice, to use less memory, we'd build each output chunk from several smaller reads instead of holding a full 50MB chunk in memory at once.
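As a rough sketch of that idea, the version below copies the input in small buffers and starts a new part once the current one reaches the size limit. The names `split_raw`, `part_size` and `buf_size` are made up for illustration, and the parts are still raw byte slices, so they only gunzip after being concatenated back together.

```python
def split_raw(infile_name, part_size=50 * 1024 * 1024, buf_size=1024 * 1024):
    """Split a file into raw parts of at most part_size bytes,
    holding at most buf_size bytes in memory at a time.
    (Hypothetical helper; names and defaults are assumptions.)"""
    part_names = []
    with open(infile_name, 'rb') as infile:
        n = 0
        while True:
            buf = infile.read(min(buf_size, part_size))
            if not buf:
                break  # end of input, so no empty trailing part is created
            # Parts keep the full input name here, plus a part number.
            part_name = '{}.part-{}'.format(infile_name, n)
            with open(part_name, 'wb') as outfile:
                written = 0
                while buf:
                    outfile.write(buf)
                    written += len(buf)
                    remaining = part_size - written
                    if remaining <= 0:
                        break  # this part is full, start the next one
                    buf = infile.read(min(buf_size, remaining))
            part_names.append(part_name)
            n += 1
    return part_names
```

Calling `split_raw("file.dat.gz")` would write `file.dat.gz.part-0`, `file.dat.gz.part-1`, and so on; `cat file.dat.gz.part-* | gunzip` gets the data back, as long as the shell sorts the parts in the right order (which plain glob ordering stops doing after ten parts).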

We might be able to break the file into smaller files that we can individually gunzip and still hit our target size. Using something like an io.BytesIO stream, we could gunzip the file and gzip it back into that memory stream until it reached the target size, then write it out and start a new BytesIO stream.

With compressed data, you have to measure the size of the output, not the size of the input, as we can't predict how well the data will compress.
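Here is a minimal sketch of that approach, assuming the input name ends in `.gz`. The names `split_gzip`, `target_size` and `read_size` are invented for this example. It decompresses the input, recompresses it into a BytesIO buffer, and checks the size of the compressed output after each write. The compressor buffers data internally, so the measured size lags behind and parts can come out somewhat larger than the target.

```python
import gzip
import io

def split_gzip(infile_name, target_size=50 * 1024 * 1024, read_size=64 * 1024):
    """Re-compress a gzip file into parts that each gunzip on their own.
    (Hypothetical helper; names and defaults are assumptions.)"""
    part_names = []
    with gzip.open(infile_name, 'rb') as infile:
        n = 0
        data = infile.read(read_size)  # decompressed bytes
        while data:
            buf = io.BytesIO()
            # Closing the GzipFile writes the gzip trailer but leaves buf open.
            with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
                while data:
                    gz.write(data)
                    data = infile.read(read_size)
                    # Measure the compressed output so far, not the input.
                    if buf.tell() >= target_size:
                        break
            # Assumes infile_name ends in ".gz"; [:-3] strips that suffix.
            part_name = '{}.part-{}.gz'.format(infile_name[:-3], n)
            with open(part_name, 'wb') as outfile:
                outfile.write(buf.getvalue())
            part_names.append(part_name)
            n += 1
    return part_names
```

Each resulting `file.dat.part-N.gz` can be passed to gunzip by itself, and concatenating the decompressed parts in order gives back the original data. The trade-off is that the whole file gets decompressed and recompressed, which is much slower than a plain byte split.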

Tags:

Python

Split

Gzip