Time to zip very large (100G) files

You can adjust gzip's speed/compression trade-off using --fast, --best, or -# where # is a number between 1 and 9 (1 is fastest with the least compression, 9 is slowest with the most compression). By default gzip runs at level 6.
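
For example, assuming the archive has already been created with tar (the file name here is just a placeholder), a fast but lighter compression pass would be:

gzip -1 archive.tar

and the slowest, tightest compression would be:

gzip -9 archive.tar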


The reason tar takes so little time compared to gzip is that there's very little computational overhead in copying your files into a single file (which is what it does). gzip, on the other hand, is actually running compression algorithms to shrink the tar file.

The problem is that gzip is constrained (as you discovered) to a single thread.

Enter pigz, which can use multiple threads to perform the compression. An example of how to use this would be:

tar -c --use-compress-program=pigz -f tar.file dir_to_zip

There is a nice succinct summary of the --use-compress-program option over on a sister site.
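
If you want to control how many threads pigz uses, you can also pipe tar into it explicitly and pass pigz's -p option (the archive and directory names below are just placeholders):

tar -cf - dir_to_zip | pigz -p 8 > tar.file.gz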


I seem to be using a single CPU at approximately 100%.

That implies there isn't an I/O performance issue but that the compression is only using one thread (which will be the case with gzip).
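
A quick way to confirm that only one core is busy is to watch per-CPU utilisation while the compression runs, for instance with mpstat from the sysstat package (pressing 1 in top shows the same per-core breakdown):

mpstat -P ALL 1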

If you manage to get the access/agreement needed to install other tools, then 7zip also supports multiple threads to take advantage of multi-core CPUs, though I'm not sure whether that extends to the gzip format as well as its own.
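
As a rough illustration (the archive name and directory are placeholders, and the exact behaviour depends on the 7zip build you have installed), multi-threaded compression to 7zip's own format looks something like:

7z a -mmt=on archive.7z dir_to_zip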

If you are stuck with just gzip for the time being and have multiple files to compress, you could try compressing them individually; that way you'll make more use of a multi-core CPU by running more than one process in parallel (see the sketch below). Be careful not to overdo it, though: as soon as you get anywhere near the capacity of your I/O subsystem, performance will drop off precipitously (to lower than if you were using one process/thread) as the latency of head movements becomes a significant bottleneck.
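
A minimal way to do this with standard tools (the file pattern and process count are just examples; -P is the GNU xargs option for running jobs in parallel):

find . -maxdepth 1 -name '*.log' -print0 | xargs -0 -P 4 -n 1 gzip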

Tags: Linux, Gzip