Compress a large number of large files fast

The first step is to figure out what the bottleneck is: is it disk I/O, network I/O, or CPU?

If the bottleneck is disk I/O, there isn't much you can do. Make sure that the disks don't serve many parallel requests, as that can only decrease performance.

If the bottleneck is network I/O, run the compression process on the machine where the files are stored: running it on a machine with a beefier CPU only helps if the CPU is the bottleneck.
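For example, if the files sit on a remote host you can reach over ssh, compress there and only ship the finished archive over the network; a minimal sketch, where the host name nas and the path /var/log/myapp are placeholders:

# Compress on the file server, stream the already-compressed archive to the local machine
ssh nas 'tar -czf - /var/log/myapp' > logs.tar.gz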

If the bottleneck is the CPU, then the first thing to consider is using a faster compression algorithm. Bzip2 isn't necessarily a bad choice (its main weakness is decompression speed), but you could use gzip and sacrifice some size for compression speed, or try out other formats such as lzop or lzma. You might also tune the compression level: bzip2 defaults to -9 (maximum block size, so maximum compression, but also the longest compression time); set the environment variable BZIP2 to a value like -3 to try compression level 3.

This thread and this thread discuss common compression algorithms; in particular, the blog post cited by derobert gives some benchmarks which suggest that gzip -9 or bzip2 at a low level might be a good compromise compared to bzip2 -9. Another benchmark, which also includes lzma (the algorithm behind 7zip, so you might use 7z instead of tar --lzma), suggests that lzma at a low level can reach the bzip2 compression ratio faster. Just about any choice other than bzip2 will improve decompression time.

Keep in mind that the compression ratio depends on the data, and the compression speed depends on the version of the compression program, on how it was compiled, and on the CPU it runs on.
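For concreteness, here is a minimal sketch of how to pick the algorithm and level when building the archive; the path /var/log/myapp and the archive names are placeholders:

# bzip2 at level 3 instead of the default -9 (bzip2 reads extra options from the BZIP2 variable)
BZIP2=-3 tar -cjf logs.tar.bz2 /var/log/myapp

# gzip at a low level, piping tar's output through the compressor explicitly
tar -cf - /var/log/myapp | gzip -3 > logs.tar.gz

# xz (the lzma algorithm) at its lowest preset
tar -cf - /var/log/myapp | xz -1 > logs.tar.xz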

Another option, if the bottleneck is the CPU and you have multiple cores, is to parallelize the compression. There are two ways to do that. One, which works with any compression tool, is to compress the files separately (either individually or in a few groups) and use parallel to run the archiving/compression commands concurrently; this may reduce the compression ratio but speeds up retrieval of an individual file. The other is to use a parallel implementation of the compression tool; this thread lists several.
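A minimal sketch of the first approach, assuming GNU parallel is installed; the *.log pattern and the directory names are placeholders:

# Compress each file separately, one gzip process per CPU core
parallel gzip ::: *.log

# Or build one compressed archive per directory
parallel 'tar -czf {}.tar.gz {}' ::: dir1 dir2 dir3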


You can install pigz, a parallel implementation of gzip, and use tar with multi-threaded compression. For example:

tar -I pigz -cf file.tar.gz *

Where the -I option is:

-I, --use-compress-program PROG
  filter through PROG
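If you want to leave some cores free for other work, you can also cap pigz's thread count; a sketch, where the count 4 is arbitrary (by default pigz uses all available cores):

tar -cf - * | pigz -p 4 > file.tar.gz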

Of course, if your NAS doesn't have multiple cores or a powerful CPU, you are limited by the CPU power anyway.

The speed of the hard disk or array on which the VM and the compression are running can also be a bottleneck.


By far the fastest and most effective way of compressing data is to generate less of it.

What kind of logs are you generating? 200 GB daily sounds like quite a lot (unless you're Google or some ISP...). Consider that 1 MB of text is about 500 pages, so you're generating the equivalent of 100 million pages of text per day; you'll fill the Library of Congress in a week.

Look over your log data to see whether you can reduce it somehow and still get what you need from the logs, for example by turning down the log level or using a terser log format. Or, if you are using the logs for statistics, process the statistics on the fly and dump a file with the summary, then filter the logs before compressing them for storage.
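As an illustration only (the file name app.log and the ' DEBUG ' pattern are hypothetical and need adapting to your log format), filtering before compression can be as simple as:

# Drop debug-level lines, then compress what remains at a low level
grep -v ' DEBUG ' app.log | gzip -3 > app.log.gz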