On-the-fly stream compression that doesn't spill over into hardware resources?

dd reads and writes data one block at a time, and it only ever has one block outstanding. So

valgrind dd if=/dev/zero status=progress of=/dev/null bs=1M

shows that dd uses approximately 1MB of memory. You can play around with the block size, and drop valgrind, to see the effect on dd’s speed.

When you pipe into gzip, dd simply slows down to match gzip’s speed. Its memory usage doesn’t increase, nor does it cause the kernel to store the buffers on disk (the kernel doesn’t know how to do that, except via swap). A broken pipe only happens when one of the ends of the pipe dies; see signal(7) and write(2) for details.

Thus

dd if=... iflag=fullblock bs=1M | gzip -9 > ...

is a safe way to do what you’re after.

When piping, the writing process ends up being blocked by the kernel if the reading process isn’t keeping up. You can see this by running

strace dd if=/dev/zero bs=1M | (sleep 60; cat > /dev/null)

You’ll see that dd reads 1MB, then issues a write() which sits there waiting for one minute while sleep runs. That’s how both sides of a pipe balance out: the kernel blocks writes if the writing process is too fast, and it blocks reads if the reading process is too fast.


Technically you don't even need dd:

gzip < /dev/drive > drive.img.gz
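Restoring works the same way in reverse (here /dev/drive is still a placeholder for your actual device):

```shell
# Restore the compressed image back onto the device (destructive!).
# /dev/drive stands in for the real target device.
gunzip -c drive.img.gz > /dev/drive
```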

If you do use dd, always go with a larger-than-default block size like dd bs=1M, or suffer syscall hell: dd's default block size is 512 bytes, and since it read()s and write()s every block, that's 4096 syscalls per MiB copied, which is far too much overhead.
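A quick way to feel the difference (a rough sketch; sizes are chosen just for illustration, and /dev/zero to /dev/null isolates the syscall cost from disk speed):

```shell
# Copy the same 256 MiB with 512-byte blocks and with 1 MiB blocks.
# The first needs ~1M syscalls (524288 reads + 524288 writes),
# the second only ~512 (256 reads + 256 writes).
time dd if=/dev/zero of=/dev/null bs=512 count=524288 status=none
time dd if=/dev/zero of=/dev/null bs=1M count=256 status=none
```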

gzip -9 uses a LOT more CPU with very little to show for it. If gzip is slowing you down, lower the compression level, or use a different (faster) compression method.
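For example, on moderately compressible data the time gap dwarfs the size gap (a rough sketch; the exact numbers depend entirely on your data):

```shell
# Generate ~7 MB of compressible data, then compare the fastest and
# best compression levels. -9 takes noticeably longer for a small win.
seq 1 1000000 > sample.txt
time gzip -1 -c sample.txt > fast.gz
time gzip -9 -c sample.txt > best.gz
ls -l fast.gz best.gz
rm -f sample.txt fast.gz best.gz
```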

If you're doing file-based backups instead of dd images, you could add some logic that decides whether to compress at all (there's no point for file types that are already compressed). dar (a tar alternative) is one example with options to do so.

If your free space reads back as all zeroes (because it's an SSD that reliably returns zeroes after TRIM, and you ran fstrim and dropped caches), you can also use dd's conv=sparse flag to create an uncompressed, loop-mountable sparse image that uses no disk space for the zeroed areas. This requires the image file to be on a filesystem that supports sparse files.
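A sketch, with /dev/sdX as a placeholder for the source device; ls shows the apparent size while du shows what's actually allocated:

```shell
# conv=sparse makes dd seek over all-zero output blocks instead of
# writing them, leaving holes in the image file.
dd if=/dev/sdX of=disk.img bs=1M conv=sparse
ls -lh disk.img   # apparent size: as large as the whole device
du -h disk.img    # allocated size: only the non-zero areas
```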

Alternatively, for some filesystems there exist programs that can image only the used areas.


There are no negative implications other than performance: the pipe has a buffer (64 KiB by default on Linux), and once it fills, a write to the pipe will simply block until gzip reads some more data.
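You can see the buffer at work: 64 KiB fits, one more block doesn't (this sketch assumes Linux's default 64 KiB pipe buffer):

```shell
# sleep never reads from the pipe, yet the first dd exits immediately:
# its 64 KiB fits entirely in the pipe buffer. The second dd has 1 KiB
# too much, so it blocks until sleep exits, then dies of a broken pipe.
# (The shell itself still waits ~2s in both cases, for sleep.)
dd if=/dev/zero bs=1k count=64 | sleep 2
dd if=/dev/zero bs=1k count=65 | sleep 2
```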