How to add a huge file to an archive and delete it in parallel

An uncompressed tar archive of a single file consists of a header, the file, and a trailing pad. So your principal problem is how to add 512 bytes of header to the start of your file. You can start by creating the wanted result with just the header (dd's default block size is 512 bytes, so count=1 keeps exactly the header):

tar cf - bigfile | dd count=1 >bigarchive.tar

Then copy the first 10 GiB of your file. For simplicity we assume your dd can read and write 1 GiB at a time:

dd count=10 bs=1G if=bigfile >>bigarchive.tar

We now deallocate the copied data from the original file:

fallocate --punch-hole -o 0 -l 10GiB bigfile

This replaces the data with sparse zeroes that take no space on the filesystem. Continue in this manner, adding skip=10 to the next dd and moving the fallocate starting offset to -o 10GiB, then skip=20 and -o 20GiB, and so on. At the very end, append NUL bytes to pad the file data out to a multiple of 512 bytes, followed by two 512-byte blocks of NULs, which mark the end of a tar archive.
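The whole procedure can be sketched as a loop. This is only a sketch under some assumptions: GNU dd (for iflag=fullblock), the util-linux fallocate, a filesystem that supports hole punching, and a tar whose header for this file fits in a single 512-byte block (long names or pax-format archives need more). The function name tar_in_place and its arguments are made up for illustration.

```shell
# tar_in_place SRC DST CHUNK_BYTES  (hypothetical helper, not a standard tool)
# Builds DST as a single-member tar archive of SRC, freeing each chunk of
# SRC as soon as it has been copied. SRC keeps its size but ends up all holes.
tar_in_place() {
    src=$1 dst=$2 chunk=$3
    # GNU wc -c reads the size from the inode for regular files, so this is cheap.
    size=$(( $(wc -c <"$src") ))

    # Step 1: keep only the 512-byte header that tar generates for SRC.
    tar cf - "$src" | dd bs=512 count=1 iflag=fullblock 2>/dev/null >"$dst" || return 1

    # Step 2: copy one chunk, then punch it out of the original.
    offset=0
    while [ "$offset" -lt "$size" ]; do
        len=$(( size - offset ))
        [ "$len" -gt "$chunk" ] && len=$chunk
        dd if="$src" bs="$chunk" skip=$(( offset / chunk )) count=1 \
           iflag=fullblock 2>/dev/null >>"$dst" || return 1
        fallocate --punch-hole -o "$offset" -l "$len" "$src" || return 1
        offset=$(( offset + chunk ))
    done

    # Step 3: pad the data to a multiple of 512 bytes, then add the two
    # zero-filled 512-byte blocks that end a tar archive.
    pad=$(( (512 - size % 512) % 512 + 1024 ))
    dd if=/dev/zero bs=1 count="$pad" 2>/dev/null >>"$dst" || return 1
}
```

For the case above you would call it as tar_in_place bigfile bigarchive.tar 1073741824 and check the result with tar tf bigarchive.tar before removing bigfile. If the loop is interrupted, the already punched part of bigfile is gone from the original, so test on a small copy first.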


If your filesystem does not support fallocate you can do something similar, but starting at the end of the file. First copy the last 10 GiB of the file to an intermediate file called, say, part8. Then use the truncate command to reduce the size of the original file by the same amount. Proceed similarly until you have 8 files of 10 GiB each. You can then concatenate the header and part1 to bigarchive.tar, then remove part1, then append part2 and remove it, and so on.
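A rough sketch of this variant, assuming GNU dd and the coreutils truncate command. The function name tar_via_parts and the partN file names are only illustrative. Note that the header has to be generated before the first truncate, because it records the original file size.

```shell
# tar_via_parts SRC DST CHUNK_BYTES  (hypothetical helper, not a standard tool)
# Peels chunks off the end of SRC into part files, then appends them to DST
# in order. At most about one chunk of extra disk space is needed at any time.
tar_via_parts() {
    src=$1 dst=$2 chunk=$3
    size=$(( $(wc -c <"$src") ))
    n=$(( (size + chunk - 1) / chunk ))    # number of parts, last may be short

    # The header must be taken while SRC still has its full size.
    tar cf - "$src" | dd bs=512 count=1 iflag=fullblock 2>/dev/null >"$dst" || return 1

    # Work backwards: save the last chunk, then cut it off the original.
    i=$n
    while [ "$i" -ge 1 ]; do
        dd if="$src" bs="$chunk" skip=$(( i - 1 )) count=1 \
           iflag=fullblock 2>/dev/null >"part$i" || return 1
        truncate -s $(( (i - 1) * chunk )) "$src" || return 1
        i=$(( i - 1 ))
    done

    # Work forwards: append each part to the archive and delete it.
    i=1
    while [ "$i" -le "$n" ]; do
        cat "part$i" >>"$dst" || return 1
        rm "part$i" || return 1
        i=$(( i + 1 ))
    done

    # Pad the data to a multiple of 512 bytes and add the two end-of-archive blocks.
    pad=$(( (512 - size % 512) % 512 + 1024 ))
    dd if=/dev/zero bs=1 count="$pad" 2>/dev/null >>"$dst" || return 1
}
```

When this finishes, the original file is left behind with a length of zero and can simply be removed. The part files are written to the current directory, so run it somewhere on the same filesystem as the original.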


Deleting a file does not necessarily do what you think it does. That's why in UNIX-like systems the system call is called unlink and not delete. From the manual page:

unlink() deletes a name from the filesystem.  If that name was the last
link to a file and no processes have the file open, the file is deleted
and the space it was using is made available for reuse.

If the name was the last link to a file but any processes still have
the file open, the file will remain in existence until the last file
descriptor referring to it is closed.

As a consequence, as long as the data compressor / archiver is reading from the file, that file remains in existence, occupying space in the filesystem.
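You can see this behaviour for yourself with a few lines of shell. This is just a small demonstration; the function name unlink_demo and the use of file descriptor 3 are arbitrary choices.

```shell
# unlink_demo: show that an unlinked file stays readable (and keeps its
# disk space) for as long as some process still has it open.
unlink_demo() {
    demo=$(mktemp) || return 1
    printf 'still here\n' >"$demo"
    exec 3<"$demo"      # open the file on descriptor 3
    rm "$demo"          # remove the name; the data is not freed yet
    cat <&3             # prints "still here"
    exec 3<&-           # closing the last descriptor frees the space
}
```

Running unlink_demo prints the line even though the file name was removed before the read. The same thing happens with an 80 GiB file that tar or a compressor still has open: the space only comes back once that process closes the file or exits.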