Which archive file formats provide recovery protection against file corruption?

Given that damage to the directory part of an archive could render the entire archive useless, your best bet is to add a separate step to your backup process that generates so-called parity files. If a data block in the original file gets damaged, it can be reconstructed by combining data from the parity file with the valid blocks of the original file.

The variable there is how much damage you'd like to be able to repair. If you only want to detect a single bit flip, one parity bit is enough; correcting it, or surviving the loss of whole disk sectors, obviously costs more.

There's a big body of theory behind this (see forward error correction) and it is widely used in practice. For example, this is how CDs withstand a certain degree of scratching and how cell phones maintain reasonable call quality over lossy connections.

Long story short, take a look at .par files.
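
For example, with the par2 command-line tool (the 10% redundancy level here is just an illustration; tune -r to how much damage you want to survive):

par2 create -r10 archive.tar    # writes archive.tar.par2 plus recovery volumes
par2 verify archive.tar.par2    # check the original file against the parity data
par2 repair archive.tar.par2    # reconstruct damaged blocks, if any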


Bup [1] backs up things and can add parity redundancy, making bit-rot extremely unlikely to cause data loss. Catastrophic disk failure is still a thing, though, so we can combine it with git-annex.
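
A minimal bup session might look like this (bup generates its recovery blocks with par2, so par2 has to be installed for the fsck step; the paths are illustrative):

bup init                           # create the default repository in ~/.bup
bup index ~/documents              # scan the files to back up
bup save -n documents ~/documents  # save a snapshot under the name "documents"
bup fsck -g                        # generate par2 recovery blocks for the pack files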

git-annex [2] manages files stored in many repositories, which can live on your computer, thumb drives, an ssh login, some cloud services, or a bup backup repository [3], letting file data flow pretty much transparently, on request or automatically, into whichever repository you've set up. It is also a crowd-funded free and open source software project, written in Haskell, with versions running on many platforms, including Linux, Mac, Windows, and Android.
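
A sketch of how the pieces fit together, with made-up names and paths (the bup special remote does take a buprepo= parameter, but server:/srv/bup is only an example):

git init ~/annex && cd ~/annex
git annex init "laptop"
git annex add big-video.mpg        # hand the file's content over to the annex
git commit -m "add big-video.mpg"
git annex initremote mybup type=bup encryption=none buprepo=server:/srv/bup
git annex copy big-video.mpg --to mybup   # push the content into the bup repository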

[1] https://github.com/bup/bup

[2] http://git-annex.branchable.com/

[3] http://git-annex.branchable.com/special_remotes/bup/


But does one hard disk failure destroy the whole archive or only one file in it?

If there is really no alternative to copying everything as one big archive, you have to decide between using a compressed or an uncompressed archive.

The contents of uncompressed archives like tarballs can still be detected by file recovery software even if the archive file itself can no longer be read (e.g. due to a corrupt header).

Using compressed archives can be dangerous because some tools refuse to extract anything after a checksum error, which can be triggered by as little as a single flipped bit in the archive file.

Of course, one can minimize the risk by not storing hundreds of files in one compressed archive, but rather hundreds of compressed files in one uncompressed archive:

gzip *                     # compress each file individually
tar cf archive.tar *.gz    # pack the compressed files into one uncompressed tarball
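
The payoff of this layout is that damage stays contained: each member can be tested on its own, and only the broken ones are lost. A quick integrity check could look like this:

tar xf archive.tar                        # unpack the individual .gz members
for f in *.gz; do
    gzip -t "$f" || echo "damaged: $f"    # -t tests integrity without extracting
done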

Though I have never seen lots of gzipped files in a tarball in the wild before; only the opposite (i.e. tar.gz files) is popular.

Is there any difference between ZIP and ISO files?

ZIP is a (mostly, but not necessarily) compressed archive format, while ISO denotes raw data copied on a low-level basis from an optical disc into a file. The latter can contain literally anything.
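
To make the distinction concrete: a ZIP is built file by file with an archiver, while an ISO is typically read block for block off a disc (the device path here is just an example):

zip -r archive.zip documents/         # pack (and compress) a directory tree
dd if=/dev/sr0 of=disc.iso bs=2048    # raw sector-level copy of an optical disc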