What medium should be used for long term, high volume, data storage (archival)?

Short answer

It's impossible to guarantee a long timeframe because of entropy (also called death!). Digital data decays and dies, just like anything else in the universe. But the decay can be slowed down.

There's currently no fail-proof, scientifically proven way to guarantee 30+ years of cold data archival. Some projects aim to do that, like the Rosetta Disk project of the Long Now Foundation, but they are still very costly and have a low data density (about 50 MB).

In the meantime, you can use scientifically tested, resilient optical media for cold storage, like HTL-type Blu-ray discs (e.g., Panasonic's) or archival-grade DVDs like Verbatim Gold Archival, and keep them in air-tight boxes in a cool spot (avoid high temperatures) and out of the light.

Also, be REDUNDANT: make multiple copies of your data (at least 4), compute hashes to check regularly that everything is alright, and every few years rewrite your data onto new discs. Also, use plenty of error correcting codes: they will allow you to repair your corrupted data!

Long answer

Why does data get corrupted over time? The answer lies in one word: entropy. This is one of the primary and unavoidable forces of the universe, which makes systems become less and less ordered over time. Data corruption is exactly that: disorder creeping into the ordering of your bits. In other words, the Universe hates your data.

Fighting entropy is exactly like fighting death: you will never truly win. But you can find ways to slow death, just as you can slow entropy. You can also trick entropy by repairing corruption (in other words, you cannot stop corruption, but you can repair it after it happens, provided you took measures beforehand!). Just like anything about life and death, there's no magic bullet and no one-size-fits-all solution, and the best approaches require you to directly engage in the digital curation of your data. And even if you do everything correctly, you are not guaranteed to keep your data safe: you only maximize your chances.

Now for the good news: there are quite efficient ways to keep your data if you combine good-quality storage media with good archival/curation strategies. In short: design for failure.

What are good curation strategies? Let's get one thing straight: most of the information you will find is about backups, not archival. The issue is that most people carry their knowledge of backup strategies over to archival, and thus a lot of myths are commonly repeated. Storing data for a few years (backup) and storing data for as long as possible, spanning decades at least (archival), are totally different goals and thus require different tools and strategies.

Luckily, there is quite a lot of research and scientific evidence available, so I advise relying on scientific papers rather than on forums or magazines. Here, I will summarize some of my readings.

Also, be wary of marketing claims and non-independent studies claiming that such-and-such storage medium is perfect. Remember the famous BBC Domesday Project: "Digital Domesday Book lasts 15 years not 1000". Always cross-check the studies with truly independent papers, and if there are none, assume the storage medium is not good for archival.

Let's clarify what you are looking for (from your question):

  • Long-term archival: you want to keep copies of your sensitive, irreproducible "personal" data. Archiving is fundamentally different from backing up, as well explained here: backups are for dynamic, technical data that gets regularly updated and thus needs to be refreshed into new backups (i.e., OS, work folder layouts, etc.), whereas archives are static data that you will likely write only once and just read from time to time. Archives are for timeless, usually personal, data.

  • Cold storage: you want to avoid maintaining your archived data as much as possible. This is a BIG constraint, as it means that the medium must use components and a writing methodology that stay stable for a very long time, without any intervention on your part and without requiring any connection to a computer or power supply.

To ease our analysis, let's first study cold storage solutions, and then long-term archival strategies.

Cold storage media

We defined above what a good cold storage medium should be: it should retain data for a long time without requiring any intervention (that's why it's called "cold": you can just store it in a closet and never plug it into a computer to maintain the data).

Paper may seem like the most resilient storage medium on Earth, because we still find very old manuscripts from ancient times. However, paper suffers from major drawbacks: first, the data density is very low (you cannot store more than ~100 KB on a sheet of paper, even with tiny characters and computer tools), and it degrades over time without any way to monitor it: paper, just like hard drives, suffers from silent corruption. But whereas you can monitor silent corruption of digital data, you cannot on paper. For example, you cannot guarantee that a printed picture will retain the same colors over even a decade: the colors will degrade, and you have no way to recover the original colors. Of course, you can curate your pictures if you are a pro at image restoration, but this is highly time consuming, whereas with digital data you can automate this curation and restoration process.

Hard disk drives (HDDs) are known to have an average lifespan of 3 to 8 years: they do not just degrade over time, they are guaranteed to eventually die (i.e., become inaccessible). The following curves show this tendency for all HDDs to die at a staggering rate:

[Figure: bathtub curve showing the evolution of HDD failure rate by error type (also applicable to any engineered device)]

[Figure: HDD failure rate over time, all error types merged]

Source: Backblaze

You can see that there are three types of HDDs with respect to failure: the rapidly dying ones (e.g., manufacturing defects, bad-quality HDDs, head failures, etc.), the constant-death-rate ones (well manufactured, dying for various "normal" reasons; this is the case for most HDDs), and finally the robust ones that live a bit longer than most and eventually die soon after the "normal" ones (e.g., lucky HDDs, lightly used, ideal environmental conditions, etc.). Thus, your HDD is guaranteed to die sooner or later.

Why do HDDs die so often? After all, the data is written on magnetic platters, and the magnetic field can last decades before fading away. The reason they die is that the storage medium (the magnetic platters) and the reading hardware (controller board + moving head) are coupled: they cannot be dissociated. You can't just extract the platters and read them with another head, because the controller board (which converts the physical signal into digital data) differs from one HDD to another (even for the same brand and model, it depends on the factory of origin), and the internal mechanism with the moving head is so intricate that nowadays it's practically impossible for a human to reposition a head over the platters without destroying them.

In addition, HDDs are known to lose their magnetization over time if left unused (and SSDs similarly lose their electrical charge when left unpowered). Thus, you cannot just put a hard disk in a closet and expect it to retain data without ever being powered: you need to plug your HDD into a computer at least once every year or two. HDDs are therefore clearly not a good fit for cold storage.

Magnetic tapes: they are often described as the go-to medium for backup needs, and by extension for archival. The problem with magnetic tapes is that they are VERY sensitive: the magnetic oxide particles can easily be deteriorated by sun, water, air or scratches, demagnetized by time or any electromagnetic device, simply flake off over time, or suffer from print-through. That's why they are usually used only in datacenters by professionals. Also, it has never been proven that they can retain data for more than a decade. So why are they so often advised for backups? Because they used to be cheap: back in the day, magnetic tapes were 10x to 100x cheaper than HDDs, and HDDs tended to be much less reliable than they are now. So magnetic tapes are primarily advised for backups because of cost effectiveness, not because of resilience, which is what interests us the most when it comes to archiving data.

CompactFlash and Secure Digital (SD) cards are known to be quite sturdy and robust, able to survive catastrophic conditions.

The memory cards in most cameras are virtually indestructible, found Digital Camera Shopper magazine. Five memory card formats survived being boiled, trampled, washed and dunked in coffee or cola.

However, being flash-based media, they rely on trapped electrical charge to retain the data, and if the cells lose their charge the data may be totally lost. They are thus not a perfect fit for cold storage (you need to occasionally rewrite the whole card to refresh the charge), but they can be a good medium for backups and short- or medium-term archival.

Optical media: optical media are a class of storage media that rely on a laser to read the data, like CDs, DVDs or Blu-ray discs (BD). They can be seen as an evolution of paper, but the data is written at such a tiny scale that a more precise and resilient material than paper was needed, and optical discs are just that. The two biggest advantages of optical media are that the storage medium is decoupled from the reading hardware (i.e., if your DVD reader fails, you can always buy another one to read your disc) and that they are laser-based, which makes them universal and future-proof (i.e., as long as you know how to make a laser, you can always tweak it to read the bits of an optical disc by emulation, just like CAMILEON did for the BBC Domesday Project).

Like any technology, new iterations not only offer higher density (more storage room), but also better error correction and better resilience against environmental decay (not always, but generally true). The first debate about DVD reliability was between DVD-R and DVD+R, and even if DVD-R is still common nowadays, DVD+R is recognized as more reliable and precise. There are now archival-grade DVDs, specifically made for cold storage, claiming to withstand a minimum of ~20 years without any maintenance:

Verbatim Gold Archival DVD-R [...] has been rated as the most reliable DVD-R in a thorough long-term stress test by the well regarded German c't magazine (c't 16/2008, pages 116-123) [...] achieving a minimum durability of 18 years and an average durability of 32 to 127 years (at 25C, 50% humidity). No other disc came anywhere close to these values, the second best DVD-R had a minimum durability of only 5 years.

From LinuxTech.net.

Furthermore, some companies specialize in very long-term DVD archival and market it extensively, like the M-Disc from Millenniata or the DataTresorDisc, claiming that their discs can retain data for over 1000 years, as verified by some (non-independent) studies (from 2009) among other, less scientific ones.

This all sounds very promising! Unluckily, there are not enough independent scientific studies to confirm these claims, and the few that are available are not so enthusiastic:

[Figure: humidity (80% RH) and temperature (80°C) accelerated ageing of several DVD brands over 2000 hours (about 83 days) of testing, with regular checks of data readability]

Translated from a 2012 study by the French institution for digital data archival (Archives de France).

The first graph shows DVDs with slow degradation curves, the second shows DVDs with rapid degradation, and the third shows the special "very long-term" DVDs like the M-Disc and DataTresorDisc. As we can see, their performance does not quite match the claims, being below or merely on par with standard, non-archival-grade DVDs!

However, inorganic optical discs such as the M-Disc and DataTresorDisc do have one advantage: they are quite insensitive to light degradation:

[Figure: accelerated ageing using light (750 W/m²) over 240 hours, for several DVD brands]

These are great results, but an archival-grade DVD such as the Verbatim Gold Archival achieves the same performance, and furthermore, light is the easiest parameter to control: it's quite easy to put DVDs in a closed box or closet, removing any possible impact of light whatsoever. A DVD that is very resilient to temperature and humidity would be much more useful than one that is merely resilient to light.

The same research team also surveyed the Blu-ray market to see whether any brand offered a good medium for long-term cold storage. Here are their findings:

[Figure: humidity and temperature accelerated ageing of several Blu-ray brands, under the same parameters as for DVDs]

[Figure: light accelerated ageing of several Blu-ray brands, same parameters]

Translated from this 2012 study by the Archives de France.

Two summaries of all the findings (in French) are available here and here.

In the end, the best Blu-ray disc (from Panasonic) performed similarly to the best archival-grade DVD in the humidity+temperature test, while being virtually insensitive to light! And that Blu-ray disc isn't even archival grade. Furthermore, Blu-ray discs use a stronger error correcting code than DVDs (which themselves use a stronger one than CDs), which further reduces the risk of losing data. Thus, it seems that some Blu-ray discs may be a very good choice for cold storage.

And indeed, some companies like Panasonic and Sony are starting to work on archival-grade, high-density Blu-ray discs, announcing that they will be able to offer 300 GB to 1 TB of storage with an average lifespan of 50 years. Big companies are also turning towards optical media for cold storage (because it consumes far fewer resources, since discs can be cold stored without any power supply); for example, Facebook developed a robotic system that uses Blu-ray discs as "cold storage" for data their systems rarely access.

Long Now archival initiative: there are other interesting leads, such as the Rosetta Disk project by the Long Now Foundation, which writes microscopically scaled pages of Genesis in every language on Earth it has been translated into. This is a great project, and the first to offer a medium that can store about 50 MB for really very long-term cold storage (since the pages are physically etched into the disk), with future-proof access since you only need optical magnification to read the data (no weird format specifications nor technological hassles like the blue-violet laser of the Blu-ray, just a powerful magnifier!). However, these disks are still made by hand and are estimated to cost about $20K, which is a bit too much for a personal archival scheme, I guess.

Internet-based solutions: yet another way to cold store your data is over the net. However, consumer cloud backup solutions are not a great fit, the primary concern being that the cloud hosting company may not live as long as you would like to keep your data. Other reasons include the fact that backing up is horribly slow (since everything transfers over the internet) and that most providers require the files to also exist on your system to keep them online. For example, both CrashPlan and Backblaze will permanently delete files that have not been seen on your computer in the last 30 days, so if you want to back up data that you store only on external hard drives, you will have to plug in your USB HDD at least once a month and sync with the cloud to reset the countdown. However, some cloud services, such as SpiderOak, offer to keep your files indefinitely (as long as you pay, of course) without a countdown. So read the conditions of the cloud-based backup solution you choose very carefully.

An alternative to cloud backup providers is to rent your own private server online, if possible one with automatic mirroring/backup of your data in case of hardware failure on their side (a few even guarantee against data loss in their contracts, but of course that's more expensive). This is a great solution, first because you still own your data, and second because you don't have to manage hardware failures yourself: that's your host's responsibility. And if one day your host goes out of business, you can still get your data back (choose a serious host that won't shut down overnight but will notify you beforehand; maybe you can even have that put into the contract) and rehost it elsewhere.

If you don't want the hassle of setting up your own private online server, and if you can afford it, Amazon offers a data archiving service called Glacier, whose purpose is exactly to cold store your data for the long term. It offers 11 nines of durability per year per archive, the same as their other S3 tiers, but at a much lower price. The catch is that retrieval isn't free and can take anywhere from a few minutes (Standard retrieval from Glacier) to 48 hours (Bulk retrieval from Glacier Deep Archive).
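
If you go this route, uploading is straightforward with the AWS SDK. Here is a minimal sketch using Python and boto3, where the bucket name, object key and local file are hypothetical and AWS credentials are assumed to be already configured:

    # A minimal sketch of pushing an archive to the Glacier Deep Archive tier
    # through the S3 API with boto3 (pip install boto3).
    import boto3

    s3 = boto3.client("s3")

    with open("family_photos_2019.zip", "rb") as f:
        s3.put_object(
            Bucket="my-archive-bucket",                 # hypothetical bucket
            Key="archives/family_photos_2019.zip",
            Body=f,
            StorageClass="DEEP_ARCHIVE",                # cheapest, slowest-to-restore tier
        )

    # Retrieval is a two-step process: first request a restore, then download
    # once the object is temporarily available again, e.g.:
    # s3.restore_object(
    #     Bucket="my-archive-bucket",
    #     Key="archives/family_photos_2019.zip",
    #     RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
    # )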

Shortcomings of cold storage: there is one big flaw in any cold storage medium: there is no integrity checking. Cold storage media CANNOT automatically check the integrity of the data (they can merely implement error correcting schemes to "heal" a bit of the damage after corruption has happened, but corruption can be neither prevented nor automatically managed!) because, unlike a computer, they have no processing unit to compute, journal, check and correct the filesystem. With a computer and multiple storage units, by contrast, you can automatically check the integrity of your archives and automatically re-mirror onto another unit if corruption happens in a data archive (as long as you have multiple copies of the same archive).

Long-Term Archival

Even with the best currently available technologies, digital data can only be cold stored for a few decades (about 20 years). Thus, in the long run, you cannot rely on cold storage alone: you need to set up a methodology for your archiving process to ensure that your data can be retrieved in the future (even across technological changes) and that you minimize the risk of losing it. In other words, you need to become the digital curator of your data, repairing corruption when it happens and creating fresh copies when needed.

There are no foolproof rules, but here are a few established curation strategies, and in particular a magical tool that will make your job easier:

  • Redundancy/replication principle: redundancy is the only tool that can counter the effects of entropy, a principle grounded in information theory. To keep data, you need to duplicate it. Error correcting codes are an automated application of this redundancy principle. However, you also need to make your data redundant yourself: multiple copies of the same data on different discs, multiple copies on different types of media (so that if one medium fails because of an intrinsic problem, there is little chance that the copies on other media fail at the same time), etc. In particular, you should always have at least 3 copies of your data, also called triple modular redundancy in engineering, so that if one copy becomes corrupted, you can cast a simple majority vote across the 3 copies to repair your files. Always remember the sailor's compass advice:

It is useless to bring two compasses, because if one goes wrong, you can never know which one is correct, or if both are wrong. Always take one compass, or three or more.

  • Error correcting codes: this is the magical tool that will make your life easier and your data safer. Error correcting codes (ECCs) are a mathematical construct that generates extra data which can be used to repair your files. They are more efficient than simple replication (i.e., making multiple copies of your files) because they can repair far more of your data using far less storage space, and they can even be used to detect whether a file is corrupted at all and to locate the corrupted parts. In fact, this is exactly an application of the redundancy principle, but in a cleverer way than replication. This technique is used extensively in nearly all long-range communication nowadays, such as 4G, WiMAX, and even NASA's space communications. Unluckily, although ECCs are omnipresent in telecommunications, they are not in file repair, maybe because it's a bit complex. However, some software is available, such as the well-known (but now old) PAR2, DVDisaster (which adds error correction codes to optical discs) and pyFileFixity (which I develop in part to overcome PAR2's limitations and issues). There are also filesystems that optionally implement Reed-Solomon, such as ZFS for Linux or ReFS for Windows, which are technically a generalization of RAID 5 (a minimal sketch of the underlying principle is shown right after this list).

  • Check the integrity of your files regularly: hash your files and check them from time to time (e.g., once a year, though it depends on the storage medium and environmental conditions). When you see that your files have suffered corruption, it's time to repair them using the ECCs you generated (if you did) and/or to make a fresh copy of your data on a new storage medium. Checking data, repairing corruption and making fresh copies is a very good curation cycle that will keep your data safe. Checking in particular is very important, because your copies can get silently corrupted, and if you then copy from copies that have already been tampered with, you will end up with only corrupted files. This is even more important with cold storage media such as optical discs, which CANNOT automatically check the integrity of the data (they already implement ECCs to heal a little, but they can neither check themselves nor create fresh copies automatically, that's your job!). To monitor file changes, you can use the rfigc.py script of pyFileFixity or other UNIX tools such as md5deep. You can also check the health status of some storage media, such as hard drives, with tools like Hard Disk Sentinel or the open source smartmontools.

  • Store your archival media in different locations (with at least one copy outside your house!) to guard against real-life catastrophic events like flood or fire. For example, one optical disc at your workplace, or a cloud-based backup, can be a good way to meet this requirement (even if a cloud provider can shut down at any moment, as long as you have other copies you will be safe; the cloud provider only serves as an offsite archive in case of emergency).

  • Store in dedicated containers with controlled environmental parameters: for optical media, store away from light and in a water-tight box to avoid humidity. For hard drives and SD cards, store in anti-static/anti-magnetic sleeves so that stray electricity or magnetic fields cannot tamper with the drive. You can also put the medium in an air-tight, water-tight bag or box and store it in a freezer: low temperatures slow entropy, and you can extend the lifespan of any storage medium quite a lot this way (just make sure no water can get inside, otherwise your medium will die quickly).

  • Use good-quality hardware and check it beforehand (e.g., when you buy an SD card, test the whole card with software such as HDDScan to check that everything is alright before writing your data). This is particularly important for optical drives, because their quality drastically affects the quality of your burnt discs, as demonstrated by the Archives de France study (a bad DVD burner will produce DVDs that last far less long).

  • Choose your file formats carefully: not all file formats are resilient against corruption, and some are downright fragile. For example, .jpg images can be made totally broken and unreadable by tampering with only one or two bytes. The same goes for 7-Zip archives (which are solid by default). This is ridiculous, so be careful about the format of the files you archive. As a rule of thumb, plain text is best; if you need compression, use non-solid zip, and for images use JPEG 2000 (not fully open source yet...). More info and reviews from professional digital curators here, here, and here.

  • Store, alongside your data archives, every piece of software and every specification needed to read the data. Remember that specifications change rapidly, so in the future your data may no longer be readable even if you can still access the file. You should therefore prefer open source formats and software, and store the program's source code along with your data so that you can always adapt the program to run on a new OS or computer.

  • Lots of other methods and approaches are available here, here and in various parts of the Internet.
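
To make the error correcting code idea from the list above concrete, here is a minimal sketch using the reedsolo Python package (pip install reedsolo). The file names are hypothetical, and this only illustrates the principle: real tools such as PAR2, DVDisaster or pyFileFixity add chunking, indexing and meta-data protection on top of it.

    # A minimal Reed-Solomon sketch with the reedsolo package; file names are
    # hypothetical. 32 parity bytes per 255-byte block can repair up to 16
    # corrupted bytes per block (at unknown positions).
    from reedsolo import RSCodec

    rsc = RSCodec(32)

    with open("contract_scan.pdf", "rb") as f:
        data = f.read()

    protected = rsc.encode(data)                 # original data + parity bytes
    with open("contract_scan.pdf.rs", "wb") as f:
        f.write(protected)

    # Years later: even if some bytes of the .rs file have rotted, decode() can
    # locate and repair them, as long as no block exceeds the correction capacity.
    with open("contract_scan.pdf.rs", "rb") as f:
        damaged = f.read()

    result = rsc.decode(damaged)
    repaired = result[0] if isinstance(result, tuple) else result  # API differs across versions
    with open("contract_scan_repaired.pdf", "wb") as f:
        f.write(bytes(repaired))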

Conclusion

I advise using whatever you have available, but always respect the redundancy principle (make 4 copies!), always check integrity regularly (so you need to pre-generate a database of MD5/SHA-1 hashes beforehand), and create fresh copies in case of corruption. If you do that, you can technically keep your data as long as you want, whatever your storage medium. The time between checks depends on the reliability of your storage media: for a floppy disk, check every 2 months; for an HTL Blu-ray, every 2 to 3 years.
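
As an illustration of such a hash database, here is a minimal manifest script in the spirit of md5deep or pyFileFixity's rfigc.py (both do the job better); the archive path and manifest name are hypothetical. Run it once to create the manifest, then re-run it with the "check" argument from time to time and investigate any reported mismatch.

    # A minimal sketch of a hash manifest for periodic integrity checks.
    import hashlib
    import json
    import pathlib
    import sys

    ARCHIVE_ROOT = pathlib.Path("D:/archive")          # hypothetical archive folder
    MANIFEST = pathlib.Path("archive_manifest.json")   # hash database kept alongside

    def sha256(path, chunk_size=1 << 20):
        """Hash a file in chunks so huge files don't fill the RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    def create_manifest():
        manifest = {str(p.relative_to(ARCHIVE_ROOT)): sha256(p)
                    for p in sorted(ARCHIVE_ROOT.rglob("*")) if p.is_file()}
        MANIFEST.write_text(json.dumps(manifest, indent=2))
        print(f"Recorded {len(manifest)} files.")

    def check_manifest():
        manifest = json.loads(MANIFEST.read_text())
        for relpath, expected in manifest.items():
            path = ARCHIVE_ROOT / relpath
            if not path.exists():
                print(f"MISSING:   {relpath}")
            elif sha256(path) != expected:
                print(f"CORRUPTED: {relpath}")

    if __name__ == "__main__":
        check_manifest() if "check" in sys.argv[1:] else create_manifest()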

In the ideal case, for cold storage I advise using HTL Blu-ray discs or archival-grade DVDs, stored in water-tight opaque boxes in a cool place. In addition, you can use SD cards and cloud-based providers such as SpiderOak to store the redundant copies of your data, or even hard drives if that's more accessible to you.

Use plenty of error correcting codes: they will save the day. You can also make multiple copies of the ECC files (but multiple copies of your data are more important than multiple copies of the ECCs, because ECC files can repair themselves!).

These strategies can all be implemented using the open source set of tools I am developing: pyFileFixity. This tool was in fact started because of this discussion, after I found that there was no free tool to completely manage file fixity. Please also refer to the project's readme and wiki for more info on file fixity and digital curation.

On a final note, I really do hope more R&D will be put into this problem. This is a major issue for our society, which digitizes more and more data without any guarantee that this mass of information will survive more than a few years. That's quite depressing, and I really do think this issue should be brought much more to the forefront, so that it becomes a selling point for manufacturers to make storage devices that can last for future generations.

/EDIT: read below for a practical curation routine.


Paper

Other than archival ink on archival paper in sealed storage, no current medium is proven to last an average of 100 years without any sort of maintenance.

Archival Paper

Older papers were made from materials such as linen and hemp, and so are naturally alkaline or acid-free, therefore lasting hundreds of years. 20th-century and most modern paper is usually made from wood pulp, which is often acidic and does not keep for long periods.

Archival Inks

These permanent, non-fading inks are resistant to light, heat and water, and contain no impurities that can affect the permanence of paper or photographic materials. Black Actinic Inks are chemically stable and feature an inorganic pigment that has no tendency to absorb impurities like other ink pigments can.

Redundant storage

Torvalds once said

Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it

Which suggests you should not rely on a single copy on a single medium.

Not magnetic media?

http://www.zdnet.com/blog/perlow/the-bell-tolls-for-your-magnetic-media/9364?tag=content;siu-container

  • Typical example of irretrievable degradation of magnetic media.
  • Issues of hardware and software (and data formats)

Not specialized systems

In 2002, there were great fears that the discs would become unreadable as computers capable of reading the format had become rare and drives capable of accessing the discs even rarer. Aside from the difficulty of emulating the original code, a major issue was that the still images had been stored on the laserdisc as single-frame analogue video [...]

http://en.wikipedia.org/wiki/BBC_Domesday_Project#Preservation

Long Term Personal storage

http://www.zdnet.com/blog/storage/long-term-personal-data-storage/376

  • both the media AND the format can become unreadable.
  • print on acid-free paper with pigment inks and store in a cool, dry and dark place.
  • The first problem is picking data formats for maximum longevity.
  • Avoid using proprietary formats
  • UCSF is transferring all their original tapes - many in now-obsolete formats like BetaSP and VHS - to the 75 Mbit motion JPEG 2000 format

Quick follow-up to my previous answer above: this one will be more concise, and extended with additional (but secondary) information and references that I could not add to the first answer because of the 30,000-character answer limit.

Since long-term archival is a curation process, here are some other things you might want to pay attention to in order to make your process more efficient and less time- (and resource-) consuming:

  • Deduplication: since the only way to ensure long-term archival is through deliberately designed redundancy, you want to avoid useless redundant data (e.g., copies of files you fetched from your USB key onto your archival hard drive when you already have a copy coming from your main computer!). Unwanted redundant data, usually called duplicates, is bad for storage cost (it takes more storage space, yet you will have a hard time finding the copies when you need them), for your process (what if you have different versions of the same file? How do you know which copy is the correct one?) and for your time (it adds to the transfer times when you synchronize the backup to all your archives). That's why professional archival services usually offer automated deduplication: files that are exactly identical get the same inode and take no additional space. That's what SpiderOak does, for example. There are automated tools you can use, and the ZFS (Linux) or ReFS (Windows) filesystems can do it automatically for you (a minimal duplicate-finding sketch is shown after this list).

  • Prioritization/categorization: as you can see, long-term archival is a time-consuming process that needs to be conducted regularly (sanity checks, synchronizing archives across media, making new archives on new media to replace dying ones, repairing files using error correcting codes, etc.). To minimize the time it costs you, try to define different protection schemes depending on the priority of your data, based on categories. The idea is that when you move data from your computer to one of the external hard drives you use for long-term archival, you place it directly into a folder that defines its backup priority: "unimportant", "personal", "important" or "critical". Then you can define a different backup strategy for each folder: reserve the full protection (e.g., backup on 3 hard drives + cloud + error correcting codes + Blu-rays) for the most critical data you want to keep your whole life (the "critical" folder), use a medium protection for "important" data (e.g., backup on 3 hard drives + cloud), just copy "personal" to at least two external hard drives, and give "unimportant" no copy (or maybe one hard drive, if the synchronization isn't too long...). Usually, you will see that "unimportant" contains the most data, "personal" less, "important" much less, and "critical" is quite tiny (less than 50 GB for me). For example, in "critical" you would put your house contract and your marriage and childbirth pictures. In "important" go documents you don't want to lose, like legal documents, some important photos and videos of memorable events, etc. In "personal" you put all your personal photos, holiday videos and work documents: these are documents and media you'd like to keep, but you won't die of regret if you lose them (and that's good, because this folder is usually HUGE, so you will probably lose some files in the long run...). "Unimportant" is all the stuff you download from the internet and various files and media you don't really care about (like software, games and movies). The bottom line: the more files you want to archive long term, the harder (and more time consuming) it will be, so keep the set of files that gets this special treatment to a minimum.

  • Meta-data is a critical weak spot: even with good curation strategies, there is usually one thing that isn't protected: the meta-data. Meta-data is the information about your files, for example: the directory tree (yep, this is only a few bytes, but if you lose it, you get your files back in total disorder!), the filename and extension, the timestamp (which may matter to you), etc. This might not seem like a big deal, but imagine the following: what if tomorrow all your files (including the files shipped with your software) ended up inside one flat folder, without their filenames or extensions? Would you be able to recover the files you need from the billions of files on your computer by manual inspection? Don't think this is an unusual scenario; it can happen as easily as a power outage or a crash in the middle of a copy: the partition being written to can be totally destroyed (the infamous RAW partition type). To overcome this issue, you should prepare your data for data recovery: to ensure you keep the meta-data, you can bundle the files together with their meta-data using non-solid archives such as ZIP DEFLATE or DAR (but not tar); see the small bundling sketch after this list. Some tools and filesystems offer automated meta-data redundancy, such as DVDisaster (for optical discs) and ZFS/ReFS (for hard drives). Then, in case of a meta-data crash, you can try to recover your partitions using TestDisk or GetDataBack (which allow partial directory tree recovery) or ISOBuster (for optical discs) to recover the directory tree and other meta-data. If this all fails, you can fall back to filescraping using PhotoRec: this will extract all the files it recognizes, but in total disorder and without filenames or timestamps; only the data itself will be recovered. If you zipped important files, you will be able to recover the meta-data inside the zip (even if the zip's own filename and timestamp are gone, the files inside it will still carry the correct meta-data). However, you will have to check all the filescraped files one by one manually, which is time consuming. To safeguard against this possibility, you can generate an integrity checksum file beforehand using pyFileFixity or PAR2, and then use it after filescraping to automatically recognize and rename the files based on their content (this is the only way to automate meta-data recovery after filescraping, because filescraping can technically only recover content, not meta-data).

  • Test your file formats and curation strategies for yourself: instead of trusting articles about which format is better than another, try it yourself with pyFileFixity's filetamper.py, or simply by changing a few bytes in a copy of a file with a hex editor: you will see that most file formats can break down with as few as 3 altered bytes (a minimal tamper-test sketch is shown after this list). So you really ought to choose your file formats carefully: prefer simple text files for notes, and use resilient file formats for media (more resilient ones are still being worked on, such as MPEG-4 with variable error correcting codes, which ffmpeg implements; ref will be added), or generate your own error correcting codes.

  • Read statistical studies, don't believe claims: as I said in the previous answer, extravagant claims are made all the time about the longevity of storage media without any scientific backing, and you should be particularly wary of that. Indeed, nothing in the law prevents a manufacturer from boasting fake, unverifiable longevity claims. Prefer statistical studies, such as Backblaze's annual report on hard drive failure rates.

  • Choose storage media with a long warranty. A warranty cannot bring your data back, but it tells you how the manufacturer evaluates the failure rate of its product (otherwise, warranty claims would cost them too much if the failure rate were too high during the warranty period).
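
As mentioned in the deduplication point above, here is a minimal duplicate finder based on content hashes; it is only a sketch (filesystem-level deduplication like ZFS/ReFS works at the block level and is far more capable), and the scanned folder is hypothetical.

    # A minimal duplicate finder: group files by SHA-256 of their content.
    import hashlib
    import pathlib
    from collections import defaultdict

    def sha256(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    files_by_hash = defaultdict(list)
    for path in pathlib.Path("D:/archive").rglob("*"):   # hypothetical archive folder
        if path.is_file():
            files_by_hash[sha256(path)].append(path)

    for digest, paths in files_by_hash.items():
        if len(paths) > 1:                                # same content, several paths
            print(f"Duplicates ({digest[:12]}...):")
            for path in paths:
                print("   ", path)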

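For the meta-data point above, here is a minimal sketch of bundling a folder into a non-solid ZIP so that each file's relative path and timestamp travel with its content; the source folder and archive name are hypothetical, and each entry is compressed independently, so one corrupted entry does not take down the whole archive.

    # A minimal sketch of bundling files together with their meta-data in a ZIP.
    import pathlib
    import zipfile

    source = pathlib.Path("D:/archive/critical")          # hypothetical folder
    with zipfile.ZipFile("critical_bundle.zip", "w",
                         compression=zipfile.ZIP_DEFLATED) as bundle:
        for path in source.rglob("*"):
            if path.is_file():
                # arcname preserves the directory tree inside the archive;
                # the ZIP entry itself records the file's modification timestamp.
                bundle.write(path, arcname=path.relative_to(source.parent))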

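And for the format-testing point, here is a minimal byte-flip tamper test in the spirit of pyFileFixity's filetamper.py (a sketch only; the real tool offers many corruption modes). Always corrupt a copy, never the original; the file name is hypothetical.

    # Corrupt a COPY of a file at a few random positions, then try to open it
    # with its usual application to see how gracefully the format degrades.
    import random
    import shutil

    ORIGINAL = "holiday_video.mp4"                 # hypothetical file to test
    TAMPERED = "holiday_video_tampered.mp4"

    shutil.copyfile(ORIGINAL, TAMPERED)

    with open(TAMPERED, "r+b") as f:
        f.seek(0, 2)                               # jump to the end to get the size
        size = f.tell()
        for pos in random.sample(range(size), k=3):  # flip 3 random bytes
            f.seek(pos)
            original_byte = f.read(1)[0]
            f.seek(pos)
            f.write(bytes([original_byte ^ 0xFF]))
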
An update on the scheme I use: I apply the prioritization strategy described above, and I added the cloud backup service SpiderOak to my scheme, because it has a plan with unlimited storage and it is fully encrypted, so I retain sole ownership of my data. I do NOT use it as the sole backup medium for any of my data; it's only an additional layer.

So here's my current scheme:

  • 3 hard drive copies, regularly checked and synchronized: 2 stored in two different places and 1 always on me (I use it to stash garbage and to do quick backups).
  • SpiderOak with the unlimited storage plan
  • Blu-ray discs for really sensitive but not-too-big data (I limit the data stored this way to 50 GB)
  • pyFileFixity and DVDisaster for folders I really want to make sure I keep in the long run.

My daily routine is like this: I always carry one 2.5" portable USB HDD that I use to stash unimportant stuff (moving files off my computer onto the HDD) or to back up important stuff (copying files to the HDD while keeping a copy on my computer). For really critical stuff, I additionally activate the online backup to SpiderOak (I have a folder on my computer for critical stuff, so I just move critical files there and SpiderOak synchronizes it automatically). For REALLY critical files, I also compute an error correction file using pyFileFixity.

To summarize, I store critical stuff on the portable HDD, the SpiderOak cloud and my computer, so I have 3 copies at any time with just two quick actions (copy to portable HDD and move to the SpiderOak folder). If one copy gets corrupted, I can run a majority vote to fix it using pyFileFixity. It's a very low-cost scheme (both in money and in time) but very efficient, and it implements all the core tenets of digital curation (triple redundancy, copies in different locations and on different media, integrity checks and ECC by SpiderOak).
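
For illustration, here is what a byte-wise majority vote across 3 copies boils down to; this is only a sketch (the file names are hypothetical, and real tools like pyFileFixity handle copies of unequal length and block-level voting far more carefully).

    # Triple modular redundancy: for each byte position, keep the value that
    # at least 2 of the 3 copies agree on.
    from collections import Counter

    copy_names = ["photo_hdd1.jpg", "photo_hdd2.jpg", "photo_cloud.jpg"]
    copies = []
    for name in copy_names:
        with open(name, "rb") as f:
            copies.append(f.read())

    repaired = bytes(Counter(column).most_common(1)[0][0] for column in zip(*copies))

    with open("photo_repaired.jpg", "wb") as f:
        f.write(repaired)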

Then, every 3 to 6 months, I synchronize my portable HDD to my second HDD at home, and every 6 to 12 months I synchronize my portable HDD to my third HDD, which is at another house. This provides the additional benefit of rotation (if in 6 months I realize something went wrong in my last backup and I deleted critical files, I can get them back from one of the two home HDDs).

Finally, I burned some very critical files to Blu-ray discs using DVDisaster (plus additional ECC files generated with pyFileFixity, though I'm not sure that was necessary). I store them in an air-tight box in a closet and only check them every few years.

So you see, my scheme is not really a big burden: on a daily basis, it takes a few minutes to copy files to the portable HDD and to my SpiderOak folder, and then I just synchronize every 6 months to one or the other home HDD. This can take up to a day depending on how much data needs to be synchronized, but it's automated by software, so you just let a computer run it while you do something else (I use a $100 netbook I bought just for that, so I can work on my main computer at the same time without worrying about a crash in the middle of a copy, which can be dreadful and destroy the hard drive being written to). The error correcting codes and the Blu-ray scheme are used only rarely, for really critical data, so they are a bit more time consuming, but that's rare.

This scheme can be enhanced (as always), for example by using ZFS/ReFS on the hard drives: this would provide automated Reed-Solomon error correction and integrity checking (and ditto blocks!) without any manual intervention on my part (unlike pyFileFixity). Although ZFS cannot run under Windows (for the moment), ReFS allows similar error correction control at the filesystem level. It could also be a good idea to use these filesystems on external HDDs! A portable HDD running ZFS/ReFS with automated RS error correction and deduplication would be awesome (and ZFS seems to be quite fast, so copying should be quick!).

One last note: be careful with claims about the ECC capabilities of filesystems, such as in this list, because for most of them it is limited to the metadata only (e.g., APFS) or to RAID 1 mirroring (btrfs). To my knowledge, only ZFS and ReFS provide real error correcting codes (not simple mirroring) for both metadata and data, with ZFS being the most advanced currently (although still somewhat experimental as of 2018), in particular because ReFS drives cannot be bootable.

/UPDATE 2020: new solutions are emerging; they are still in an early experimental phase, use a decentralized approach often based on immutable blockchains, and are very interesting to explore, although most are probably not usable right now (I would not rely on them to back up critical data, but they could serve as a secondary backup if you feel adventurous):

  • Perkeep (comparison with other software). A similar project is Upspin. Both are actively developed as of early 2020.
  • Sia
  • Syncthing can facilitate mirroring backups between multiple devices; it's free and open source
  • libchop for developers
  • bitdust (rebuilding not yet ready so be careful!)