How long does one need to keep the original data of a study / publication?

This is a question you should ask the funding agency that is paying for your work. They usually have requirements for how the data that is produced should be handled.

As an example, this is what the main German funding agency, the DFG, says:

Den Regeln der Guten Wissenschaftlichen Praxis folgend sollen Forschungsdaten in der eigenen Einrichtung oder in einer fachlich einschlägigen, überregionalen Infrastruktur für mindestens 10 Jahre archiviert werden.

My rough translation:

In accordance with the rules of good scientific practice, research data should be archived at your own institution or in a subject-specific, supra-regional infrastructure for at least 10 years.

There are certainly different considerations when the research data contains private information, e.g. patient data. I can't speak to the regulations there; I'm only considering public research data that doesn't have any privacy implications.

In general, the scientific community benefits if the data remains available forever. And while the 10 years in my example are a minimum, many parts of the output, such as the publications themselves and structured data deposited in repositories (for example, crystal structures in the PDB), remain archived indefinitely. Many repositories, the PDB included, are also steadily increasing the amount of raw data they would like to receive along with the final result.


The question "how long should I keep data from my research?" is basically a synonym for "how long should my research remain relevant?" Once your work can no longer be reproduced, its value and its ability to be part of the discussion on its topic are seriously limited.

This doesn't mean that you necessarily need to stash your datasets forever, but I do wonder why anyone would ever choose to enforce the obsolescence of their own research. There are many excellent solutions online that make maintaining it costless and virtually effortless*. Something as simple as putting your code on GitHub and your data on Google Drive, making both public, and providing a link to them on your website will take care of the issue for the foreseeable future. Of course, if Microsoft (which owns GitHub) or Google ever decide to terminate these (extremely popular) platforms, you would have to make new arrangements, but at least everything would be collated in one place and ready to go elsewhere.

As for the GDPR, it's my understanding that it only applies to personal data. In fact, if you pull up the Wikipedia entry for it, the "Exceptions" section clearly lists scientific research.

The only other exception I can think of is the case of private or personally identifiable data, such as health surveys. In this case, control of the data was probably (hopefully?) spelled out in the initial proposals and/or reviewed by an IRB.

*I am assuming here that your data isn't "big" data. If your work relies on half a terabyte of data, everything changes. But if you work in the field of "big data" you likely have more knowledge of how to work with it already.