Sharing research-related code and datasets: split them, or share them together on a single platform?

Each platform has its uses, so it's best to use both: GitHub for your code, and a scientific data archival repository, such as Zenodo, for your code and data.

GitHub allows you to share your code in a manner that encourages collaboration through stars, issues, forks, pull requests, and notifications. So definitely put a copy of your code there. However, GitHub isn't designed to ensure the permanence of the hosted repositories. For example, one can always remove a repository entirely through GitHub's interface, and also delete or rewrite old code with a Git force push. In the longer term, we can't know whether the company controlling GitHub (it used to be GitHub itself, now it's Microsoft) will continue to operate it in a way that serves science. Consider how many corporate products and services have disappeared or changed tack over the past twenty years, even ones coming from large established companies, such as Google (check this list) or Sun Microsystems after it was acquired by Oracle (Java and Solaris licensing).

Consequently, for a better guarantee of your work's permanence and availability to future scientists, you should also put your data and code on a scientific data repository, such as Zenodo. This associates a DOI with your deposited artifacts, which will permanently link to the corresponding version. (Once you upload something to Zenodo, there is almost no way to undo the action.) GitHub and Zenodo even offer the ability to archive a specific GitHub release to Zenodo.
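If you prefer scripting the deposit over using the web form, Zenodo also exposes a REST API. Below is a minimal sketch, assuming a personal access token with deposit rights; the file name is a placeholder, and the endpoints should be double-checked against Zenodo's current API documentation.

    import requests

    ZENODO_API = "https://zenodo.org/api/deposit/depositions"
    ACCESS_TOKEN = "..."  # personal access token (assumed to have deposit scope)

    # Create an empty deposition (a draft record)
    r = requests.post(ZENODO_API, params={"access_token": ACCESS_TOKEN}, json={})
    r.raise_for_status()
    deposition = r.json()

    # Upload a file into the deposition's file bucket
    bucket_url = deposition["links"]["bucket"]
    with open("dataset.tar.gz", "rb") as fp:  # placeholder file name
        r = requests.put(bucket_url + "/dataset.tar.gz", data=fp,
                         params={"access_token": ACCESS_TOKEN})
        r.raise_for_status()

    # After adding metadata, the deposition can be published through the
    # web interface or the API, at which point it receives its DOI.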

If the volume of the data is large, or the data is likely to be used often on its own, then split it from the code and upload each to Zenodo as a separate record. Zenodo allows you to link the two (and also the GitHub repository) with the "Related Identifiers" metadata. You can use the "Compiles" and "Compiled by" relations for linking code with data, and the "Supplement to" relation to link your GitHub repository.
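As a rough illustration of what such links look like in the deposit metadata, here is a sketch of setting related identifiers through Zenodo's REST API; the deposition URL, DOI, and repository URL are hypothetical placeholders, and the exact relation names should be verified against Zenodo's metadata documentation.

    import requests

    # Hypothetical deposition URL and token
    DEPOSITION_URL = "https://zenodo.org/api/deposit/depositions/1234567"
    ACCESS_TOKEN = "..."

    metadata = {
        "metadata": {
            "title": "Example dataset",
            "upload_type": "dataset",
            "related_identifiers": [
                # The dataset is compiled by the accompanying software record (placeholder DOI).
                {"relation": "isCompiledBy", "identifier": "10.5281/zenodo.0000000"},
                # The upload supplements the development repository on GitHub (placeholder URL).
                {"relation": "isSupplementTo", "identifier": "https://github.com/example/project"},
            ],
        }
    }

    r = requests.put(DEPOSITION_URL, params={"access_token": ACCESS_TOKEN}, json=metadata)
    r.raise_for_status()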

For an example of a split associated with a large dataset, consider this dataset regarding the lifetime of code lines (3.7 GB), the associated software, and the software's GitHub repository. For an example of a split associated with an independently usable dataset, consider this list of repositories created by enterprises and the associated replication package.

If the volume of the data is modest, it's fine (or perhaps better) to combine the two in a single Zenodo upload, as is the case in this package regarding the completion of Wikipedia links.


As a user, I'd definitely like to get both in the same place. But if there are clear links between the two sites, I will not give up just because I have to click a couple more times. So, on that front, it's not very different for me.

As a creator, I'd first determine whether the code is essential for people who will be using the data. For example, code for data processing and import, variable management and labeling, missing-data imputation, and scale/index building should reasonably stay with the dataset. Other, more peripheral code, like that regression analysis done for the second manuscript, should probably stay with the publication. Inside the code it's easy to link back to where the data are hosted, so the linkage is preserved.
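As a hypothetical illustration of the kind of code that belongs with the dataset, a short preparation script might cover exactly those steps; all file and variable names below are made up.

    import pandas as pd

    # Import the raw data (hypothetical file name)
    raw = pd.read_csv("survey_raw.csv")

    # Variable management and labeling
    raw = raw.rename(columns={"q1": "trust_gov", "q2": "trust_media", "q3": "trust_science"})

    # Simple missing-data imputation (median per item)
    items = ["trust_gov", "trust_media", "trust_science"]
    raw[items] = raw[items].fillna(raw[items].median())

    # Scale/index building: average the items into a single index
    raw["trust_index"] = raw[items].mean(axis=1)

    # Save the prepared dataset alongside the raw one
    raw.to_csv("survey_prepared.csv", index=False)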

Another feature I'd pay attention to is whether the repository allows you to feature publications generated from the data. If it does, then it'd make sense to put the analysis code there. If it doesn't, I'd keep them separate, or that analysis would feel very out of context.

The sense of permanence may be a factor as well. To a repository I'd probably only submit data, expecting that nothing will be updated. If an analysis is still ongoing and the code will be actively updated and released, I'd consider something more dynamic like GitHub.

Also, check where your colleagues or people in your field would go. Those should be safe choices.


Code should always be on GitHub (or another version-control platform). A lot of researchers don't put much effort into good programming practices, but learning to use version control properly is both best practice for developing your project and best practice for sharing the code with others. It also happens to be the best way to handle collaboration with co-authors.

For data that isn't "big", I would rank your options this way:

  • Good: Put your data in any sort of data repository, even if it's a public Google Drive
  • Better: Sync your data with your code on GitHub
  • Best: Your code handles data retrieval as well as wrangling and analysis

The last option is the best because you can then host your data wherever you like, update just the retrieval part of your code if you need to move it, and distribute only links to your code. Syncing your data to your GitHub repo works just fine if it's not a very large dataset; otherwise this becomes infeasible.
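As a sketch of that "best" option, the retrieval step can be a small function that is the only thing you touch if the data moves; the URL and file paths below are placeholders.

    import os
    import urllib.request
    import pandas as pd

    # Placeholder URL; only this constant needs to change if the data is moved.
    DATA_URL = "https://example.org/myproject/measurements.csv"
    LOCAL_PATH = "data/measurements.csv"

    def fetch_data(url=DATA_URL, path=LOCAL_PATH):
        """Download the dataset once, cache it locally, and return it as a DataFrame."""
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            urllib.request.urlretrieve(url, path)
        return pd.read_csv(path)

    # Wrangling and analysis operate on whatever fetch_data() returns.
    df = fetch_data()
    print(df.describe())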

If your data is "big", you probably want to look at something like S3.