Best way to organize bioinformatics projects?

I'm a software specialist embedded in a team of research scientists, though in the earth sciences, not the life sciences. A lot of what you write is familiar to me.

One thing to bear in mind is that much of what you have learned in your studies is about engineering software for continued use. As you have observed a lot of what research scientists do is about one-off use and the engineered approach is not suitable. If you want to implement some aspects of good software engineering you are going to have to pick your battles carefully.

Before you start fighting any battles, you are going to have to critically examine your own ideas to ensure that what you learned in school about general-purpose software engineering is valid for your current situation. Don't assume that it is.

In my case the first battle I picked was the implementation of source code control. It wasn't hard to find examples of all the things that go wrong when you don't have version control in place:

  • some users had dozens of directories each with different versions of the 'same' code, and only the haziest idea of what most of them did that was unique, or why they were there;
  • some users had lost useful modifications by overwriting them and not being able to remember what they had done;
  • it was easy to find situations where people were working on what should have been the same program but were in fact developing incompatibly in different directions;
  • etc etc etc

Once I had gathered the information -- and make sure you keep good notes about who said what and what it cost them -- it became relatively easy to paint a picture of a better world with source code control.

Next, well, next you have to choose your own next battle. But one of the seeds of doubt you have to sow in your scientist-colleagues minds is 'reproducibility'. Scientific experiments are not valid if they are not reproducible; if their experiments involve software (and they always do) then careful software engineering is essential for reproducibility. A lot of this is about data provenance, but that's a topic for another day.


Part of the issue here is the distinction between documentation for software vs documentation for publication.

For software development (and research plan) design, the important documentation is structural and intentional. Thus, modeling the data, reasons why you are doing something, etc. I strongly recommend using the skills you've learned in CS for documenting your research plan. Having a plan for what you want to do gives you a lot of freedom to multi-task while long analyses are running.

On the other hand, a lot of bioinformatics work is analysis. Here, you need to treat documentation like a lab notebook, and not necessarily a project plan. You want to be document what you did, maybe a brief comment why (e.g. when you are troubleshooting data), and what the outputs and results are. What I do is fairly simple. First, I start in a directory and create a git repo. Then, whenever I change some file, I commit it to the repo. As much as possible, I try to name data outputs in a way that I can drop then into my git ignore files. Then, as much as possible, I work on a single terminal session for a project at a time, and when I hit a pause point (like when I've got a set of jobs sent up to the grid, I run 'history |cut -c 8-' and paste that into a lab notes file. I then edit the file to add comments for what I did, and remember, change the git add/commit lines to git checkout (I have a script that does this based on the commit messages). As long as I start it in the right directory, and my external data doesn't go away, this means that I can recreate the entire process later.

For any even slightly complex processing tasks, I write a script to do it, so that my notebook, as much as possible, looks clean. To an approximation, a helper script can be viewed as a subroutine in a larger project, and should be documented internally to at least that level.