Hugely important database replete with errors

There are many critical scientific resources out there that have massive known flaws, but are still useful because the flaws don't prevent people from getting high value from the resources. GenBank, for example, is the predominant source of genetic information in the world and is also known to have many mislabelled sequences.

From what you have written, it is not clear whether something like this is also the case with the resource that you are dealing with. The course I would recommend you take depends on 1) the degree to which the flaws are known and 2) their degree of impact on scientific conclusions derived from the database. The key cases that I see are:

  • The flaws are well-known and researchers are able to work around them: Here, there's nothing you really have to do, and the maintainers are unlikely to be particularly responsive to your complaints, since their system is "good enough" and they likely have other priorities.

  • The flaws are well-known and difficult to work around: This seems the least likely case. Why would people use this database rather than the alternative that you mention? If it is so, however, you should probably just finish your thesis and move on: a paper on the flaws isn't interesting if they're already well-known, and while you should report your problems to the maintainers, you'll just be one more instance of the issues they already know about.

  • The flaws are not well-known and likely to cause serious problems in most research: In this case, a publication about the flaws is likely to be of interest and worth doing. It might or might not cause the database to be fixed, but it is likely to be important to alert researchers using the database to the problems in their work, creating pressure for the database to be fixed or people to migrate elsewhere.

  • The flaws are not well-known, but not likely to have a serious impact: This is likely to be the case if you are using the database in a very different way than most others, such that your research is more strongly impacted. In this case, talking about it in your thesis seems sufficient. You should document the problems you had and the flaws you discovered, but you are unlikely to get them corrected because you are not their target market.

Notice that in all of these cases, I assume that the database is unlikely to get fixed. That is because the persistence of the problems over time and the non-professional curation that you report indicate an organization that is probably missing either the resources or the incentives to make the fixes you would like, even in collaboration, although you might turn out to be pleasantly surprised.


You should contact the person above the database team.

The database that you describe seems to be huge and complicated. As such, any modification to it is likely to be large and complicated as well.

I once worked in a place with an atrocious database. The underlying architecture was simply badly designed. Everyone was aware of it; however, we decided to leave it that way.

Why? Because improving it meant redesigning and recreating the whole thing and then migrating all of the data from the old database to the new one. We estimated that it would take us months, if not years, of work to end up with a nice database.

At the end of the day, it was decided it was both less expensive and less of a hassle to correct by hand all of the errors created by the database than to improve the underlying system.

Which is to say, there is a possibility that they are aware of some of these problems but choose not to solve them, and not to inform the users, as doing so could harm their image. Imagine them answering you:

"We know our database has a lot of problems, but we're fine with it. Too bad for your errors."


It is injudicious to describe your own PhD project as boring. It sounds like you are not able to go beyond what is currently known. Well, it takes all sorts. Maybe you can teach what is known to those who don't know later.

Anyway I see no gain in getting into conflict with external groups.
You need to quantify what impact the database errors have on the results of your work and, ideally, show that the impact is minimal or that you can detect the errors when they occur. That should be the smallest part of your presentation. Using the issue to fill a void isn't going to pass muster.
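
As a rough illustration of what quantifying the impact could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the function names `toy_analysis` and `impact_of_errors` are hypothetical, the mean stands in for whatever analysis your thesis actually runs, and the `is_suspect` flags are assumed to come from some check you can perform yourself (for instance, verifying a sample of records by hand). The idea is simply to run the same analysis with and without the records you believe are wrong and report how much the result moves.

```python
import statistics


def toy_analysis(values):
    """Stand-in for the real analysis (assumption: here it is just the mean)."""
    return statistics.mean(values)


def impact_of_errors(values, is_suspect, analysis=toy_analysis):
    """Run the analysis on all records and on the unflagged records only.

    values     -- measurements taken from the database
    is_suspect -- parallel list of booleans, True where a record is believed wrong
    """
    clean = [v for v, bad in zip(values, is_suspect) if not bad]
    full_result = analysis(values)
    clean_result = analysis(clean)
    return {
        "full": full_result,
        "clean": clean_result,
        "abs_change": abs(full_result - clean_result),
        "fraction_flagged": sum(is_suspect) / len(values),
    }


# Made-up toy data: five records, one of which is flagged as erroneous.
values = [10.0, 12.0, 11.0, 50.0, 9.0]
flags = [False, False, False, True, False]
report = impact_of_errors(values, flags)
print(report)
```

If the change is small relative to the effect you are reporting, a short paragraph saying so is probably enough. If it is large, that is the point at which the flagged records need to be excluded or corrected before you draw conclusions.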