Should I test my algorithm using a standard dataset that I believe is flawed?

You leave a lot unsaid here, such as how expensive (in time and money) it is to run the algorithm on any given dataset and how much effort it would take to create a new one.

Three pieces of advice, though. If it is more or less standard to use the published data, then using it gives you a basis for comparing your algorithm against others. With a new dataset you lose that opportunity.

A new dataset also leaves you open to the charge that the "great" results you got were just an artifact of a dataset you created on your own. You should probably avoid that, at least in a first iteration.

There is no reason you couldn't use the existing dataset, gather your results, and also, in the same paper, discuss the shortcomings of that dataset. This gives you an opening for future work, and you might even get some advice on how to avoid the problem raised in the second point above.

OK, four pieces of advice. If you are new at this (and not a doctoral candidate writing a dissertation), you don't need to push the boundaries very far to make a contribution. You can use the simplest case now to set up your case 4 for the future. But if you buck current practice very much, you need to justify it.


On the other hand, if you are a doctoral candidate, the significance of your work is normally considered more important than the time you need to put into it, and the expectation is that the work is novel. But you still need to avoid the charge of tailoring the data to fit the desired results.


In general, I would recommend the third option from your question:

  3. Publish my algorithm using a more robust benchmark that I create on my own, in addition to the flawed benchmark. Many papers test their algorithms on additional custom datasets (though usually simulations), so this is not unusual. However, this would imply I consider the flawed benchmark valid, and its results would still be the ones most people look at.

As the other answers already suggest, you can explain in the paper what flaws you have perceived in the dataset, but still test on it anyway. That way, you do not necessarily imply that you consider the flawed benchmark valid, and you also avoid "accusations" that you made up your own dataset just to find something your algorithm performs well on.

Additionally, it may be worth noting that pointing out flaws in widely used benchmarks can be a very useful contribution in and of itself, maybe even the focus of an entire paper, provided you do it well and thoroughly. See some examples (also from the field of AI):

  • Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents: this paper points out flaws in the evaluation protocols used by a lot of existing research on the "Arcade Learning Environment" (a Reinforcement Learning benchmark). Importantly, it also proposes solutions.
  • Deep Reinforcement Learning that Matters: another paper that points out flaws in evaluation methodology, again in (deep) Reinforcement Learning; this one is more general and not restricted to the "Arcade Learning Environment".
  • A Detailed Analysis of the KDD Cup 99 Data Set: a much older example, and perhaps more similar to your case in that it points out flaws in a particular dataset commonly used for evaluations, rather than flaws in evaluation methodology.

So, you are basically saying that the standard dataset has a flaw / does not take something into account, and you already have an implementation you would like to test. It does not work well on the standard data because of the flaw, and you could create a further dataset on which your code shines.


I would do the following in the paper:

  • Highlight the initial broad problem you are trying to solve (things like real-time rendering, image registration, 3SAT, etc.). It would be best if you could already indicate there what the standard dataset does not cover, even though it is an inherent part of the problem (like "the absorbed light is not immediately reflected back at the same spot", "the objects are deformed in-between", n=2).
  • Present your solution, implementation details, and the special efforts you made to take care of said issue.
  • Test your implementation on the community-acknowledged standard dataset (with a note that it does not fully represent said issue). Additionally, build and present a better dataset that takes the above issue into account. Ensure the dataset is publicly available; you might want to use Zenodo or another public repository in order not to rely on your university website.

    Absolutely do test your method on the new dataset. This is the actual selling point: "There was a flaw in the old dataset / way of thinking. Here is a new dataset that does not have it. Here is a method that works well with it."

    The best comparison is to put an existing method (or methods) side by side with your new method on both the old and the new datasets (see the sketch below).
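
To make that concrete, here is a minimal, self-contained Python sketch of such a methods-by-datasets comparison. Everything in it is invented purely for illustration (the toy benchmarks, the `baseline` and `proposed` methods, the MSE metric); the only point is the structure: every method gets scored on every dataset, old and new.

```python
import random

random.seed(0)

# Toy stand-ins for the two benchmarks. The "old" one ignores some effect
# (here: a quadratic term); the "new" one includes it. Replace these with
# loaders for the real standard benchmark and your new dataset.
def old_benchmark(n=200):
    return [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(n))]

def new_benchmark(n=200):
    return [(x, 2.0 * x + 0.5 * x * x) for x in (random.uniform(-1, 1) for _ in range(n))]

# Toy stand-ins for an existing method from the literature and your own.
def baseline(x):
    return 2.0 * x                # models only the effect the old benchmark captures

def proposed(x):
    return 2.0 * x + 0.5 * x * x  # also models the effect the old benchmark misses

def mse(method, data):
    """Mean squared error of a method on a dataset; swap in your domain's metric."""
    return sum((method(x) - y) ** 2 for x, y in data) / len(data)

# The actual point: report every method on every dataset.
datasets = {"old benchmark": old_benchmark(), "new benchmark": new_benchmark()}
methods = {"baseline": baseline, "proposed": proposed}

for dataset_name, data in datasets.items():
    for method_name, method in methods.items():
        print(f"{method_name:>8} on {dataset_name}: MSE = {mse(method, data):.4f}")
```

On this toy data the baseline looks perfect on the old benchmark and degrades on the new one, while the proposed method does well on both; that two-by-two table is exactly the story the paper needs to tell with real data.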

Actually, come up with the new dataset and test your approach on it first. If it does not work, you do not need to write the paper.