The referee says that I did not attempt to "reject" my model – what does this mean?

Of course, it’s hard to be certain without knowing your work intimately, and you have to judge whether the referee’s request is reasonable, but:

This sounds like a request for Popperian falsifiability: The referee wants you to perform an experiment, analysis, or similar that – given your a priori knowledge – could reject the model.

Obviously my null hypothesis is that my model is wrong […]

I do not find this so obvious. In many cases, a model being wrong does not make for a feasible null hypothesis, because such a hypothesis would have to encompass every alternative model. There are exceptions, such as when your claim is that some model has predictive power: you can then investigate the null hypothesis that its predictions are no better than chance. But even then, your null hypothesis is something different from “my model is wrong”.
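As a rough illustration (not taken from your work), here is a minimal sketch in Python of such a test: hypothetical observations `y_observed`, hypothetical model predictions `y_predicted`, and a permutation test of the null hypothesis that the predictions agree with the data no better than chance. All names, the scoring choice, and the data-generating setup are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data and predictions, purely for illustration.
y_observed = rng.normal(size=100)
y_predicted = y_observed + rng.normal(scale=2.0, size=100)  # a weakly informative "model"

def score(pred, obs):
    """Correlation between predictions and observations."""
    return np.corrcoef(pred, obs)[0, 1]

observed_score = score(y_predicted, y_observed)

# Null hypothesis: the predictions are no better than chance.
# Build the null distribution by shuffling the pairing between predictions and data.
null_scores = np.array([
    score(rng.permutation(y_predicted), y_observed) for _ in range(10_000)
])
p_value = np.mean(null_scores >= observed_score)
print(f"score = {observed_score:.3f}, permutation p-value = {p_value:.4f}")
```

Note that the null hypothesis here is "the predictions are as good as chance", which is a precise, testable statement – still something quite different from "my model is wrong".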

Details: the null hypothesis is that the model does not significantly explain the data.

That does not sound like a null hypothesis. First, the word significant (in the sense of statistical testing) does not make sense within the hypothesis, as significance is a property of the data with respect to the null hypothesis. Second, what does your null population look like? For any given model, there are infinitely many models that describe the data even worse. Are they present in your null population?


It's been many years since I've been in Academia, so I may be wide of the mark, but my interpretation of what the referee is saying – loosely speaking – is that your experimental data is "too easy" on your hypothesis. While your experiments tend to suggest the hypothesis might be true ("results tried to confirm [...] the model"), they do not (in the opinion of the referee) sufficiently "stress" the hypothesis by including "difficult" data.

To take an analogy from software engineering, it is common practice to perform unit testing on any new block of code. While it is important to include tests that represent "normal" input data (that "[try] to confirm [...] the model"), it is also important to include "edge cases" and "difficult values" chosen essentially to try to "break" the new code ("attempt to reject the model").

For a simplistic example, consider a function add( param1, param2 ) designed to add two numbers together. A simple, if slightly naive, set of tests might check that add(1,1)==2, add(2,2)==4, add(5,5)==10 and add(10,10)==20. Passing these tests gives some evidence that the function is working correctly, but they don't really stress the code. Suppose that instead of writing result = param1 + param2 in the body of the function, you had result = param1 + param1 (either through "finger trouble" when first creating it, or a later search-and-replace that changed more than intended). Those tests will still pass, but they won't detect that the function is merely doubling the first parameter.
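To make this concrete, here is a minimal sketch in Python (names and values chosen purely for illustration): the buggy add passes every one of the "easy" symmetric tests above, and only a test deliberately chosen to "break" it actually fails.

```python
def add(param1, param2):
    # Intended: return param1 + param2
    return param1 + param1  # the accidental "doubling" bug

# "Easy" tests: both arguments are equal in every case, so the bug slips through.
assert add(1, 1) == 2
assert add(2, 2) == 4
assert add(5, 5) == 10
assert add(10, 10) == 20

# A test deliberately chosen to "break" the code: asymmetric inputs expose the bug.
assert add(2, 3) == 5  # raises AssertionError, because add(2, 3) returns 4
```

Running this raises an AssertionError on the last line, which is exactly the kind of "attempt to reject" the referee seems to be asking for: a check that could have failed, and would have failed if the code (or model) were wrong in this way.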

In conclusion, I believe the referee is saying that while the experimental data you've included does tend to support your hypothesis, you've not chosen a sufficiently wide range of experimental data, so you cannot yet claim that the hypothesis holds up even in the face of experiments deliberately designed to disprove it.