Reporting data dredging in a study

Yes. I would add a separate section to your paper entitled something like "Further exploratory analysis", report what you did and what you found, and note that until a study has been designed specifically to test your hypothesis, it remains a hypothesis. You can still suggest that it might be an attractive target for further study.


My best advice is to be very upfront about the fact that

1.) You found some relations in your data that were not a part of the original hypotheses you were interested in testing.

2.) These relations were still interesting enough to share, although the evidence should be taken with a grain of salt.

Because these relations were found by exploring the data rather than specified in advance, the evidence is not as strong as it would be if they had been the original hypotheses of interest. It's important that the way you write up your results reflects this.

In my opinion (as a PhD in statistics, for what that's worth), I'd include unadjusted p-values and confidence intervals, and label them as such, for example: "p-value (without adjusting for data exploration): 0.0013". That way the reader isn't in the dark about your interesting discovery, but also is not misled about the strength of the evidence.

On a pragmatic note, this means the previously unhypothesized finding alone is unlikely to be sufficient for publication, as one could argue that the strength of evidence for it is not particularly strong. But hitching this result onto the published paper seems quite reasonable if that connection has the potential to be interesting to other researchers in the field. One of my professors referred to this type of exploratory data analysis as "hypothesis generating" rather than "hypothesis testing".


This should be fine, so long as you're doing appropriate multiple hypothesis correction. Note in your manuscript what types of exploratory variables you evaluated for association, and how many of them there were. If your p-value is still significant after multiple hypothesis correction, that means there's still a stronger association than you'd expect by chance alone, which makes it an interesting variable.
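As a rough illustration of what such a correction does, here is a minimal sketch of two common adjustments, Bonferroni and Holm, written by hand in plain Python. The p-values are made up for the example, and the function names are my own; in practice you would probably use whatever your statistics package provides.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment: never more conservative than Bonferroni."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical unadjusted p-values for five exploratory variables.
raw = [0.0013, 0.04, 0.20, 0.51, 0.03]
bonf = bonferroni(raw)
hlm = holm(raw)

for p, b, h in zip(raw, bonf, hlm):
    print(f"raw={p:.4f}  bonferroni={b:.4f}  holm={h:.4f}")
```

Reporting the raw and adjusted values side by side, along with the number of variables tested, lets the reader judge the strength of the evidence for themselves.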

If you only report the interesting variable and don't mention the other 1000 variables you tested, you could be rightly accused of p-hacking, which occurs when someone ignores "researcher degrees of freedom" to inflate the significance of their result. There's nothing inherently wrong with testing exploratory variables; you just have to do it in a responsible manner. Pre-selecting the variables to study is essentially just a means of using prior knowledge to reduce the number of tests you have to correct for.
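To see why the other 1000 variables matter, here is a small simulation sketch. It assumes that none of the variables has any real association with the outcome, in which case each p-value (from a continuous test statistic) is uniformly distributed between 0 and 1. The function name, the seed, and the 0.05 threshold are just choices made for the illustration.

```python
import random


def count_false_positives(n_tests, alpha, seed):
    """Simulate n_tests null p-values and count how many look 'significant'."""
    rng = random.Random(seed)
    # Under the null hypothesis, p-values are uniform on [0, 1].
    p_values = [rng.random() for _ in range(n_tests)]
    # Hits at the usual uncorrected threshold.
    raw_hits = sum(p < alpha for p in p_values)
    # Hits at the Bonferroni-corrected threshold alpha / n_tests.
    bonferroni_hits = sum(p < alpha / n_tests for p in p_values)
    return raw_hits, bonferroni_hits


raw_hits, bonferroni_hits = count_false_positives(n_tests=1000, alpha=0.05, seed=0)
print(f"uncorrected hits: {raw_hits} of 1000")
print(f"Bonferroni hits:  {bonferroni_hits} of 1000")
```

With 1000 tests at a 0.05 threshold you expect around 50 uncorrected "hits" from pure noise, so picking out one of them and reporting it alone says very little.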

Obligatory xkcd