Removing outliers. Would it be called cherrypicking data?

One major pitfall is going into the analysis without an a priori protocol for dealing with outliers. From time to time, tension arises between analysts and investigators over whether a point should be removed. The investigator may wish to keep it because it drives the significant result; the analyst may be anxious to remove it because, well, it drives the significant result. Much of this can be avoided if there is an agreed-upon standard.

Anyhow, here are some suggestions (in no particular order):

  • Check the source records to determine whether it's an instrument or data-entry error.
  • If it's a plausible value but behaves like an outlier, a few options exist:
    • Do not bury plausible extreme values by reporting only the trimmed data.
    • Consider reporting results with and without it, and discuss the differences.
    • Use statistics that are more tolerant of extreme values, such as non-parametric tests or robust estimation methods (see the sketch after this list).
  • Keep a written record of all related changes; make sure the date, time, person responsible, decision made, etc. are all recorded.
  • If a correction needs to be made, do not change the raw data. Make the correction with a script/syntax file and save that syntax for reference.
  • Explain all outlier-related data changes in the report/manuscript. Reasons and methods should be provided so that your analysis can be replicated.
  • If there isn't a protocol, set one up now, before more cases like this come up.
  • Do not identify outliers after the analysis results are known. If possible, discuss the outlier with your colleagues without telling them the hypothesis test results.
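
For concreteness, here is a minimal sketch of the "report with and without, and use tolerant statistics" suggestion. It assumes Python with NumPy/SciPy, and the arrays are made-up placeholders rather than data from your study.

```python
# Minimal sketch (not from the original answer): show results with and without
# a suspect value, and use estimators/tests that tolerate extreme values.
# The numbers below are made-up placeholders.
import numpy as np
from scipy import stats

group_a = np.array([4.1, 3.8, 4.4, 4.0, 3.9, 9.7])  # 9.7 is the suspect point
group_b = np.array([3.5, 3.6, 3.2, 3.9, 3.4, 3.7])

# Report both versions rather than silently dropping the point.
for label, a in [("with suspect value", group_a),
                 ("without suspect value", group_a[:-1])]:
    print(label,
          "mean=%.2f" % a.mean(),
          "median=%.2f" % np.median(a),
          "20%%-trimmed mean=%.2f" % stats.trim_mean(a, 0.2))

# A rank-based test is less influenced by one extreme value than a t-test.
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
t_stat, p_t = stats.ttest_ind(group_a, group_b)
print("Mann-Whitney U p=%.3f, t-test p=%.3f" % (p_u, p_t))
```

The exact numbers don't matter; the point is to show both versions side by side rather than silently choosing one, and to note how much less the median and trimmed mean move when the suspect value is dropped.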

The questions in your title and body are different in a rather significant way.

In answer to "Could removing outliers be called cherrypicking data?": yes, of course it could.

In answer to "Would it be called cherrypicking data?", that depends on your justification (and apparent motivation).

To be honest, I would generally be very suspicious of any paper that removed data points without a very clear and justifiable reason. It isn't possible to tell whether that's the case for your example, as you haven't given any information on it; also, for statistical advice it would be more appropriate to ask over at Cross Validated.

There are statistical methods for identifying how much of an outlier a point is (leverage, for example) but they are not in and of themselves sufficient justification for removing a point.
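
As an illustration only (not part of the original answer), leverage values come from the diagonal of the hat matrix of a fitted regression. The sketch below assumes Python with NumPy and statsmodels, and uses made-up data.

```python
# Sketch: leverage (hat values) and Cook's distance for a simple regression.
# High values flag points for scrutiny; they do not justify deletion by themselves.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.append(rng.normal(0.0, 1.0, 20), 6.0)   # last x-value sits far from the rest
y = 2.0 * x + rng.normal(0.0, 0.5, 21)

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag            # diagonal of the hat matrix
cooks_d, _ = influence.cooks_distance           # combines leverage and residual size

# Print the three most extreme points by leverage.
for i in np.argsort(leverage)[::-1][:3]:
    print("point %d: x=%.2f, leverage=%.3f, Cook's D=%.3f"
          % (i, x[i], leverage[i], cooks_d[i]))
```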

[edit] thanks for adding more information to your question. Firstly, it's great that you haven't started collecting data yet, and well done for thinking about this at this stage. As @penguin-knight suggests, formulate a protocol for data collection and identifying suspect data before you start (a priori). However, do remember that the only defensible reason for removing an outlying data point is because you believe it is inaccurate and reflects a problem with how that sample was generated or quantified.

Secondly, regarding the ANOVA part of your question: one of the assumptions when using an ANOVA is that your residuals are normally distributed. 'Outliers' may indicate that this assumption is incorrect. If this is the case, simply ignoring some points is not the correct way to deal with this. Your simplest options are:

  • apply a transform to your data, for example a log-transform. There is a lengthier discussion of transforms and ANOVAs here: https://stats.stackexchange.com/questions/75005/non-normal-data-for-two-way-anova-which-transformation-to-choose

  • use a non-parametric test (which does not rely on the assumption of normally distributed residuals), such as the Mann-Whitney U test if you are comparing two groups or the Kruskal-Wallis test for more than two. A minimal sketch of both options follows.
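
Purely as an illustration (not code from either answer), the sketch below assumes Python with NumPy/SciPy and three made-up groups; it checks the residuals, then runs an ANOVA on log-transformed data and the rank-based alternatives.

```python
# Sketch: check the normality assumption, then either transform the data or
# switch to a rank-based test. The three groups are made-up placeholders.
import numpy as np
from scipy import stats

g1 = np.array([1.2, 1.9, 2.3, 1.5, 8.0])   # right-skewed, one large value
g2 = np.array([2.1, 2.8, 3.3, 2.5, 9.5])
g3 = np.array([0.9, 1.1, 1.6, 1.3, 5.2])

# In a one-way layout the residuals are just deviations from each group mean.
residuals = np.concatenate([g - g.mean() for g in (g1, g2, g3)])
print("Shapiro-Wilk on residuals: p=%.3f" % stats.shapiro(residuals).pvalue)

# Option 1: log-transform (all values are positive here), then run the ANOVA.
f_log, p_log = stats.f_oneway(np.log(g1), np.log(g2), np.log(g3))

# Option 2: rank-based tests that do not assume normally distributed residuals.
h_stat, p_kw = stats.kruskal(g1, g2, g3)               # 3+ groups
u_stat, p_mw = stats.mannwhitneyu(g1, g2)              # exactly two groups
print("ANOVA on logs p=%.3f, Kruskal-Wallis p=%.3f, Mann-Whitney (g1 vs g2) p=%.3f"
      % (p_log, p_kw, p_mw))
```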