Outliers are extreme data points which follow different patterns from the rest of the data; they have been defined as “data points which deviate so much from the other observations that they arouse suspicions that they were generated by a different mechanism” (Hawkins, 1980).
Is it dishonest to remove these outliers from our data? My initial reaction was to say yes, of course it is wrong to remove outliers; they are part of the data set, so their presence must have a purpose. Removing these data points and going on to prove your hypothesis would therefore be dishonest because, in the absence of the outliers, the results presented aren’t your actual results. However, I now believe outliers caused by anything but chance must be removed from the data set.
Some believe outliers shouldn’t be removed from the data set because, as researchers, we shouldn’t pick and choose which data points are included in the results just because they don’t appear to fit with the rest. In support of this, research has found that when humans hold a strong belief about something, we subconsciously seek out belief-consistent information and discard information contradicting our beliefs (Festinger, 1957). This is likely to affect the decision to remove outliers: if researchers have a strong belief in their hypothesis (which is most likely), they may subconsciously remove data that disagrees with it because they see those points as outliers. In this case, removing outliers would be for the researcher’s personal gain, i.e. a false significant result and a published research paper, rather than an attempt to find a true significant result.
Outliers have a number of causes. One is pure chance: samples are drawn at random, so some extreme cases are bound to occur. Another is measurement error, including instrument failure or sampling error, e.g. a sample that is unrepresentative of the population. In my opinion, the most common cause of outliers is participant error, for example missed trials, lack of understanding or lack of concentration; these participant errors can unfairly distort the mean and affect our statistics.
The presence of just one outlier in a data set can cause an array of problems, including skewed data, inflated or deflated means, a distorted range, and Type I and Type II errors. Another major reason why outliers may need to be removed is that they alter our ability to interpret statistical tests. Many statistical tests, such as t-tests, assume a normal distribution, so if an outlier skews the distribution, results may look significant when they are in fact not. In this case it would be dishonest to keep outliers in the data set, because conclusions drawn from it would be incorrect.
However, when performing statistical tests, outliers don’t always have to be removed from a data set in order to neutralise their negative effects. Non-parametric tests can be used to analyse data containing outliers because they do not assume a normal distribution, so their results are far less affected by extreme values. Equally, if an average of the data set is needed, we can compute the median instead of the mean, because outliers have a much smaller effect on it.
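The mean-versus-median point can be sketched quickly in Python. The reaction times below are made up for illustration (they are not from any real study): a single extreme trial drags the mean far from the typical value, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical reaction times in milliseconds; the final value is an
# outlier, e.g. a trial where the participant lost concentration.
times = [310, 295, 330, 305, 320, 315, 1900]

# The mean is pulled well above every "typical" observation by the
# single outlier, whereas the median stays in the middle of the data.
print(round(mean(times), 1))  # ≈ 539.3
print(median(times))          # 315
```

Dropping the outlier brings the mean back in line with the median, which is why the median is often described as a robust measure of central tendency.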
In conclusion, researchers must use their discretion when deciding whether to remove outliers from their data. In my opinion the removal of outliers isn’t dishonest: outliers caused by anything but chance should be removed from the data set. However, because removing outliers is a subjective process, researchers should report the anomaly in the discussion section of their research project and justify why the outlier(s) were removed. This will prevent data being removed without just cause, e.g. to enable researchers to prove their hypothesis.
- Hawkins, D.M. (1980). Identification of outliers. London: Chapman and Hall.
- Festinger, L. (1957). A theory of cognitive dissonance. Stanford, CA: Stanford University Press.