Is it dishonest to remove outliers from our data?

Outliers are extreme data points which follow different patterns to the rest of the data; they have been defined as “data points which deviate so much from the other observations that they arouse suspicions that they were generated by a different mechanism” (Hawkins, 1980).

 

Is it dishonest to remove these outliers from our data? My initial reaction was to say yes: of course it is wrong to remove outliers; they are part of the data set, so their presence must have a purpose! Removing these data points and going on to prove your hypothesis would therefore be dishonest because, in the absence of the outliers, the results presented aren’t your actual results. However, I now believe outliers caused by anything but chance must be removed from the data set.

Some believe outliers shouldn’t be removed from the data set because, as researchers, we shouldn’t pick and choose which data points are included in the results just because they don’t appear to fit with the rest of the data. In support of this, research has found that when humans have a strong belief about something, we subconsciously seek out belief-consistent information and discard information contradicting our beliefs (Festinger, 1957). This is likely to affect our decision to remove outliers; if a researcher has a strong belief in their hypothesis (which is most likely), they may subconsciously remove data that disagree with their hypothesis because they see them as outliers. In this case, removing outliers would be for the researcher’s personal gain, i.e. a false significant result and a published research paper, rather than an attempt to find a true significant result.

There are a number of causes of outliers. One is pure chance: samples are drawn at random, so there are bound to be some extreme cases. Another is measurement error, including instrument failure, or sampling error, e.g. a sample that is unrepresentative of the population. In my opinion, the most common cause of outliers is participant error, for example missed trials, lack of understanding or lack of concentration; these participant errors could unfairly distort the mean and affect our statistics.

The presence of just one outlier in a data set can cause an array of unnecessary problems, including skewed data, inflated or deflated means, a distorted range, and Type I and Type II errors. Another major reason why outliers need to be removed from data is that they alter our ability to interpret statistical tests. Many commonly used statistical tests, such as t-tests, assume a normal distribution; if an outlier causes the distribution to become skewed, results may look significant when they are in fact not, or fail to reach significance when a real effect exists. In this case it would be dishonest to keep outliers in the data set because conclusions drawn from the data would be incorrect.
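To make this concrete, here is a minimal Python sketch of how a single extreme score changes the summary statistics that tests like the t-test are built on. The reaction times are invented for illustration, and `summarise` is a hypothetical helper, not part of any library.

```python
import statistics

def summarise(values):
    """Return the mean, range and sample standard deviation of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),  # sample SD (divides by n - 1)
    }

# Hypothetical reaction times in milliseconds (invented for illustration).
clean = [480, 495, 500, 505, 510, 515, 520, 525]
with_outlier = clean + [1500]  # e.g. one participant who lost concentration

clean_stats = summarise(clean)
outlier_stats = summarise(with_outlier)

for label, stats in [("without outlier", clean_stats), ("with outlier", outlier_stats)]:
    print(f"{label}: mean={stats['mean']:.1f}, range={stats['range']}, sd={stats['sd']:.1f}")
```

In this made-up sample, one score out of nine is enough to pull the mean up by more than 100 ms and to inflate the standard deviation many times over. Since a t-test depends on both of those numbers, its result can no longer be trusted to describe the typical participant.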

However, when performing statistical tests, outliers don’t always have to be removed from a data set in order to limit their negative effects. Non-parametric tests can be used to analyse data that include outliers because they do not assume a normal distribution and typically work on ranks rather than raw values, so the results are far less affected by the presence of outliers. Equally, if an average of the data set is needed, we can compute the median instead of the mean because outliers have a lesser effect on it.
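As a rough illustration of why the median and rank-based methods are more robust, the sketch below compares how far the mean and the median move when one extreme score is added, and shows that ranks do not care how extreme that score is. The data are invented, and `ranks` is a hypothetical helper that assumes there are no tied values.

```python
import statistics

def ranks(values):
    """Rank distinct values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

# Hypothetical reaction times in milliseconds (invented for illustration).
clean = [480, 495, 500, 505, 510, 515, 520, 525]
with_outlier = clean + [1500]  # one extreme score
mild = clean + [530]           # the same data with an unremarkable slowest score

mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)
print(f"mean moves by {mean_shift:.1f} ms, median moves by {median_shift:.1f} ms")

# A rank-based test only sees the ordering, so 1500 and 530 look the same to it.
print(ranks(with_outlier) == ranks(mild))
```

Here the median barely moves while the mean shifts substantially, and the ranks are identical whether the slowest score is 530 ms or 1500 ms. That is the sense in which non-parametric tests limit the damage an outlier can do without anyone having to delete it.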

In conclusion, researchers must use their discretion when deciding whether to remove outliers from their data. In my opinion the removal of outliers from data isn’t dishonest; outliers which are caused by anything but chance should be removed from the data set. However, because the removal of outliers is a subjective process, researchers should report the anomaly in the discussion section of their research project and justify why they have removed the outlier(s). This will prevent data being removed without just cause, e.g. to enable researchers to prove their hypothesis.

  • Hawkins, D.M. (1980). Identification of outliers. London: Chapman and Hall.
  • Festinger, L. (1957). A theory of cognitive dissonance. Stanford, CA: Stanford University Press.

4 thoughts on “Is it dishonest to remove outliers from our data?”

  1. psucab says:

    Firstly, I think this is such a tough topic to talk about as there are no set rules! I completely agree that on the face of it, it seems completely crazy that it would be acceptable for a researcher to remove any type of data, and the idea that it is at the researcher’s discretion is a mere gateway for unreliable research. Having said this, after thinking more about the topic I am completely swayed to agree that so long as one can justify removing the outlier then it is fine to do so. I agree that in order to remove an outlier the researcher must look at the participant’s data individually, and try to determine whether the participant was simply not very good at the task, or didn’t understand, or wanted to get some sona credits out of the way and so wasn’t paying full attention. The latter is rare though, I believe. Good blog!

  2. psud63 says:

    I think you did well to include how they occur, why they occur, as well as your argument as to whether or not it is fair to remove outliers. I agree with the point that it’s not desirable to remove outliers, but sometimes it is essential as they will skew the results. For example, if someone in an experiment is just button pressing, and not actually taking part in the experiment, then their results would not be representative and it is acceptable to remove them. Really good blog, well done!
