Outlier Detection & Treatment - A Fresh Perspective

Missing value imputation and outlier treatment are two of the most important steps in the preliminary stage of any statistical modeling exercise. We have covered missing value imputation by various methods in one of our previous articles.

Let us try to answer a few basic and a few advanced questions about outliers.

The questions we are going to address in this article are:

1.  What are outliers?
2.  Why, at all, do we need to bother about outliers?
3.  How to detect these outliers?
4.  How do we treat outliers?
5.  And last but not least: is an outlier really an outlier?


Q1.  What are outliers?

In statistics, an outlier is an observation that is numerically distant from the rest of the data.

Grubbs defined an outlier as:
An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.

We define outliers as:
(A few) values that are altogether different from the rest of the observations, where the reason for their being distinct is not known.




Q2.  Why, at all, do we need to bother about outliers?
Outliers can mislead analysts into drawing altogether different insights from the data.

From a statistician's point of view, an outlier distorts the range, mean and standard deviation of the data. It can also suggest a wrong trend in the data, as demonstrated below:

The solid red line is the actual trend line for the data cluster, but due to the presence of the outlier point (Xn, Yn), the trend line is distorted to the dashed line.

There actually exists a positive correlation between X and Y, but the presence of the outlier leads the analyst towards the opposite conclusion, i.e. a negative correlation.
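
The effect described above can be reproduced with a small sketch. The data below is made up purely for illustration (it is not the data behind the figure), and the helper names `pearson` and `summary` are our own. It compares the range, mean, standard deviation and Pearson correlation of a small positively correlated cluster before and after a single outlier point (Xn, Yn) is added.

```python
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def summary(values):
    """Range, mean and sample standard deviation of a sequence."""
    return {
        "range": max(values) - min(values),
        "mean": mean(values),
        "stdev": stdev(values),
    }


# Hypothetical data cluster with a clear positive trend.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 4, 5, 6, 7]

# The same data with one hypothetical outlier (Xn, Yn): large X, very low Y.
x_out = x + [30]
y_out = y + [-20]

clean_stats = summary(y)
dirty_stats = summary(y_out)
r_clean = pearson(x, y)
r_dirty = pearson(x_out, y_out)

print("Y without outlier:", clean_stats)
print("Y with outlier:   ", dirty_stats)
print("correlation without outlier:", round(r_clean, 3))
print("correlation with outlier:   ", round(r_dirty, 3))
```

With this made-up data, a single extreme point is enough to widen the range, pull the mean down, inflate the standard deviation and turn the sign of the correlation from positive to negative. How strong the distortion is in practice depends on the sample size and on how far the outlier lies from the cluster.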

The presence of an outlier can be catastrophic for the results of an analysis. If strategic decisions are taken based on such results, they can be detrimental to the business.