Outlier Detection - Univariate



Uni-variate Methods of Outlier Detection


1. Visual Method

An outlier can be detected by looking at the data itself or by plotting it in various types of charts, such as a line chart. The box plot is another classical method for detecting outliers visually.
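As a rough numeric counterpart to the box plot, the sketch below lists the points that a conventional box plot would draw beyond its whiskers. It assumes the common 1.5 x IQR whisker rule, which the article itself does not state; the function name `boxplot_outliers` and the sample data are made up for illustration.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying beyond the whiskers of a conventional box plot.

    Assumption: whiskers extend `whisker` * IQR beyond the quartiles.
    """
    # Quartile values depend on the interpolation method; this uses the
    # default ("exclusive") method of statistics.quantiles.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up data with one obviously extreme value.
sample = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]
flagged = boxplot_outliers(sample)
print(flagged)
```

Other tools may compute quartiles slightly differently, so borderline points can be flagged by one tool and not by another.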




2. Z Score (Grubbs and Dixon Test)

Visual inspection, though very useful most of the time, cannot be used when the process is automated, because it needs human intervention. The Z score is a very useful way to detect outliers in a uni-variate fashion. In this method the data is standardized: the mean is subtracted from every value in the series and the result is divided by the standard deviation, so that the standardized series has mean 0 and standard deviation 1.

In SAS, this is done with PROC STANDARD:

proc standard data=data_name mean=0 std=1 out=standardized_data;
    var var_name;
    by class_variables;    /* optional; the input must be sorted by these variables */
run;
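For readers without SAS, the same standardization can be sketched in a few lines of Python. This is only an illustration: the function name `z_scores` and the sample data are made up, and the sample standard deviation (n - 1 divisor) is an assumption about which divisor is wanted.

```python
import statistics


def z_scores(values):
    """Standardize a series: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: sample standard deviation (n - 1 divisor).
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Made-up data for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(data)
print([round(z, 2) for z in standardized])
```

After standardization the series has mean 0 and standard deviation 1, so every value can be read as "how many standard deviations from the mean".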



Once the z-scores have been calculated, the following rules of thumb are applied for outlier detection:

1. If the data set is small (N <= 50), observations with a z-score smaller than -2.5 or larger than 2.5 might be regarded as outliers.

2. In larger data sets (N > 50), observations with z-scores smaller than -3.3 or larger than 3.3 are typically regarded as outliers.

3. If the sample is very large (N > 1000), a higher cut-off might be selected for considering an observation an outlier.
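The three rules above can be sketched as a small Python helper. The function names are hypothetical, and because the article gives no specific cut-off for very large samples, that value is left as a parameter the analyst must supply; when it is not supplied, the sketch falls back to 3.3.

```python
import statistics


def z_cutoff(n, large_sample_cutoff=None):
    """Pick a z-score cut-off from the sample size, following the rules of thumb."""
    if n <= 50:
        return 2.5
    if n > 1000 and large_sample_cutoff is not None:
        # The text only says "a higher cut-off"; the value is the caller's choice.
        return large_sample_cutoff
    return 3.3


def flag_outliers(values, large_sample_cutoff=None):
    """Return the values whose absolute z-score exceeds the size-based cut-off."""
    cutoff = z_cutoff(len(values), large_sample_cutoff)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return [v for v in values if abs((v - mean) / std) > cutoff]


# Made-up data: 20 ordinary values and one extreme value.
readings = list(range(10, 30)) + [200]
print(flag_outliers(readings))
```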


When there are multiple outliers, we can detect them by treating one outlier at a time: flag the most extreme observation, remove it, recalculate the z-scores, and repeat. The process has to be iterative because the mean and standard deviation change drastically when an outlier is removed.

For more details, please check the embedded demo file.
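A minimal Python sketch of this one-at-a-time procedure is shown below. It is an interpretation of the description, not the contents of the demo file: the function name, the default cut-off of 2.5, and the `max_rounds` safety limit are all assumptions.

```python
import statistics


def iterative_z_outliers(values, cutoff=2.5, max_rounds=10):
    """Remove the most extreme point one at a time, recomputing mean and std each round."""
    remaining = list(values)
    outliers = []
    for _ in range(max_rounds):
        if len(remaining) < 3:
            break
        mean = statistics.fmean(remaining)
        std = statistics.stdev(remaining)
        if std == 0:
            break  # all remaining values are identical
        # The observation farthest from the current mean.
        worst = max(remaining, key=lambda v: abs(v - mean))
        if abs(worst - mean) / std <= cutoff:
            break  # nothing exceeds the cut-off any more
        outliers.append(worst)
        remaining.remove(worst)
    return outliers, remaining


# Made-up data with two outliers; a single pass would flag only 100,
# because 100 inflates the standard deviation and hides 60.
series = list(range(10, 20)) + [60, 100]
found, cleaned = iterative_z_outliers(series)
print(found)
```

On this sample the first round flags 100; only after it is removed does the z-score of 60 rise above the cut-off, which is the masking effect the iteration is meant to address.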


3. MAD (Median Absolute Deviation) Method

This method is quite similar to the Z-score method, except that it uses the median instead of the mean.
The shortcoming of the Z-score method, namely its inability to detect multiple outliers at once, can be overcome by using a statistic (the median) that is far less sensitive to the presence of outliers.

In this method a modified Z-score is calculated as:

Modified Z = 0.6745 * (Xi - Median(X)) / MAD

where MAD = Median(|Xi - Median(X)|), i.e. the median of the absolute deviations from the median.

MAD stands for Median Absolute Deviation. Any value in a data set whose modified Z-score exceeds 3.5 in absolute value is considered an "outlier".
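The modified Z-score rule can be sketched in Python as follows. The function names and sample data are made up for illustration, and raising an error when the MAD is zero is a design choice of this sketch rather than something the article prescribes.

```python
import statistics


def modified_z_scores(values):
    """Modified Z = 0.6745 * (x - median) / MAD, with MAD = median(|x - median|)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values equal the median; the score is undefined.
        raise ValueError("MAD is zero; modified Z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


def mad_outliers(values, cutoff=3.5):
    """Return the values whose absolute modified Z-score exceeds the cut-off."""
    scores = modified_z_scores(values)
    return [v for v, z in zip(values, scores) if abs(z) > cutoff]


# Made-up data with two outliers.
observations = list(range(10, 20)) + [60, 100]
print(mad_outliers(observations))
```

Because the median and the MAD barely move when a few extreme values are present, both outliers in this sample are flagged in a single pass, with no need for iteration.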

To understand the process for this method, please refer to the embedded file (MAD).





Let's take a break here; we will cover the rest of outlier detection and treatment in subsequent articles:

Seasonality Index and Trend Variables

Outlier Detection & Treatment - Part 2 - Multivariate





Family of the article:


1. Outlier Detection & Treatment - A fresh perspective
2. Seasonality Index and Trend Variables
3. Outlier Detection & Treatment - Part 2 - Multivariate
