Missing Value Treatment

<<< Looking at the picture itself, I know, who comes to your mind ... Aishwarya Rai, Right ?  >>>

Do you know what have you done ? You have imputed a missing value. While you already know missing value imputation, why am I writing this article ? All right, then let me write on "How should missing value be imputed in data, statistically ?"


What is a Missing Value ?

I need not to define missing value, as we already understand the word literally.

What are the reasons behind Missing Value ?

Well, there can be multiple reasons for it. It could be error of data processing person/software, It can be a case of data not being captured for particular scenario ... and so on.

What's the problem with having missing value ? 
Sample from Data 1

Suppose you have a dataset for modeling with 5 variables (1 dependent and 4 independent) and 12 observations. You are running a Linear Regression or Logistic Regression on the same.

Statistical softwares won't consider the observations with either of the variable containing missing value. Consider the example (right), observations with missing values (Red) won't be considered by software in analysis. We lose information for Feb, Apr, Jul and October month, which is 33.33% of the data.

What should I do then?

Know thy data !  You need take a call, looking at your data. In a very large data size, if you have 1% - 2% missing value, you can go ahead leaving those observations. However, this doesn't apply in case of time-series data; we need to impute the missing value, however less these might be.

If your data size is small and you don't want to take a a risk of losing even a single observation, you would need to impute these missing values.

How should I impute the missing value ?

Now you are talking ... Let me tell you various methods (conventional and non-conventional) for missing value imputation.

 Method 1: Impute with Mean/Median


One of the methods suggested for missing value imputation is replacing the missing value with one of the central tendency measures : mean or median. However, you can be little tricky while doing so.

Suppose, you a data with following columns : Country, state, city, year, month, temperature and you need to impute missing value of temperature.
Rather than replacing missing value with mean of whole country all months' data, it is better to consider mean of relevant portion of data i.e. better we consider the mean which is city and month specific (provided data in available for the city and month combination in another years).


Not recommended : as it would replace with the mean of whole column
Proc SQL;
Create table imputed as select Country, state, city, year, month, coalesce (temperature, mean(temperature)) as temperature
from original_data;
Quit;

Recommended : as it would replace with the city and month specific mean
Proc SQL;
Create table imputed as select Country, state, city, year, month, coalesce (temperature, mean(temperature)) as temperature
from original_data
group by Country, state, city, month;
Quit;


So, if October month's data of a particular year for Delhi is missing, we replace the mean data with of October months of Delhi itself, not nation wide all months' average.

Method 2: Impute in a Interpolation style

In the case of time series, we suggest to either follow either Method 1, or imputation with mean of lag and lead. Suppose, sales data for July is missing, you can impute it with the mean sales of Jul and August of the same year.

Proc expand data = parent_data out = imputed;
Convert sales = lag_sales /transformout = (lag 1);
Convert sales= lead_sales /transformout = (lead 1);
Run;

Data imputed;
Set  imputed;
If sales = . then sales = sum(lag_sales,lead_sales)/2;
Run;

Unlike method 1, method 2 doesn't consider the seasonality effect while imputing missing values.

Method 3: Regression Method

This is quite evolved method of missing value imputation. Suppose we have data with Y, X1, X2, X3, X4 variables as shown in data 1 above. Now, we need to impute the missing value of X1 while we know that X1 is highly correlated with X2 and X4. We can regress the variable X1 over X2 and X4 and get the missing values imputed. Let's see how :

Consider the data... 

data parent_data;
Input Year Month $ Y X1 X2 X3 X4;
cards;
2014 Jan 1766 15 44 114 463
2014 Feb 1634 11 50 . 400
2014 Mar 1631 20 50 123 466
2014 Apr 1990  . 45 123 426
2014 May 1791 10 47 133 429
2014 Jun 1491 13 50 132 407
2014 Jul 1713 19 . 133 408
2014 Aug 1708 18 47 111 489
2014 Sep 1641 19 48 140 402
2014 Oct 1721 13 44 111 .
2014 Nov 1436 16 49 149 408
2014 Dec 1151 15 45 133 456
2015 Jan 1677 20 46 148 458
2015 Feb 1087 13 46 109 413
2015 Mar 1410 20 48 100 471
2015 Apr 1457 13 40 128 464
2015 May 1821 16 48 132 422
2015 Jun 1515 11 49 123 464
2015 Jul 1899 18 40 110 424
2015 Aug 1678 17 40 145 415
2015 Sep 1061 16 50 124 428
2015 Oct 1432 20 41 121 474
2015 Nov 1912 10 41 127 404
2015 Dec 1862 10 49 110 446

;
Run;

Now regress and impute...

proc Reg data = parent_data ;
model x1 = x2 x4;
output out = imputed p = pred;
run;
quit;

data imputed;
set  imputed;
if x1 = . then x1 = pred;
run;


This data has been provided for demo purpose only, please try the same on some real project data, where it makes more sense. Please ensure the variables come significant in the model and model fits well, else it is not advisable to accept the imputations.

Useful SAS Procedure : Proc Expand

Using very basic form of Proc Expand imputes all the missing value using a cubic spline function.

Proc expand data= parent_data out = imputed ;

Run;

You  can also use this sort of the options :

Proc Expand data = parent_data out = impute_1;
convert x1 = x1_1 / transformout=( missonly mean );

Run;

What precautions should I take while imputing missing values ?

While imputing missing values with mean in method 1 or method 2, one need to be very careful of presence outliers. The presence of outliers (extreme low or high) values might skew the mean and lead the wrong imputation of missing values. Hence it is advised to first check the outliers and then only impute missing values with mean. Median, on the other hand, remains unaffected due to outliers, and hence median can be preferred over mean, in case outliers are there.

How to I check for outliers ?


Wait for it ! Soon, we are going to write and article on the outlier detection.


Enjoy reading our other articles and stay tuned with ...

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.

No comments:

Post a Comment

Do provide us your feedback, it would help us serve your better.