Linear Regression with R - Model assumptions check-up

R Tutorial 15.0


In the previous blog, we learned how to build a linear regression model. The only thing left was the necessary testing of certain assumptions. Testing these assumptions is equivalent to a health check-up of the model; a healthy model is supposed to be a robust model.

Let's play doctor-doctor!



This article has been written in continuation of our previous article: Linear Regression with R

There are 4 (a few statisticians consider 5) important assumptions for Linear Regression :

1.  Assumption of Linearity : 

The assumption of linearity is checked during the course of modeling itself, by plotting an x-y scatter plot of the dependent variable against one independent variable at a time. We check for a linear pattern and, if necessary, apply transformations (e.g. log, square, square root etc.), as sketched below.
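A minimal sketch of such a check is shown here; the data frame df and the columns y and x1 are hypothetical placeholders, not objects from the article:

# Hypothetical sketch: df, y and x1 are placeholder names, not from the article
plot(df$x1, df$y, main = "y vs x1")            # look for a roughly linear pattern
plot(log(df$x1), df$y, main = "y vs log(x1)")  # try a transformation if the pattern looks curved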

2.  No Multicollinearity: 

This is also considered an assumption of a linear regression model; I consider it important, but not one of the 4 main assumptions. Anyway, in order to satisfy it, we check the VIF and drop variables during the course of modeling itself, as shown in the quick reminder below.
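As a quick reminder, the check could look like this, using vif() from the car package on the model_2 object from the previous article (the cut-off of roughly 5-10 is a common rule of thumb, not a fixed rule):

library(car)   # vif() lives in the car package
vif(model_2)   # values above roughly 5-10 are commonly treated as a multicollinearity warning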


The above two assumptions were already checked in the previous article:

Linear Regression with R


3.  Normality of residual term : 


It is time to do a post-model assumption check. There are two ways to check the normality of the residual term:
Method 1 :

library(car)                      # qqPlot() (and the tests below) come from the car package
layout(matrix(c(1), 1, 1))        # single plotting panel
qqPlot(model_2, main = "QQ Plot")

All the residuals should lie on the straight red line (as much as possible) and within the dotted red confidence band to ensure the normality of errors.







Method 2 :

res <- residuals(model_2)   # extract the residuals from the fitted model
shapiro.test(res)           # Shapiro-Wilk test: H0 = residuals are normally distributed

The null hypothesis of the Shapiro-Wilk test is normality, and hence, in order not to reject the null hypothesis (in simpler terms: to accept the null hypothesis), the p-value should be > 0.05 (at the 5% significance level). The residuals are normal here.

4.  Homogeneity of variance of residuals (Homoscedasticity): 


We can check it using the Breusch-Pagan test (no need to remember such names), the null hypothesis of which is homoscedasticity.

ncvTest(model_2)   # non-constant variance score test, from the car package loaded above

Here too, the p-value should be > 0.05 (at the 5% significance level) in order to accept the null hypothesis of homoscedasticity. The model seems to be in trouble in this test; a common remedy is sketched below.
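One common remedy is to apply a variance-stabilizing transformation to the dependent variable and re-fit. The sketch below assumes the response is strictly positive; y, x1, x2 and df are hypothetical placeholders, not the actual variables from the article:

# Hypothetical sketch: re-fit with a log-transformed response and re-run the test
model_2_log <- lm(log(y) ~ x1 + x2, data = df)   # assumes y is strictly positive
ncvTest(model_2_log)                             # p > 0.05 would now support homoscedasticity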

5.  Independence of residuals (No auto-correlation in residuals): 


durbinWatsonTest(model_2)   # Durbin-Watson test, from the car package


The D-W statistic lies between 0 and 4; the closer its value is to 2, the better the model is.

Here, a value approaching 3 is a sign of some traces of auto-correlation in the residual term.


Let's go the extra mile

# Use the commands below and you will get the set of 4 diagnostic plots in one window

par(mfrow = c(2, 2))   # 2 x 2 grid so all four plots fit on one page
plot(model_2)
par(mfrow = c(1, 1))   # reset to a single plotting panel afterwards

The first (Residuals vs Fitted) and third (Scale-Location) plots should look random in nature, with no pattern. The second (Normal Q-Q) depicts the normality of errors, as explained above. The fourth (Residuals vs Leverage) gives you a fair idea about the presence of influential outliers that should be treated to improve the model.
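If the fourth plot flags suspicious points, here is a hedged sketch of how one might list them using Cook's distance (the 4/n cut-off is just one common rule of thumb, not the only choice):

# Flag observations with unusually large Cook's distance (rule of thumb: > 4/n)
cd <- cooks.distance(model_2)
influential <- which(cd > 4 / length(cd))
influential   # row indices worth inspecting (and possibly treating) before re-fitting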

Enjoy reading our other articles and stay tuned with us.

Kindly provide your feedback in the 'Comments' section and share as much as possible.
