How can multicollinearity be hazardous to your model?


All about Multicollinearity

In almost all popular techniques, e.g. Linear Regression, Logistic Regression and Cluster Analysis, it is advised to check for multicollinearity and remove it. Why? Is it important at all, or can we skip it? How does it affect the model?

Let's explore the answers to all these questions. The secret of multicollinearity revealed!

First of all, let's understand what "multicollinearity" is.

Suppose that in an analysis we consider various variables: a, b, c, d, e, f, ...

If there is high correlation between pairs of variables, or if one variable is intrinsically a derivative of a combination of other variables, there is multicollinearity in the data.

Example:
Suppose you consider GDP, Disposable Income, Money Supply, Temperature, Precipitation, WDD (Weather Driven Demand) or Weather Index, Seasonality Index, Trend, Calamity variables, Penetration and other distribution variables for a regression model.

When you check the correlation matrix of the variables, you would probably find GDP and Disposable Income highly correlated, as both are indicators of economic health and inherently go hand in hand.

Also, WDD and Weather Index are derived from Temperature, Precipitation and other weather variables, so including all of them together doesn't look right even to common sense, let alone statistics.

How can we detect it?

Well, detecting multicollinearity is child's play.

1. If you are working on an unsupervised learning model such as Cluster Analysis, multicollinearity can be detected using a correlation matrix. Wherever the absolute value of the correlation is more than 0.6 or 0.7 (a rule of thumb), we can say that the variables are highly correlated and multicollinearity exists.
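As a rough illustration of the correlation-matrix check, here is a minimal Python sketch using NumPy (our assumption; this article itself works in SAS). The data is synthetic, the names gdp, income and temp are made up for the example, and the 0.7 cut-off is simply the rule of thumb mentioned above.

```python
import numpy as np

def correlated_pairs(data, names, threshold=0.7):
    """Return (name_i, name_j, r) for every pair of columns with |r| > threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 3)))
    return pairs

# Synthetic, made-up data: "income" is built to track "gdp"; "temp" is independent.
rng = np.random.default_rng(0)
gdp = rng.normal(size=200)
income = gdp + 0.3 * rng.normal(size=200)
temp = rng.normal(size=200)
data = np.column_stack([gdp, income, temp])

flagged = correlated_pairs(data, ["gdp", "income", "temp"])
print(flagged)  # pairs whose absolute correlation exceeds the threshold
```

Only the gdp/income pair should be flagged here, because that is the only dependence built into the data.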

Supervised learning: a technique in which Y (the dependent variable) is modeled on X (the independent variables).
Unsupervised learning: there is no Y variable.

2. In the case of a supervised learning model such as Linear or Logistic Regression, we can detect multicollinearity using the variance inflation factor (VIF) in Proc Reg. We can do it using a correlation matrix too, but VIF is better.

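Why is VIF better?

A correlation matrix checks only one-to-one (pairwise) correlation, whereas VIF checks multicollinearity more thoroughly, in a multivariate fashion: each X variable is regressed on all the other X variables, and VIF = 1 / (1 - R-Square) of that regression. A variable that is a combination of several others can therefore show a high VIF even when none of its pairwise correlations is high.

Syntax used:

/* Request the variance inflation factor of each independent variable */
Proc Reg data = masterdata;
Model y = a b c d e f / vif;
Run;
Quit;
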
Generally we consider a VIF of more than 10 to be indicative of multicollinearity. If we get two variables with high VIF, we drop only one of them at a time (the one with the highest VIF, or one chosen using business acumen) and check again (if we are using the elimination method).

Although it is not common practice, nothing stops you from using the VIF method for Cluster Analysis too. How, when we don't have a Y variable? Well, you can fake it. Create a fake Y variable and run Proc Reg. You will see that Y is not involved at all in the calculation of VIF, which depends only on the X variables.
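To make the VIF idea concrete outside SAS, here is a minimal Python/NumPy sketch (our assumption, not the Proc Reg implementation). It computes VIF = 1 / (1 - R-Square) by regressing each variable on the others, then drops the highest-VIF variable one at a time until everything is at or below 10. The helper names vif and drop_high_vif and the synthetic data are hypothetical, chosen only for illustration; here e is built as (almost) the sum of a, b, c and d.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X.

    Column j is regressed on all other columns (plus an intercept) and
    VIF_j = 1 / (1 - R_j^2). No dependent variable Y is needed.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

def drop_high_vif(X, names, threshold=10.0):
    """Drop the highest-VIF variable, one at a time, until all VIFs <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        factors = vif(X)
        worst = int(np.argmax(factors))
        if factors[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names

# Synthetic data: e is (almost) the sum of four independent variables.
rng = np.random.default_rng(0)
n = 1000
a, b, c, d = rng.normal(size=(4, n))
e = a + b + c + d + 0.1 * rng.normal(size=n)
X = np.column_stack([a, b, c, d, e])
names = ["a", "b", "c", "d", "e"]

corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.abs(corr - np.eye(5)).max()  # largest off-diagonal |r|
initial_vif = vif(X)
kept = drop_high_vif(X, names)

print("largest pairwise |r|:", round(float(max_pairwise), 2))
print("VIFs:", dict(zip(names, np.round(initial_vif, 1))))
print("kept after elimination:", kept)
```

In this constructed case no pairwise correlation reaches the 0.6 rule of thumb, yet every VIF is far above 10, which is the sense in which VIF is the more thorough check. Note also that vif takes only the X variables, in line with the point that Y plays no part in the calculation.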

3. The third method is applicable globally, to supervised as well as unsupervised learning techniques. It is to use variable reduction techniques: Factor Analysis, Principal Component Analysis and Variable Clustering.
For more details on this, read our article on Variable Reduction Techniques.
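Of the techniques named in point 3, Principal Component Analysis is the easiest to sketch in a few lines. The following is a minimal Python/NumPy illustration under our own assumptions: the data is synthetic, and weather_index is deliberately built from temp and precip so that three columns carry roughly two variables' worth of information.

```python
import numpy as np

# Synthetic data: weather_index is derived from temp and precip.
rng = np.random.default_rng(1)
n = 300
temp = rng.normal(size=n)
precip = rng.normal(size=n)
weather_index = 0.7 * temp + 0.7 * precip + 0.1 * rng.normal(size=n)
X = np.column_stack([temp, precip, weather_index])

# PCA on the correlation matrix: eigenvalues give the variance captured
# by each component, eigenvectors give the loadings.
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]            # re-sort, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()

# Keep the first two components as replacement, uncorrelated variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each column
components = Z @ eigvecs[:, :2]

print("share of variance per component:", np.round(explained, 3))
```

The first two components should account for nearly all of the variance, and they are uncorrelated with each other by construction, so they can stand in for the three original columns.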

How does it affect the model?

Well, it affects our model in multiple ways:

a. It makes the model expensive. If we use redundant variables in the model and these variables come at a cost, aren't we increasing the cost of the model? This is most applicable in the case of Clinical Analysis.

b. It contributes to over-fitting of the model, because R-Square never decreases when a variable is added, whether or not the variable is redundant. The gap between R-Square and Adjusted R-Square increases, though, which is also indicative of redundant variables or multicollinearity.

c. In Linear and Logistic Regression, the beta coefficients (estimates) are not truly representative if multicollinearity is present.

Suppose a and b are two highly correlated variables and we include both of them in the model.
If the contribution (coefficient) of "a" alone is 10, it is possible that when you use both variables together the coefficient of "a" comes out as +15 and that of "b" as -5; the resultant, though, remains the same. The beta coefficients adjust among themselves, but are no longer true representatives. This is difficult to explain in words; we will cover it soon in our case study on Linear Regression.
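Points b and c can be seen in a small simulation. The Python/NumPy sketch below is our own illustration, not the promised case study: the data is synthetic, b is a near-copy of a, and by construction only a drives y with a true coefficient of 10. The helper name fit_ols is hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with intercept; returns (coefficients, R^2, adjusted R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return coef, r2, adj_r2

rng = np.random.default_rng(42)
n = 200
a = rng.normal(size=n)
b = a + 0.02 * rng.normal(size=n)   # b is a near-copy of a
y = 10 * a + rng.normal(size=n)     # by construction only a drives y

coef_a, r2_a, adj_a = fit_ols(a.reshape(-1, 1), y)
coef_ab, r2_ab, adj_ab = fit_ols(np.column_stack([a, b]), y)

print("a alone : slope  =", np.round(coef_a[1:], 2))
print("a and b : slopes =", np.round(coef_ab[1:], 2),
      "sum =", round(float(coef_ab[1:].sum()), 2))
print("R^2     :", round(float(r2_a), 4), "->", round(float(r2_ab), 4))
print("R^2 - adjusted R^2:", round(float(r2_a - adj_a), 5),
      "->", round(float(r2_ab - adj_ab), 5))
```

With both variables in the model, the individual slopes of a and b can land well away from 10 and change with the random seed, while their sum stays close to the slope obtained with a alone. R-Square does not go down when b is added, but the gap to Adjusted R-Square widens.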
