The Signature Correlogram of R

R Tutorial 13.0

The feature of R that has won over millions of users, including me, is its superb infographics (graphical representation of information). The correlogram is one of its signature plots; it is used to visualize the correlation matrix of the variables in a data set.

Let's learn how to build one.

I am assuming that you already know what correlation is, but here is the definition anyway, just to brush up on the concept.

Correlation is a statistical measure indicating the extent to which two or more variables fluctuate together.

If the variables move in the same direction (as one increases, the other increases too), they are positively correlated; if they move in opposite directions (as one increases, the other decreases), the correlation is negative.
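To make the two directions concrete, here is a minimal sketch using R's built-in mtcars data set. The choice of column pairs (mpg with wt, hp with disp) is mine, purely for illustration.

```r
# Direction of correlation, illustrated with the built-in mtcars data set
r_negative <- cor(mtcars$mpg, mtcars$wt)   # heavier cars tend to have lower mpg
r_positive <- cor(mtcars$hp, mtcars$disp)  # bigger engines tend to have more horsepower

sign(r_negative)  # -1: the variables move in opposite directions
sign(r_positive)  #  1: the variables move in the same direction
```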

Do remember that correlation is just a mathematical measure of association; on its own it says nothing about a causal relationship. Let me put it this way: correlation doesn't imply causation!

Calculating correlations is one of the important steps of driver analysis when we perform a linear regression analysis.

I am no statistician, so I won't cover every variant of correlation. The two types I use most are:

1. Pearson correlation: measures the strength of the linear relationship between two continuous variables
2. Spearman correlation: a rank-based measure of monotonic relationship, suitable for ordinal (ranked) data, even if only one of the variables is ordinal
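The difference between the two shows up clearly on a relationship that always increases but is not a straight line. The sketch below uses made-up vectors x and y (not from any data set), so treat it as an illustration only.

```r
# A relationship that always increases but is far from linear
x <- 1:10
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # measures association between ranks

pearson   # less than 1, because the points do not lie on a straight line
spearman  # 1, because the ranks of x and y agree perfectly
```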

Without going into much detail, let's learn how to calculate the correlation between each pair of variables in a data set, and then how to visualize it.

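Let's prepare a data set with a few missing values:

# Keep the first four columns of mtcars: mpg, cyl, disp, hp
data_1 = mtcars[1:4]
# Introduce missing values in row 2 (columns cyl and disp)
data_1[2,2] = NA
data_1[2,3] = NA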

Correlation matrix:

# First, a simple correlation matrix
cor(data_1)

Notice that the correlation comes out as NA for any pair involving a variable with even one missing value. To let R calculate the correlation for all the variables, we can use the following commands:

cor(data_1, use="complete.obs")
# Drops every observation (row) that has a missing value in any column.

cor(data_1, use="pairwise.complete.obs")
# For each pair of variables, drops only the rows with a missing value in that pair, rather than dropping the row for every pair.
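A quick way to see the difference is to compare one cell of the two matrices. The sketch below rebuilds data_1 so that it runs on its own; the names listwise and pairwise are just labels I chose.

```r
# Rebuild the example data: first four columns of mtcars, with NAs in row 2
data_1 <- mtcars[1:4]
data_1[2, 2] <- NA
data_1[2, 3] <- NA

listwise <- cor(data_1, use = "complete.obs")
pairwise <- cor(data_1, use = "pairwise.complete.obs")

# mpg and hp have no missing values, so pairwise deletion uses all 32 rows ...
all.equal(pairwise["mpg", "hp"], cor(mtcars$mpg, mtcars$hp))

# ... while listwise deletion still throws away row 2 for this pair
all.equal(listwise["mpg", "hp"], cor(mtcars$mpg[-2], mtcars$hp[-2]))

# For pairs that involve cyl or disp, both options drop row 2 and agree
all.equal(listwise["mpg", "cyl"], pairwise["mpg", "cyl"])
```

With only one incomplete row the two matrices are close, but on data with missing values scattered across many rows, listwise deletion can discard far more observations than pairwise deletion.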

The third argument we can add is "method". The options are "pearson" (the default), "spearman" and "kendall".

cor(data_1, use="pairwise.complete.obs", method = "pearson")
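To compare the three methods side by side, one option is to loop over the method names for a single pair of variables. This is only a sketch; the vector name mpg_cyl and the choice of the mpg/cyl pair are my own.

```r
# Rebuild the example data: first four columns of mtcars, with NAs in row 2
data_1 <- mtcars[1:4]
data_1[2, 2] <- NA
data_1[2, 3] <- NA

methods <- c("pearson", "spearman", "kendall")

# Correlation between mpg and cyl under each method, skipping the missing row
mpg_cyl <- sapply(methods, function(m) {
  cor(data_1$mpg, data_1$cyl, use = "pairwise.complete.obs", method = m)
})

mpg_cyl  # a named vector with one coefficient per method
```

The three coefficients differ in size because they measure different things, but here they should all point in the same (negative) direction.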

The "cor" command is fine as such, but sometimes you need to test the statistical significance of a correlation along with its strength. In that case, you can use the rcorr function of the "Hmisc" package:

The input to the function must be a matrix, and missing values are handled by pairwise deletion.

library(Hmisc)
rcorr(as.matrix(data_1), type = "pearson")  # type can be "pearson" or "spearman" only

The output first gives the correlation matrix, then the number of observations used to calculate each correlation, and finally the p-values, which tell you the statistical significance of each correlation coefficient.
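If you only need the significance of a single pair and would rather not install an extra package, base R's cor.test does the same job for one pair at a time. This is an alternative to rcorr, not part of Hmisc; the sketch below rebuilds data_1 so that it runs on its own.

```r
# Rebuild the example data: first four columns of mtcars, with NAs in row 2
data_1 <- mtcars[1:4]
data_1[2, 2] <- NA
data_1[2, 3] <- NA

# cor.test drops incomplete pairs itself, so row 2 is left out here
test_result <- cor.test(data_1$mpg, data_1$cyl, method = "pearson")

test_result$estimate   # the correlation coefficient
test_result$parameter  # degrees of freedom (complete pairs minus 2)
test_result$p.value    # statistical significance of the coefficient
```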

Now let's move on to visualizing the correlation matrix, which is the interesting part of this blog:

The Correlogram >>>