Cluster Analysis in R

R Tutorial 16.0


Cluster Analysis is a statistical technique for unsupervised learning: it works only with X (independent) variables and has no Y (dependent) variable.

We covered the topic at length in a series of SAS-based articles (including video tutorials); let's now explore the same on the R platform.

_____________________________________________________________________________
Links to the previous articles in our Cluster Analysis series. We strongly recommend going through all of them (or at least the first two, if that is not possible) in order to understand Cluster Analysis thoroughly.


_______________________________________________________________________________

Hopefully you have taken our advice positively, and you now know that Cluster Analysis can be performed in two ways:

1. Hierarchical Cluster Analysis
2. K-Means Cluster Analysis

Let's try to learn both of these in R, one by one. But before that, let's use a very handy package to learn whether the data is clusterable at all; if it is, we will use the same data for our analysis.

# Let's first clean the Global Environment (Ask Analytics promotes clean environment, both literally and figuratively)

rm(list = ls())

# Let's first create a replica of R in-built data mtcars 
data_1 = mtcars

# As cluster analysis is a distance-based algorithm, we need to make all the variables scale-free (all these points have been discussed in the articles listed above)
data_2 = as.data.frame(scale(data_1))
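As a quick sanity check, standardizing with scale() should leave every column with mean ~0 and standard deviation 1 (a minimal, self-contained sketch with its own copy of the data):

```r
# Self-contained check that scale() standardizes each column of mtcars
data_2 <- as.data.frame(scale(mtcars))
round(colMeans(data_2), 10)      # every column mean is ~0
round(apply(data_2, 2, sd), 10)  # every column sd is 1
```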

# Let's perform the Clusterablity test on the data
if(!require(clustertend)) install.packages("clustertend")
library(clustertend)
set.seed(100)
hopkins(data_2, n = nrow(data_2)-1)

# A Hopkins statistic well below the 0.5 threshold suggests the data is clusterable, as in our case
# Since the data looks clusterable, we can go ahead ...
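If you are curious what the test computes under the hood, here is a minimal base-R sketch of the Hopkins statistic, written in the same convention as above (values near 0 suggest clustered data, values near 0.5 suggest uniformly random data). The function name `hopkins_sketch` and the bounding-box sampling are our own illustrative choices, not part of the clustertend package:

```r
# Hedged sketch of the Hopkins statistic (low value => clusterable)
hopkins_sketch <- function(X, m = nrow(X) - 1, seed = 100) {
  set.seed(seed)
  X <- as.matrix(X)
  n <- nrow(X); d <- ncol(X)
  # m artificial points drawn uniformly inside the data's bounding box
  U <- sapply(1:d, function(j) runif(m, min(X[, j]), max(X[, j])))
  # m real points sampled without replacement
  idx <- sample(n, m)
  nn_dist <- function(p, Y) min(sqrt(colSums((t(Y) - p)^2)))
  u <- apply(U, 1, nn_dist, Y = X)  # uniform point -> nearest data point
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))  # data -> nearest other data point
  sum(w) / (sum(u) + sum(w))        # near 0 => clustered, near 0.5 => random
}
hopkins_sketch(scale(mtcars))
```

Clustered data has small within-data nearest-neighbour distances (w), which pushes the ratio towards 0.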

Hierarchical Cluster Analysis (HCA)

HCA starts with no assumption about the number of clusters; we decide it in the course of the analysis itself.

# First we get a distance matrix
dm = dist(as.matrix(data_2),method = "euclidean")  

# and then on distance matrix, we do the hierarchical clustering
heir_clust = hclust(dm, method="ward.D")
            
# You can use other methods too, such as single, complete, or centroid
# Now let's plot a dendrogram
plot(heir_clust)


# The dendrogram is used to decide the number of clusters that should be made.
# Draw a horizontal line cutting the dendrogram just below the point where the
# height (which represents inter-cluster variance) starts increasing sharply,
# and count the branches the line crosses. In our plot such a cut crosses 3
# branches, so the number of clusters is 3.
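The same "sharp increase" rule can be checked numerically from the merge heights stored in the hclust object. A hedged sketch (it rebuilds the pipeline so it runs on its own; the 1:9 search range is an arbitrary choice):

```r
# Self-contained: find the largest gap between successive merge heights
data_2 <- as.data.frame(scale(mtcars))
heir_clust <- hclust(dist(as.matrix(data_2), method = "euclidean"), method = "ward.D")
heights <- sort(heir_clust$height, decreasing = TRUE)
gaps <- -diff(heights)                   # drop in height between successive cuts
k_candidate <- which.max(gaps[1:9]) + 1  # cutting between heights k-1 and k yields k clusters
k_candidate
```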

# Let's cut the tree into 3 clusters
Cluster_id = cutree(heir_clust, k= 3)

# draw dendrogram with red borders around the 3 clusters
rect.hclust(heir_clust, k=3, border="red") 

# Let's populate Cluster_id in the data itself.
Result = cbind(data_1,Cluster_id)
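With the cluster ids attached, a common next step is to profile the clusters, e.g. the mean of every variable per cluster (a sketch that rebuilds Result so it runs standalone):

```r
# Self-contained: rebuild Result and profile the 3 clusters
data_2 <- as.data.frame(scale(mtcars))
heir_clust <- hclust(dist(as.matrix(data_2)), method = "ward.D")
Result <- cbind(mtcars, Cluster_id = cutree(heir_clust, k = 3))
aggregate(. ~ Cluster_id, data = Result, FUN = mean)  # one row of means per cluster
```

Comparing the per-cluster means against the overall means is a quick way to give each cluster a business interpretation.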


# and we are done with hierarchical clustering; let's now understand k-means clustering.




K-Means Cluster Analysis

K-Means, or non-hierarchical cluster analysis, starts with an assumed number of clusters, and that assumed number is "K".

#Initial steps remain same
rm(list = ls())
data_1 = mtcars
data_2 = as.data.frame(scale(data_1))

# Let's say we consider K = 3 and run k-means cluster analysis

set.seed(100)  # k-means starts from random centers, so set a seed for reproducible results
kmeans.fit <- kmeans(data_2, 3)
attributes(kmeans.fit)
kmeans_cluster = kmeans.fit$cluster
data_3 = cbind(data_2, kmeans_cluster)
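Beyond the cluster vector, the kmeans fit exposes other useful components for judging fit quality. A hedged, self-contained sketch (set.seed and nstart = 25, which reruns the algorithm from 25 random starts and keeps the best fit, are our additions):

```r
# Self-contained: inspect the quality of a 3-cluster k-means fit
set.seed(100)
kfit <- kmeans(as.data.frame(scale(mtcars)), centers = 3, nstart = 25)
kfit$size                              # observations per cluster
round(kfit$betweenss / kfit$totss, 3)  # share of total variance captured between clusters
```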

That's it ... The only question left unanswered is: how to decide the "K"?

Well, there are many methods, but the one I prefer the most is the scree plot:

# Use the following command

X = (nrow(data_2)-1)*sum(apply(data_2, 2, var))  # total within-group SS, i.e. the K = 1 case
for (i in 2:10) {
  X[i] = sum(kmeans(data_2, centers = i)$withinss)
}
plot(1:10, X, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

# and you will get the scree plot

# As the number of clusters increases, within-cluster variance goes down and
# inter-cluster variance goes up. We witnessed inter-cluster variance in the
# dendrogram above; in the scree plot, within-cluster variance comes into play.
# It reaches an optimally low level at the 3rd point, beyond which it does not
# decrease as sharply. So K = 3 is a good choice, though the choice is a
# little subjective.
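Since we ran both methods on the same data, a cross-tabulation shows how closely their cluster assignments agree. Cluster labels are arbitrary, so look for rows and columns dominated by a single cell rather than a literal diagonal (a self-contained sketch; set.seed and nstart = 25 are our additions):

```r
# Self-contained: compare hierarchical and k-means assignments on scaled mtcars
data_2 <- scale(mtcars)
hca_id <- cutree(hclust(dist(data_2), method = "ward.D"), k = 3)
set.seed(100)
km_id <- kmeans(data_2, centers = 3, nstart = 25)$cluster
table(hca_id, km_id)  # one dominant cell per row/column means the methods agree
```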

Further reading:

1. Why K-Means Cluster Analysis?
2. Difference between K-Means and Hierarchical Clustering - Usage Optimization


Enjoy reading our other articles and stay tuned with us.

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.