### R Tutorial 16.0

Cluster Analysis is a statistical technique for unsupervised learning, which works only with X variables (independent variables) and no Y variable (dependent variable).

We covered the topic at length in a series of SAS-based articles (including video tutorials); let's now explore the same on the R platform.

**_____________________________________________________________________________**

**Below are links to our previous articles on Cluster Analysis. We strongly recommend going through all of them (or, if that is not possible, at least the first two) in order to understand Cluster Analysis thoroughly.**

*Cluster Analysis - Ready Reckoner*

*Hierarchical Clustering - Part 1 - Video Tutorial*

*Hierarchical Clustering - Part 2 - Video Tutorial*

*K-Means Clustering - Part 1 - Video Tutorial*

*K-Means Clustering - Part 2 - Video Tutorial*

*Forming cluster with categorical data*

**_______________________________________________________________________________**

Hope you have taken our advice positively and now know that Cluster Analysis can be performed in two ways:

*1. Hierarchical Cluster Analysis*

*2. K-Means Cluster Analysis*

Let's try to learn both of these in R, one by one. But before that, let's use a package that is very handy for checking whether the data is **clusterable** or not. If it is, we will use the same data for our analysis.

# Let's first clean the Global Environment

**(Ask Analytics promotes clean environment, both literally and figuratively)**

**rm(list = ls())**

**# Let's first create a replica of R in-built data mtcars**

data_1 = mtcars

**# As Cluster analysis is distance based algorithm, we need to make all the variables scale free ( All these points have been discussed in above listed articles)**

data_2 = as.data.frame(scale(data_1))

**# Let's perform the Clusterablity test on the data**

if(!require(clustertend)) install.packages("clustertend")

library(clustertend)

hopkins(data_2, n = nrow(data_2)-1)

# If the H value is well below the threshold of 0.5, the data shows a clustering tendency, as in our case
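To build intuition for what `hopkins()` measures, here is a rough base-R sketch of the idea behind the Hopkins statistic: compare nearest-neighbour distances among real points against distances from uniformly scattered points to the real data. This is illustrative only (the sample size `m` and seed are our choices, and `clustertend::hopkins()` differs in implementation details):

```r
set.seed(42)

data_2 <- as.data.frame(scale(mtcars))
X <- as.matrix(data_2)
m <- 10  # number of sampled points (our arbitrary choice)

# nearest-neighbour distance from point p to the rows of X (excluding the zero self-distance)
nn_dist <- function(p, X) {
  d <- sqrt(rowSums((X - matrix(p, nrow(X), ncol(X), byrow = TRUE))^2))
  min(d[d > 0])
}

# w: distances from uniformly generated points to their nearest real point
rand_pts <- apply(X, 2, function(col) runif(m, min(col), max(col)))
w <- apply(rand_pts, 1, nn_dist, X = X)

# u: distances from sampled real points to their nearest real neighbour
real_pts <- X[sample(nrow(X), m), ]
u <- apply(real_pts, 1, nn_dist, X = X)

# In this convention, H well below 0.5 suggests a clustering tendency:
# real points sit much closer to each other than random points do to the data.
H <- sum(u) / (sum(u) + sum(w))
H
```

If the data were uniformly scattered, `u` and `w` would be of similar magnitude and H would hover around 0.5.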

# Now that the data is clusterable, we can go ahead ...

**Hierarchical Cluster Analysis (HCA)**

**HCA starts with no assumption about the number of clusters; we decide it during the course of the analysis itself.**

**# First we get a distance matrix**

dm = dist(as.matrix(data_2),method = "euclidean")
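To see what `dist()` actually produces, here is a small illustration on just the first few scaled rows of mtcars (the subset of 4 rows is our choice for readability):

```r
data_2 <- as.data.frame(scale(mtcars))

# pairwise Euclidean distances between the first four (scaled) cars;
# dist() returns the lower triangle of the full distance matrix
d <- round(dist(as.matrix(data_2[1:4, ]), method = "euclidean"), 2)
print(d)
```

For n points, `dist()` stores n*(n-1)/2 pairwise distances; here 4 points give 6 distances.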

**# and then on distance matrix, we do the hierarchical clustering**

heir_clust = hclust(dm, method="ward.D")


**# you can use other methods too such as single, complete, centroid**
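As a quick sketch of how the linkage method matters, the loop below clusters the same distance matrix with several methods and compares the resulting 3-cluster sizes (the choice of k = 3 here just mirrors the analysis below; note that "centroid" linkage can produce inversions in the tree):

```r
data_2 <- as.data.frame(scale(mtcars))
dm <- dist(as.matrix(data_2), method = "euclidean")

# same data, different linkage methods -> possibly different clusters
for (m in c("single", "complete", "centroid", "ward.D")) {
  hc <- hclust(dm, method = m)
  cl <- cutree(hc, k = 3)
  cat(m, "- cluster sizes:", table(cl), "\n")
}
```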

**# Now let's plot a dendrogram**

plot(heir_clust)

# The dendrogram is used to decide the number of clusters to form. We draw a horizontal line cutting the dendrogram just below the point where the **height, which represents the inter-cluster variance**, starts increasing sharply, and then count the clusters. Here the red line is the optimal cut, as the blue line passes through the zone of sharp variance increase; with the red line, the number of clusters is 3.


**# let's cut tree into 3 clusters**

Cluster_id = cutree(heir_clust, k= 3)

**# draw dendrogram with red borders around the 3 clusters**

rect.hclust(heir_clust, k=3, border="red")

**# Let's populate Cluster_id in the data itself.**

Result = cbind(data_1,Cluster_id)
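Once `Cluster_id` is attached, a natural next step (our suggestion, not part of the original walkthrough) is to profile the clusters, i.e. look at the mean of each variable per cluster to interpret what each cluster represents:

```r
# rebuild the clustering so this sketch is self-contained
data_1 <- mtcars
data_2 <- as.data.frame(scale(data_1))
hc <- hclust(dist(as.matrix(data_2), method = "euclidean"), method = "ward.D")
Cluster_id <- cutree(hc, k = 3)
Result <- cbind(data_1, Cluster_id)

# mean of every variable within each cluster, on the original (unscaled) data
profiles <- aggregate(. ~ Cluster_id, data = Result, FUN = mean)
print(round(profiles, 2))
```

Profiling on the original scale (data_1, not data_2) keeps the cluster means interpretable in the variables' own units.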

**# and we are done with hierarchical clustering, let's now understand k-means clustering.**

**K-Means Cluster Analysis**

K-Means, or non-hierarchical, cluster analysis starts with an assumed number of clusters, and that assumed number is "K".

#Initial steps remain same

rm(list = ls())

data_1 = mtcars


data_2 = as.data.frame(scale(data_1))

**# Let's say, we consider K = 3 and we run k-means cluster analysis**

kmeans.fit <- kmeans(data_2, 3)

attributes(kmeans.fit)

kmeans_cluster = kmeans.fit$cluster

data_3 = cbind(data_2, kmeans_cluster)
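One caveat worth noting: `kmeans()` starts from randomly chosen centers, so repeated runs can give different cluster labels and even different partitions. A common precaution (our addition, with an arbitrary seed) is to fix the seed and use several random starts via `nstart`:

```r
data_2 <- as.data.frame(scale(mtcars))

set.seed(123)  # seed value is arbitrary; it only makes the run reproducible
# nstart = 25 runs k-means from 25 random starts and keeps the best solution
kmeans.fit <- kmeans(data_2, centers = 3, nstart = 25)

table(kmeans.fit$cluster)  # cluster sizes
```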

That's it ... The only question left unanswered is: how to decide the "K"?

Well, there are many methods, but the one I prefer the most is the scree plot:

**# Use the following command**

X = (nrow(data_2)-1)*sum(apply(data_2,2,var))

for (i in 2:10) {

X[i] = sum(kmeans(data_2,centers=i)$withinss)

}

plot(1:10, X, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
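The same scree computation can be written a bit more idiomatically with `sapply` instead of growing a vector inside a loop; the seed and `nstart` below are our additions to make the curve stable between runs:

```r
data_2 <- as.data.frame(scale(mtcars))
set.seed(123)  # arbitrary seed, for reproducibility

wss <- c(
  # k = 1: total sum of squares of the scaled data
  (nrow(data_2) - 1) * sum(apply(data_2, 2, var)),
  # k = 2..10: within-cluster sum of squares from k-means
  sapply(2:10, function(k) sum(kmeans(data_2, centers = k, nstart = 10)$withinss))
)

plot(1:10, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

The "elbow" of this curve, where adding another cluster stops reducing the within-group sum of squares appreciably, is the usual choice for K.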

**# and you will get the scree plot**


*Why K-Means Cluster Analysis?*

*Difference between K-Means and Hierarchical Clustering - Usage*

*Optimization*

Enjoy reading our other articles and stay tuned with us.

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.