R Tutorial 16.0
Cluster Analysis is an unsupervised statistical learning technique: it works only with X variables (independent variables) and has no Y variable (dependent variable).
We covered the topic at length in a series of SAS-based articles (including video tutorials); let's now explore the same on the R platform.
_____________________________________________________________________________
Links to our previous articles on Cluster Analysis; we strongly recommend going through all of them (or at least the first two) in order to understand Cluster Analysis thoroughly.
Cluster Analysis - Ready Reckoner
Hierarchical Clustering - Part 1 - Video Tutorial
Hierarchical Clustering - Part 2 - Video Tutorial
K-Means Clustering - Part 1 - Video Tutorial
K-Means Clustering - Part 2 - Video Tutorial
Forming cluster with categorical data
_______________________________________________________________________________
Hopefully you have taken our advice, and you now know that Cluster Analysis can be performed in two ways:
1. Hierarchical Cluster Analysis
2. K-Means Cluster Analysis
Let's learn both of these in R one by one. But before that, let's use a handy package to check whether the data is clusterable at all; if it is, we will use the same data for our analysis.
# Let's first clean the Global Environment (Ask Analytics promotes a clean environment, both literally and figuratively)
rm(list = ls())
# Let's create a replica of the R in-built dataset mtcars
data_1 = mtcars
# As cluster analysis is a distance-based algorithm, we need to make all the variables scale-free (all these points have been discussed in the articles listed above)
data_2 = as.data.frame(scale(data_1))
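As a quick sanity check (an extra step, not in the original flow), you can verify that the scaling worked: every column of data_2 should now have mean ~0 and standard deviation 1.
# Sanity check: after scale(), each column should have mean ~0 and sd = 1
round(colMeans(data_2), 10)   # means, effectively zero up to floating point
apply(data_2, 2, sd)          # standard deviations, all equal to 1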
# Let's perform the clusterability test on the data
if(!require(clustertend)) install.packages("clustertend")
library(clustertend)
hopkins(data_2, n = nrow(data_2)-1)
# If the H value is below the 0.5 threshold, the data is considered clusterable, as in our case
# Now that the data is clusterable, we can go ahead ...
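To get a feel for the statistic, a quick hedged check is to run the same test on uniformly random data of the same shape; with no cluster structure, H should hover around 0.5.
# Baseline check: the Hopkins statistic on structureless (uniform random)
# data of the same dimensions should come out near 0.5
set.seed(123)   # for a reproducible random draw
random_data = as.data.frame(matrix(runif(nrow(data_2) * ncol(data_2)), nrow = nrow(data_2)))
hopkins(random_data, n = nrow(random_data) - 1)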
Hierarchical Cluster Analysis
Hierarchical Cluster Analysis (HCA) starts with no assumption about the number of clusters; we decide it in the course of the analysis itself.
# First we get a distance matrix
dm = dist(as.matrix(data_2),method = "euclidean")
# and then on distance matrix, we do the hierarchical clustering
heir_clust = hclust(dm, method="ward.D")
# you can use other linkage methods too, such as single, complete, or centroid
# Now let's plot a dendrogram
plot(heir_clust)
# The dendrogram is used to decide the number of clusters to form. We draw a
# horizontal line cutting the dendrogram at the point beyond which the height
# (which represents the inter-cluster variance) starts increasing sharply, and
# then count the branches the line cuts. Here the red line is the optimum cut,
# as the blue line passes through the zone of sharp variance increase; with the
# red line, the number of clusters is 3.
[Dendrogram with two candidate cut lines: the blue line passes through the zone of sharp height increase; the red line is the optimum cut, yielding 3 clusters]
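If eyeballing the cut feels subjective, here is a small hedged aid (not part of the original recipe): the hclust object stores the height of every merge, so plotting the last few heights makes the sharp jump easy to spot.
# Heights of the final merges; a sharp jump between successive heights
# marks where dissimilar clusters start being forced together
last_merges = tail(heir_clust$height, 10)
plot(last_merges, type = "b", xlab = "Merge step (last 10)", ylab = "Merge height")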
Cluster_id = cutree(heir_clust, k= 3)
# draw the dendrogram with red borders around the 3 clusters
rect.hclust(heir_clust, k=3, border="red")
# Let's add the Cluster_id to the data itself.
Result = cbind(data_1,Cluster_id)
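As a follow-up sketch (again, not part of the original recipe), you can profile the clusters on the original, unscaled variables to see what distinguishes them:
# Mean of each original (unscaled) variable per cluster
aggregate(data_1, by = list(Cluster = Cluster_id), FUN = mean)
# Cluster sizes
table(Cluster_id)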
# and we are done with hierarchical clustering; let's now understand k-means clustering.
K-Means Cluster Analysis
K-Means or non-hierarchical cluster analysis, in contrast, starts with an assumed number of clusters, and that assumed number is "K".
# Initial steps remain the same
rm(list = ls())
data_1 = mtcars
data_2 = as.data.frame(scale(data_1))
# Let's say we consider K = 3 and run the k-means cluster analysis
set.seed(123)   # k-means starts from random centers; fixing the seed makes results reproducible
kmeans.fit <- kmeans(data_2, 3)
attributes(kmeans.fit)                # lists the components available in the kmeans fit object
kmeans_cluster = kmeans.fit$cluster   # cluster assignment for each observation
data_3 = cbind(data_2, kmeans_cluster)
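For a quick read on the fit, the kmeans object exposes the pieces you usually want:
kmeans.fit$size       # number of observations in each cluster
kmeans.fit$centers    # cluster centers (in scaled units, since we used data_2)
kmeans.fit$withinss   # within-cluster sum of squares, one value per cluster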
That's it ... The only question left unanswered is: how to decide the "K"?
Well, there are many methods, but the one I prefer most is the scree plot: look for the "elbow", the point beyond which the within-groups sum of squares stops dropping sharply.
# Use the following commands
# Within-groups sum of squares for K = 1 (all observations in a single cluster)
X = (nrow(data_2) - 1) * sum(apply(data_2, 2, var))
# For K = 2 to 10, store the total within-cluster sum of squares
set.seed(123)   # k-means uses random starting centers; fix the seed for reproducibility
for (i in 2:10) {
  X[i] = sum(kmeans(data_2, centers = i)$withinss)
}
plot(1:10, X, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
# and you will get the scree plot
[Scree plot: within-groups sum of squares vs. number of clusters]
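Reading the elbow is ultimately a judgment call, but as a rough, hedged heuristic you can also plot how much WSS each extra cluster buys; the elbow is where these drops level off.
# Rough heuristic (a sketch, not a rule): the improvement gained by going
# from k-1 clusters to k clusters; look for where the drops flatten out
drops = -diff(X)
plot(2:10, drops, type = "b", xlab = "Number of Clusters", ylab = "Drop in within-groups SS")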
Why K-Means Cluster Analysis?
Difference between K-Means and Hierarchical Clustering - Usage Optimization
Enjoy reading our other articles and stay tuned with us.
Kindly do provide your feedback in the 'Comments' Section and share as much as possible.