Machine Learning Techniques - I
Machine Learning is a buzzword these days in the world of data science and analytics. R and Python have become popular because these tools are full of advanced machine learning techniques.
At Ask Analytics we have covered many basic machine learning techniques so far; now we are starting with the advanced ones!
What is Machine Learning?
Machine learning is a sub-field of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed". [wiki]
So basically you do not tell your computer much about the data/information, and it comes up with insights and findings from the data on its own.
Following are a few machine learning techniques that we have covered so far:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. Cluster Analysis
5. Market Basket Analysis (Apriori algorithm)
6. Principal Component Analysis / Factor Analysis
I know you must be expecting something like the image below for the term Machine Learning:
But let's face the reality.
Let's now focus on the following advanced machine learning techniques:
1. Random Forest
2. Boosting Techniques
3. Naive Bayes
4. KNN
5. Support Vector Machine
We might add more items to the list as and when we learn these techniques!
Now at least these names look quite fancy, right?
Before we tick off the first technique on the list, i.e. Random Forest, we had better understand the term Ensemble Learning.
What is Ensemble Learning?
In Ensemble Learning, we combine multiple models together and build a resultant model with better predictive power.
An individual model (also known as a weak learner) might have more errors, and the individual models might be quite different from each other, even though they are built on the same population of data. In ensemble learning, these models come together and bring forth a model that has less error as well as less variance in the errors across the model: a strong learner, a better model.
Even in our routine life we practice a lot of ensemble learning. For example, before taking a decision on your stocks' portfolio, you might consider advice from various experts; you don't follow one person's advice but rather make a suitable mix for yourself. Bingo! You are already an ensemble learning expert.
Ensemble models can be of various types, the most popular of which are Bagging and Boosting.
Bagging : In Bagging, various models are built in parallel on various samples, and then these models vote to give the final model and hence the prediction.
Let's spend a couple of minutes in order to understand the concept of Bagging:
Suppose we train a CART model on sample data and then test the model on 4 other samples (illustration shown above). We find that the terminal nodes of the model do not match across the different samples: basically, different opinions from different models. What do we do now?
We simply take the votes from each of the models, and the majority wins! Hence in the final model, the left-most terminal node says "No" with 25% error / 75% accuracy, and the second one says "Yes", again with 25% error / 75% accuracy.
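To make the voting concrete, here is a minimal sketch in R (with hypothetical "yes"/"no" predictions, not taken from any real model) of how the majority vote works:

# Hypothetical predictions from 4 models for one terminal node
votes = c("no", "no", "yes", "no")
# Majority vote: the most frequent prediction wins
final_prediction = names(which.max(table(votes)))
final_prediction                    # "no"
mean(votes != final_prediction)     # disagreement with the vote: 0.25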
Boosting : In Boosting, the models are built in series. In each successive model, the weights are adjusted based on the learning of the previous model. As the earlier model's correct and wrong predictions are known, the next model tries to improve further on the wrongly predicted part. The technique might lead to overfitting on the training data.
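As a rough sketch of the reweighting idea (the exact update rule differs from algorithm to algorithm, e.g. AdaBoost), suppose we know which observations the previous model got wrong:

# Equal observation weights before the next model is built
weights = rep(1/5, 5)
wrong = c(FALSE, TRUE, FALSE, TRUE, FALSE)   # previous model's mistakes
# Increase the weight of wrongly predicted observations, then renormalize
weights[wrong] = weights[wrong] * 2
weights = weights / sum(weights)
weights   # the mistakes now carry more weight for the next model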
We will understand Boosting in greater detail in the upcoming article on Boosting techniques.
What is Random Forest?
Random Forest is a bagging-style ensemble learning technique, but with a twist and a flavor of randomization. In Random Forest, the system builds multiple CART models on various random bootstrapped samples, and then each of these models votes democratically to choose the final prediction.
One thing that I would like to mention explicitly is that there is a dual flavor of randomization:
1. Random Sample Selection
2. Random Variables Selection
Let's see how it works and then we can better understand the literal meaning of its name:
How Random Forest works:
1. Suppose we have a data set with N observations and P + 1 variables, in which one variable is the target variable (Y) and the other P are independent variables.
2. Multiple random samples with replacement (bootstrapped) are drawn from the data, each with "n" observations (n is roughly 2/3 of N). The remaining 1/3rd of the data is used for simultaneous validation of the model.
3. On the validation part of the data, the misclassification rate is calculated (called the Out of Bag / OOB error). Finally the OOB rate is aggregated over all the models/trees, and the aggregated error rate is called the overall OOB rate.
This is one of the most beautiful features of the Random Forest technique: even if you do not split the data into Training and Validation parts, it does so by default.
4. Out of the P independent variables, "p" variables are selected at random at every node split, and then the best one to split on is chosen. The user can specify the number of variables to be considered (p); otherwise, for a classification model, p = √P by default.
There is a school of thought that says that since we are not analyzing all the variables at each node, we might not get the best split. But if we build a good number of trees, this problem more or less gets nullified.
5. The user is also supposed to specify the number of trees he/she wants the system to build in the forest.
6. Each tree is grown to the maximum extent possible. There is no pruning at all.
7. Finally, the results of all the tree models are considered as votes, and the final prediction is made based on the majority (as already illustrated in Bagging).
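To tie steps 1 to 7 together, here is a minimal conceptual sketch in R, built on rpart trees purely for illustration (the real randomForest package additionally picks p random variables at every node split, which plain rpart does not do):

library(rpart)   # CART trees, used here only to illustrate the idea

grow_forest = function(formula, data, ntree = 50) {
  lapply(1:ntree, function(i) {
    # Bootstrapped sample: n rows drawn with replacement (~2/3 unique)
    boot = data[sample(nrow(data), replace = TRUE), ]
    rpart(formula, data = boot, method = "class")   # one tree per sample
  })
}

predict_forest = function(trees, newdata) {
  # Collect each tree's class prediction and take the majority vote per row
  votes = sapply(trees, function(t) as.character(predict(t, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}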
Let's try running a Random Forest (RF) model; we will learn more about RF within the process itself.
Finally the Random Forest in R
We will try building the Random Forest model on the same data "Carseats" from the ISLR package that we have used for tree-based models.
rm(list = ls())
if (!require(ISLR)) install.packages("ISLR")
library(ISLR)
data_1 = Carseats
hist(data_1$Sales)
# Create a binary target: "yes" if Sales >= 8, else "no"
data_1$high = as.factor(ifelse(data_1$Sales >= 8, "yes", "no"))
# For building a classification model, do specify the Y variable as factor
data_1$Sales = NULL
head(data_1)
# 70/30 split into Training and Testing data
set.seed(222)
train = sample(1:nrow(data_1), nrow(data_1) * 0.7)
Training_data = data_1[train, ]
Testing_data = data_1[-train, ]
rm(data_1, train)
####################################################
# We would now install the randomForest package
####################################################
if (!require(randomForest)) install.packages("randomForest")
library(randomForest)
# Let us run the first random forest model, with 500 trees
set.seed(222)
RF_Model_1 = randomForest(high~.,data=Training_data, ntree=500)
print(RF_Model_1)
# Seed should be supplied; otherwise, every time you run the model, a different random sample is selected and the model might change.
Here is the result:
So the validation (OOB) error is 20.36%, and since there were 10 independent variables, 3 variables were tried at each split, as √10 ≈ 3.
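The printed OOB estimate can also be pulled out of the fitted object: err.rate stores the cumulative OOB error after each tree, and plot() shows how it stabilizes as trees are added:

# OOB error after the last (500th) tree - matches the printed estimate
tail(RF_Model_1$err.rate[, "OOB"], 1)
# OOB error rate vs. number of trees
plot(RF_Model_1)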
We can also use the mtry argument in order to change the number of variables tried at each split.
# Use the following code to identify the optimal number of mtry (variables to be selected at each split)
set.seed(222)
Try_various_m = tuneRF(Training_data[,-11],Training_data$high, ntreeTry=500,
stepFactor=2,improve=0.05, plot=TRUE)
# First argument in tuneRF is the data containing the independent variables, second is the dependent variable, third is the number of trees; stepFactor is by how much mtry should be inflated at each step. Finally, plot = TRUE gives the plot shown below.
# We get the plot between OOB error and mtry, which can help us decide the optimal value of mtry
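The object returned by tuneRF is a small matrix of the mtry values tried and their OOB errors, so the optimal mtry can also be picked programmatically:

# First column: mtry values tried, second column: corresponding OOB error
print(Try_various_m)
best_mtry = Try_various_m[which.min(Try_various_m[, 2]), 1]
best_mtry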
set.seed(222)
RF_Model_2 = randomForest(high ~ ., data = Training_data, mtry = 6, importance = T, ntree = 500)
print(RF_Model_2)
# Now let us introduce one of the quite useful features of the Random Forest model: it helps in understanding the individual importance of each variable in the classification
importance(RF_Model_2)
varImpPlot(RF_Model_2)
Looking at the plots, you can say that ShelveLoc and Price are the most important variables for classification.
Both MeanDecreaseAccuracy and MeanDecreaseGini are measures of the classification power of variables: the higher these values, the higher the classification power.
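The importance table is easier to read when sorted; for example, ranking the variables by MeanDecreaseGini:

# Rank variables by MeanDecreaseGini (highest classification power first)
imp = importance(RF_Model_2)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]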
Let's finally use the model for prediction on the Testing data and see how the predictions look:
# To get the predictions in terms of probabilities
pred_1 = predict(RF_Model_2, Testing_data, type = "prob")
# To get the predictions in terms of response: "yes"/"no"
pred_2 = predict(RF_Model_2, Testing_data, type = "response")
# Calculate the prediction error
mean(pred_2 != Testing_data$high)
# Wow! Only 12.5% error
# Finally to get the results
result = cbind(Testing_data, pred_1, pred_2)
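A confusion matrix gives a more detailed view than the single 12.5% error rate, showing where the misclassifications actually happen:

# Cross-tab of predicted vs. actual classes on the Testing data
table(Predicted = pred_2, Actual = Testing_data$high)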
Let's explore a few more features that can be used:
1. You should try the additional, optional argument nodesize, with which you can limit how far the trees grow. With this option you ensure a minimum number of observations in each of the terminal nodes (a larger nodesize gives smaller trees).
set.seed(222)
rf <- randomForest(high ~ ., data = Training_data, mtry = 6, importance = T, nodesize = 10, ntree = 500)
print(rf)
importance(rf)
varImpPlot(rf)
2. The last feature that we would like to explain here is the treatment of missing values using the random forest model.
The randomForest library provides two ways to deal with missing values:
a. na.roughfix : deals with the missing values in the classical way. Numeric missing values are imputed with the column median, and categorical (factor) missing values are imputed with the mode (most frequent value).
# Try the following code to check it out
data_missing = Training_data
data_missing[3,5] = NA
data_missing[3,6] = NA
with_roughfix <- na.roughfix(data_missing)
# The option can directly be used with modeling code
Model_1 = randomForest(high ~ ., data_missing, na.action=na.roughfix)
b. rfImpute : imputes the missing values using the random forest proximity matrix.
# Try the following code to check it out
set.seed(222)
prox_imp <- rfImpute(high ~ ., data_missing, iter = 10, ntree = 100)
Here, first the missing values get imputed with the "na.roughfix" method, and then the system starts building the random forest model on the completed data. The proximity matrix from the random forest is used to update the imputation of the NAs.
For continuous predictors, the imputed value is the weighted average of the non-missing observations, where the weights are the proximities. For categorical predictors, the imputed value is the category with the largest average proximity. This process is iterated iter times.
Proximity is the similarity between observations, measured in terms of how often they fall in the same terminal node. Suppose we build a random forest model with 500 trees on some data.
Suppose there are two observations "x" and "y" that fall in the same node in 140 of the trees; their raw proximity is then 140. At the end, proximities are normalized by dividing by the number of trees (500).
Normalized Proximity = 140/500 = 0.28
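If you want to see these normalized proximities yourself, randomForest returns the proximity matrix when asked (note that this N x N matrix gets memory-heavy for large data):

set.seed(222)
rf_prox = randomForest(high ~ ., data = Training_data, ntree = 500, proximity = TRUE)
# rf_prox$proximity[x, y] = fraction of trees in which observations
# x and y fall in the same terminal node
dim(rf_prox$proximity)
rf_prox$proximity[1:3, 1:3]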
The second option is considered to be the better method of missing value imputation.
What's wrong with Random Forest?
So far we have learnt everything good about the Random Forest technique, but we should also know its dark side. The technique is kind of a BLACK BOX, in the sense that we do not have a clear picture of the model like we have in the case of CART.
We don't know what rules have been derived by the classification technique. All we get in the end is the prediction and the importance of variables, but no rules or equations like we have in Logistic Regression or a Decision Tree.
With this we are ending the article on Random Forest... all that I can say in the end:
Just one tree has become old fashioned; it is now the time of the forest.
I really feel like contributing towards the betterment of health of global environment by planting forests, not just a tree !!!
Humble appeal:
Enjoy reading our other articles and stay tuned with us.
Kindly do provide your feedback in the 'Comments' Section and share as much as possible.