Ensemble Technique - Random Forest in R

Machine Learning Techniques - I

Machine Learning is a buzz word these days in the world of data science and analytics. R and Python have grown popular as these tools are full of advanced machine learning techniques.

At Ask Analytics we have covered many basic machine learning techniques so far; now we are starting with advanced techniques!
What is Machine Learning?

Machine learning is a sub-field of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed"[wiki]

So basically you do not tell your computer much about the data/information, and it comes up with insights and findings from the data on its own.

Following are a few machine learning techniques that we have covered so far:

1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. Cluster Analysis
5. Market Basket Analysis (Apriori algorithm)
6. Principal Component Analysis / Factor Analysis

I know you must be expecting something like the image below for the term Machine Learning:

But let's face the reality.
Let's now focus on the following advanced machine learning techniques:

1. Random Forest
2. Boosting Techniques
3. Naive Bayes
4. KNN
5. Support Vector Machine

We might add more items to the list as and when we learn these techniques!

Now at least these names look quite fancy, right?

Before we tick off the first technique in the list, i.e. Random Forest, we had better know about the term Ensemble Learning.

What is Ensemble Learning?

The word ensemble means "group", and Ensemble Learning means considering a group of predictive models together and building a resultant model with better predictive power.

An individual model (also known as a weak learner) might have more errors, and these models might be quite different from each other, although built on the same population of data. In Ensemble Learning, individual models come together to bring forth a model that has less error as well as less variance in the errors across the model: a strong learner, a better model.

Even in our routine life, we practice a lot of ensemble learning. For example, before taking a decision on your stocks' portfolio, you might consider advice from various experts; rather than following one person's advice, you make a suitable and best mix for yourself. Bingo! You are already an ensemble learning expert.

Ensemble models can be of various types; the most popular are Bagging and Boosting.

Bagging: In Bagging, various models are built in parallel on various samples, and then these models vote to give the final prediction.

Let's spend a couple of minutes in order to understand the concept of Bagging:

Suppose we train a CART model on sample data and then test the model on 4 other samples (illustration shown above). We find that the terminal nodes of the model do not match across the different samples: different opinions from different models. What do we do now?

We simply take the votes from each of the models, and the majority wins! Hence, in the final model, the left-most terminal node says "No" with 25% error / 75% accuracy, and the second one says "Yes", again with 25% error / 75% accuracy.
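The voting step above can be sketched in a few lines of base R. The predictions below are made up purely for illustration: four hypothetical bagged models voting on five observations.

```r
# Hypothetical "yes"/"no" predictions from 4 bagged models (columns)
# on 5 observations (rows)
votes = matrix(c("no",  "no",  "yes", "no",
                 "yes", "yes", "yes", "no",
                 "no",  "yes", "no",  "no",
                 "yes", "yes", "no",  "yes",
                 "no",  "no",  "no",  "yes"),
               nrow = 5, byrow = TRUE)

# Majority vote per observation: the most frequent prediction wins
majority_vote = apply(votes, 1, function(v) names(which.max(table(v))))
majority_vote
# "no" "yes" "no" "yes" "no"
```

Each observation's final class is simply whichever label got the most votes across the models.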

Boosting: In Boosting, the models are built in series. In each successive model, the weights are adjusted based on the learning of the previous model. As the earlier model's correct and wrong predictions are known, the next model tries to improve on the wrongly predicted part. The technique might lead to overfitting on the training data.

We will understand Boosting in greater detail in the upcoming article on the Boosting technique.

What is Random Forest?

Random Forest is a bagging-style ensemble learning technique, but with a twist and a flavor of randomization. In Random Forest, the system builds multiple CART models on various random bootstrapped samples, and then each of these models votes democratically to choose the final prediction.

One thing that I would like to mention explicitly is that there is a dual flavor of randomization:

1. Random Sample Selection
2. Random Variables Selection

Let's see how it works and then we can better understand the literal meaning of its name:

How Random Forest works:

1. Suppose we have a data set with N observations and P + 1 variables, in which one of the variables is the target variable (Y) and the other P are independent variables.

2. Multiple random samples are drawn with replacement (bootstrapped) from the data; each sample contains roughly 2/3 of the N distinct observations. The remaining ~1/3 of the data is used for simultaneous validation of the model.

3. On the validation part of the data, the misclassification rate is calculated (called the Out-of-Bag/OOB error). Finally, the OOB error is aggregated over all the models/trees, and the aggregated error rate is called the overall OOB rate.

This is one more beautiful feature of the Random Forest technique: even if you do not split the data into training and validation parts, it does it by default.

4. Out of the P independent variables, "p" variables are selected at random at every node split, and the best among them is used for the split. The user can specify the number of variables to be considered (p); otherwise, for a classification model, p = √P by default.

There is a school of thought about this technique that says that since we are not analyzing all the variables at each node, we might not get the best split. But if we build a good number of trees, this problem more or less gets nullified.

5. The user is also supposed to specify the number of trees he/she wants the system to build in the forest.

6.  Each tree is grown to the maximum extent possible. There is no pruning at all.

7. Finally, the results of all the tree models are counted as votes, and the final prediction is based on the majority (as already illustrated in Bagging).
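The sampling mechanics behind steps 2-4 can be sketched in base R. This is only an illustration of the bootstrap draw, the out-of-bag set, and the √P variable draw, not a full forest; the numbers N = 400 and P = 10 are chosen to match the Carseats data used below.

```r
set.seed(42)   # for reproducibility
N = 400        # observations
P = 10         # independent variables

# Step 2: a bootstrap sample of size N, drawn with replacement
boot_idx = sample(1:N, N, replace = TRUE)

# About 2/3 of the distinct observations end up "in the bag" ...
in_bag_fraction = length(unique(boot_idx)) / N

# ... and the remaining ~1/3 are "out of bag", used for validation
oob_idx = setdiff(1:N, boot_idx)
oob_fraction = length(oob_idx) / N

# Step 4: at each split, p = sqrt(P) variables are tried
# (the default for a classification model)
p = floor(sqrt(P))
tried_vars = sample(1:P, p)

c(in_bag = in_bag_fraction, oob = oob_fraction, p = p)
```

Running this shows the in-bag fraction landing near 2/3 (the expected unique fraction of a bootstrap sample is about 63.2%), with the rest out of bag.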

Let's try running a Random Forest (RF) model; we will learn more about RF within the process itself.

Finally the Random Forest in R

We will build the Random Forest model on the same "Carseats" data from the ISLR package that we used for tree-based models.

rm(list = ls())
if (!require(ISLR)) { install.packages("ISLR"); library(ISLR) }
data_1 = Carseats

data_1$high = as.factor(ifelse(data_1$Sales>=8,"yes","no"))

# For building a classification model, do specify the Y variable as factor

data_1$Sales = NULL
set.seed(1)   # for a reproducible train/test split
train = sample(1:nrow(data_1), nrow(data_1)*0.7)
Training_data =  data_1[train,]
Testing_data =  data_1[-train,]

# We would now install the randomForest package

if (!require(randomForest)) { install.packages("randomForest"); library(randomForest) }

# A seed should be supplied; otherwise, every time you run the model a
# different random sample is selected and the model might change
set.seed(1)

# Let us run the first random forest model, with 500 trees
RF_Model_1 = randomForest(high ~ ., data = Training_data, ntree = 500)

Here is the result:

So the validation (OOB) error is 20.36%, and since there were 10 independent variables, 3 variables were tried at each split (√10 ≈ 3).

We can also use the mtry argument to change the number of variables tried at each split.

# Use the following code to identify the optimal number of mtry (variables to be selected at each split)

Try_various_m = tuneRF(Training_data[,-11],Training_data$high, ntreeTry=500,
               stepFactor=2,improve=0.05,  plot=TRUE)

# The first argument in tuneRF is the data containing the independent variables,
# the second is the dependent variable, ntreeTry is the number of trees, and
# stepFactor is the factor by which mtry is inflated at each step.
# Finally, plot = TRUE gives the plot shown below.

# We get a plot of OOB error vs mtry, which helps us decide the optimal value of mtry
rf = randomForest(high ~ ., data = Training_data, mtry = 6, importance = TRUE, ntree = 500)

# Now let us introduce one of the quite useful features of the Random Forest model: it helps in understanding the importance of individual variables in classification



Looking at the plots, you can say that ShelveLoc and Price are the most important variables for classification.

Both MeanDecreaseAccuracy and MeanDecreaseGini are measures of the classification power of variables. The higher these values, the greater the classification power.
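MeanDecreaseGini comes from the Gini impurity that CART-style trees use to score splits. Here is a base R sketch of how the Gini decrease is computed for one split; the class counts are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Gini impurity of a node, given its class counts
gini = function(counts) {
  p = counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical parent node: 50 "yes" and 50 "no"
parent = c(yes = 50, no = 50)
# A candidate split sends 40 "yes" + 10 "no" left, 10 "yes" + 40 "no" right
left  = c(yes = 40, no = 10)
right = c(yes = 10, no = 40)

# Weighted impurity of the children after the split
n = sum(parent)
child_gini = sum(left) / n * gini(left) + sum(right) / n * gini(right)

# The decrease in Gini impurity credited to the splitting variable;
# MeanDecreaseGini averages such decreases over all splits on that variable
gini_decrease = gini(parent) - child_gini
gini_decrease
# 0.18
```

In practice you do not compute this by hand: importance(model) and varImpPlot(model) from the randomForest package print and plot both MeanDecreaseAccuracy and MeanDecreaseGini for a fitted model.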

Let's finally use the model for prediction on the Testing data and see how good the predictions are:

# To get the prediction in terms of probability
pred_1 = predict(rf, Testing_data, type = "prob")
# To get the prediction in terms of response "yes"/"no"
pred_2 = predict(rf, Testing_data, type = "response")
# Calculate the prediction error
mean(pred_2 != Testing_data$high)
# Wow !!! only 12.5% error

# Finally to get the results
result = cbind(Testing_data, pred_1, pred_2)

You can try more variations and get better results. Try various ntree and mtry values.

Let's explore a few more features that can be used:

1. You should try the additional, optional argument nodesize, which limits how far the trees grow. With this option you ensure a minimum number of observations in each terminal node.

rf <-randomForest(high~.,data=Training_data, mtry=6, importance=T,nodesize = 10 ,ntree=500)

2. The last feature that we would like to explain here is the treatment of missing values using the random forest model.

The randomForest library provides two ways to deal with missing values:

a. na.roughfix: deals with missing values in the classical way. Numeric missing values are imputed with the column median, and factor missing values with the mode (the most frequently occurring level).

# Try the following code to check it out
data_missing = Training_data
data_missing[3,5] = NA
data_missing[3,6] = NA
with_roughfix <- na.roughfix(data_missing)

# The option can directly be used with modeling code
Model_1 = randomForest(high ~ ., data_missing, na.action=na.roughfix)

b. rfImpute: this option imputes the missing values using the proximity concept.

# Try the following code to check it out
prox_imp <- rfImpute(high ~ ., data_missing, iter = 10, ntree = 100)

Here, the missing values first get imputed with the na.roughfix method, and then the system builds the random forest model on the completed data. The proximity matrix from the random forest is used to update the imputation of the NAs.
For continuous predictors, the imputed value is the weighted average of the non-missing observations, where the weights are the proximities. For categorical predictors, the imputed value is the category with the largest average proximity. This process is iterated iter times.

Proximity is the similarity between observations, measured in terms of how often they fall in the same terminal node. Suppose we build a random forest model with 500 trees, and two observations "x" and "y" fall in the same node in 140 of those trees; their proximity is 140. At the end, proximities are normalized by dividing by the number of trees (500):

Normalized Proximity = 140/500 = 0.28
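The proximity-weighted imputation described above can be sketched in base R. The proximities and values here are hypothetical, made up for illustration; rfImpute computes the real proximities from the forest.

```r
# Hypothetical normalized proximities of 4 complete observations to
# the observation whose value is missing
proximity = c(0.28, 0.10, 0.50, 0.12)

# Observed values of a continuous predictor for those 4 observations
values = c(110, 95, 120, 100)

# Continuous predictor: proximity-weighted average of the non-missing values
imputed = sum(proximity * values) / sum(proximity)
imputed
# 112.3

# Categorical predictor: the category with the largest total proximity wins
categories = c("good", "bad", "good", "bad")
names(which.max(tapply(proximity, categories, sum)))
# "good"
```

Observations that often land in the same terminal nodes as the incomplete one thus dominate the imputed value, which is exactly the intuition behind rfImpute.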

The second option is considered the better method of missing value imputation.

What's wrong with Random Forest?

So far we have learnt everything good about the Random Forest technique, but we should also know its dark side. The technique is kind of a BLACK BOX in the sense that we do not get a clear picture of the model, as we do in the case of CART.

We don't know what rules have been derived by the classification technique. All we get in the end is the prediction and the importance of variables, but no rules or equations, as we have in Logistic Regression or a Decision Tree.

With this we are ending the article on Random Forest. All that I can say in the end:

Just one tree has become old-fashioned; it is now the time of the forest.

I really feel like contributing towards the betterment of health of global environment by planting forests, not just a tree !!!

Humble appeal

Enjoy reading our other articles and stay tuned with us.

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.