The concepts of Bagging and Boosting

Ensemble Learning Techniques


In one of the previous posts we covered Random Forest, one of the most popular ensemble learning techniques. We touched upon the concepts of bagging and boosting there, but not in much depth. Let's now understand the concepts of ensemble learning in greater detail.

Ensemble Learning


Let's understand the concepts of Ensemble Learning with an interesting example.

In India, almost everyone of our generation must have watched the famous quiz show Kaun Banega Crorepati (KBC), hosted by Amitabh Bachchan. The show's concept was adapted from Who Wants to Be a Millionaire, a US game show.

Initially, contestants on the show used to get 3 lifelines:

1. Phone a Friend: The contestant could call a friend, supposedly an "expert", to seek the answer.
2. 50:50: Two of the wrong options were removed, leaving one right and one wrong option.
3. Audience Poll: The people present in the audience were asked for their opinion on the answer, and they voted. The voting results were then shown to the contestant.

It was often seen that the "expert" friend was quite frequently wrong, but the "non-expert" audience voted for the correct answer almost every time.

I never knew the reason why that happened, but I can relate it a lot to Ensemble Learning.

Using this analogy, we can say that Ensemble Learning is a technique in which many less accurate models (weak learners) are considered collectively, and the resultant model derived from them turns out to be highly accurate. An individual model might be less accurate, but the collective model is highly accurate.

There are various types of Ensemble Learning techniques:

1.  Bagging (abbreviation of Bootstrap Aggregating)

1.1.  Random Forest

2. Boosting


We shall now understand each of the above techniques one by one.

Bagging


In Bagging, bootstrapped samples are taken from the data. Bootstrapped means random sampling with replacement.

Bootstrapping
Example: You have 8 coins in a bowl and you are asked to pick 5 coins at random. But after you pick a coin, you note down its number and then put it back in the bowl. So it is likely that you might pick the same coin again in the next 4 picks. You can thus have samples as illustrated below:

Coins in the Bowl:  {1,2,3,4,5,6,7,8}
Sample 1 : {1,5,2,6,2}
Sample 2 : {8,4,3,3,6}
Sample 3 : {2,5,1,7,4}
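The coin example above can be sketched in a few lines of Python; the seed and sample values here are illustrative, not the ones shown above.

```python
import random

random.seed(42)

# The bowl of 8 numbered coins from the example above.
coins = [1, 2, 3, 4, 5, 6, 7, 8]

def bootstrap_sample(population, size):
    # Picks *with replacement*: the same coin can show up
    # more than once within a single sample.
    return [random.choice(population) for _ in range(size)]

for i in range(3):
    print(f"Sample {i + 1}:", bootstrap_sample(coins, 5))
```

Run it a few times without the seed and you will see repeated coins appear in most samples, which is exactly what "with replacement" means.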

So first, M samples are taken from the data, and a model is trained on each of the samples. The results of these models are then aggregated in the following ways:

1. For a regression model (e.g. linear regression): A simple average of all the models' predictions is taken to get the resultant model.
Bagging - Linear Regression

2.  For a classification model: Each model votes in terms of the response, and the final response is selected based on the majority.
Bagging - Classification Model

Either way, an equal weight is given to each model's result while aggregating.
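The whole bagging loop can be sketched in plain Python. The data, the deliberately weak learner (a least-squares slope through the origin), and the vote labels below are all hypothetical, chosen only to show the two aggregation rules.

```python
import random
from statistics import mean, mode

random.seed(0)

# Toy data: y is roughly 2*x plus noise (hypothetical example).
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(1, 21)]

def fit_slope(sample):
    # Deliberately weak learner: least-squares slope through the origin.
    return sum(x * y for x, y in sample) / sum(x * x for x, _ in sample)

# Step 1: draw M bootstrapped samples and train one model on each.
M = 10
models = [fit_slope([random.choice(data) for _ in range(len(data))])
          for _ in range(M)]

# Step 2 (regression): aggregate by a simple, equally weighted average.
bagged_slope = mean(models)
print(round(bagged_slope, 2))  # close to the true slope of 2

# Step 2 (classification): each model votes; the majority response wins.
votes = ["spam", "ham", "spam", "spam", "ham"]
print(mode(votes))  # spam
```

Note that each individual slope estimate wobbles from sample to sample, while their average is more stable, which is the variance reduction discussed next.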

Bagging helps in reducing the variance of error, but not the bias. 

Let's now explore this dimension of the concept in a little more detail.

The concept is not new at all; it was used earlier in the world of surveying, where it was framed in terms of precision and accuracy. I learned about it during my Mining Engineering studies.

Let's consider 4 archers aiming at a target board; the following are their hit patterns:



Archer A: Hits close to the bull's eye, but his hits are quite distant from each other and scattered. He has a low bias/error and a high variance (can also be described as high accuracy and low precision).
Such a guy might win a gold at the Olympics, but won't be a consistent performer.


Archer B: Hits far from the bull's eye, and his hits are quite distant from each other and scattered. He has a high bias/error and a high variance (can also be described as low accuracy and low precision).
He is an amateur.

Archer C: Hits close to the bull's eye, and his hits are quite close to each other. He has a low bias/error and a low variance (can also be described as high accuracy and high precision).
He is a true performer!

Archer D: Hits far from the bull's eye, but his hits are quite close to each other. He has a high bias/error and a low variance (can also be described as low accuracy and high precision).
This guy can be trained and then made a good performer.

Similarly, we expect our model to be like Archer C: low in bias/error and low in variance. Such a model is called a robust model.

Bagging might graduate a model from Archer B to Archer D, but it doesn't make it Archer C. If the model is already an Archer A, bagging might make it an Archer C.

Random Forest


Random Forest is an evolved version of the Bagging technique with a twist of randomization. Bagging itself contains a flavor of randomization in its sampling, but in Random Forest there is an additional component of randomization: a random pool of variables is selected at every node.
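That extra twist can be sketched in isolation. The feature names and the sqrt(p) rule below are illustrative; the point is only that each node draws its own random candidate pool.

```python
import math
import random

random.seed(1)

# Hypothetical feature set for illustration.
n_features = 9
feature_names = [f"x{i}" for i in range(n_features)]

def candidate_features():
    # Random Forest's twist: at every node, only a random subset of
    # features (commonly around sqrt(p) of them) is considered for the split.
    k = int(math.sqrt(n_features))
    return random.sample(feature_names, k)

# Two different nodes typically see different candidate pools.
print(candidate_features())
print(candidate_features())
```

Because different nodes (and different trees) look at different variable pools, the trees become less correlated with each other, which strengthens the averaging step of bagging.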

We have already explained Random Forest in greater detail in one of our earlier articles:


Boosting


Boosting is another ensemble learning technique in which models are built in an iterative manner, with every successive model considering the errors of the previous model and trying to improve upon them. The idea behind boosting is that a weak learner can be improved to make a strong learner. The boosting technique is designed with the goal of bias/error minimization.

There are various boosting algorithms, of which most popular are AdaBoost and Gradient Boosting.

                                                     AdaBoost = Adaptive Boosting

AdaBoost is a boosting algorithm designed for classification problems. In this algorithm, a basic classification model (tree) is built first, and then each subsequent tree is built in consideration of the observations misclassified by the previous tree.

A higher weight is given to the misclassified observations (also called hard-to-classify observations) in the successive tree, and each successive model tries to classify those observations correctly. Finally, the outcomes of all the trees are aggregated, giving more weight to the trees with better predictions.
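One round of that re-weighting can be worked through by hand. The misclassification pattern below is hypothetical; the formulas are the standard AdaBoost update (error-weighted alpha, exponential up/down-weighting, renormalisation).

```python
import math

# Five observations start with equal weights; suppose (hypothetically)
# that the first tree misclassifies observations 1 and 3.
weights = [0.2] * 5
misclassified = [True, False, True, False, False]

# Weighted error of the current tree.
err = sum(w for w, m in zip(weights, misclassified) if m)  # 0.4

# The tree's "say" in the final aggregation: lower error gives a larger
# alpha, which is how better trees get more weight in the final vote.
alpha = 0.5 * math.log((1 - err) / err)

# Up-weight the hard-to-classify observations, down-weight the rest,
# then renormalise so the weights again sum to 1.
new_weights = [w * math.exp(alpha if m else -alpha)
               for w, m in zip(weights, misclassified)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print([round(w, 3) for w in new_weights])
```

After the update, the two misclassified observations carry weight 0.25 each while the correctly classified ones drop to about 0.167, so the next tree is pushed to focus on the hard cases.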

By virtue of its design, boosting might lead to an overfitting problem, and hence the number of iterations should be controlled. With each iteration the model becomes more complex and might give better results, but the improvement might be limited to the training data only; the model might not hold good on the validation data. Hence it is suggested to limit the number of iterations in such a way that the model is neither under-fit nor over-fit.

The second algorithm in boosting is Gradient Boosting, which is meant for general statistical models.

Gradient Boosting = Gradient Descent + Boosting

Both AdaBoost and Gradient Boosting follow the same fundamental idea: both algorithms boost the performance of a simple base learner by iteratively shifting the focus towards problematic observations that are difficult to predict. In AdaBoost, this shift is done by up-weighting observations that were misclassified before, while Gradient Boosting identifies difficult observations by the large residuals computed in the previous iterations.
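The residual-fitting idea can be sketched in miniature. The data, learning rate, and the deliberately weak learner (which predicts only the mean residual) are hypothetical stand-ins for real trees.

```python
# Toy sketch of gradient boosting for squared-error regression:
# each iteration fits a weak learner to the residuals left behind
# by the previous iterations.
y = [3.0, 5.0, 8.0, 12.0]
pred = [0.0] * len(y)
learning_rate = 0.5

for step in range(10):
    # Residuals play the role of AdaBoost's "hard" observations:
    # they are what the model so far has failed to explain.
    residuals = [t - p for t, p in zip(y, pred)]
    # Weak learner: predict just the mean of the current residuals.
    update = sum(residuals) / len(residuals)
    pred = [p + learning_rate * update for p in pred]

print([round(p, 2) for p in pred])  # all predictions approach mean(y) = 7
```

In a real implementation, each weak learner would be a small tree rather than a constant, so it could also shrink the per-observation residuals, not just their mean; but the loop structure is the same.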

The next article will cover the Gradient Boosting technique in detail, especially the "how to do" part.


Humble appeal


Enjoy reading our other articles and stay tuned with us.

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.