Decision Tree With Party Package

Decision Science


I still remember those golden olden days when we did not have many options for anything, be it biscuits or decision trees. People in India generally bought Parle-G biscuits and drew decision trees with SAS or SPSS.

These days the market is full of assortment, which is both bad and good: bad because people often get confused by too many choices, good because you don't have to compromise for lack of options. Data science has evolved in much the same way. R itself provides many packages that do the same job. Let's learn the beautiful party package for building decision trees and enjoy the power of assortment! Read on and you will understand the significance of its name.

In the previous article on decision trees, we covered how to build a decision tree using the tree package:

Decision Tree in R with {tree} Package

Everything remains the same here, except the package ...

For this demonstration of a decision tree with the {party} package, we will use the Carseats data set, which is built into the ISLR package. Let's first get the data.

# As usual, we first clean the environment
rm(list = ls())
# install the package if not already done
if (!require(ISLR)) install.packages("ISLR")
library(ISLR)
# let's make a copy of Carseats data into data_1
data_1 = Carseats

# Now let's install the package this article focuses on
if (!require(party)) install.packages("party")
library(party)

# Let's first feel the data
head(data_1)



About the data
It is a simulated data set containing sales of child car seats at 400 different stores.
Variable descriptions
  • Sales : Unit sales (in thousands) at each location
  • CompPrice : Price charged by the competitor at each location
  • Income : Community income level (in thousands of dollars)
  • Advertising : Local advertising budget for the company at each location (in thousands of dollars)
  • Population : Population size in the region (in thousands)
  • Price : Price the company charges for car seats at each site
  • ShelveLoc : A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
  • Age : Average age of the local population
  • Education : Education level at each location
  • Urban : A factor with levels No and Yes indicating whether the store is in an urban or rural location
  • US : A factor with levels No and Yes indicating whether the store is in the US or not

Here the objective of the decision tree is to understand the sales pattern over the remaining variables and find out which variables drive the sales.

For the purpose of building a classification tree, we first need to convert the Sales variable into a binary variable. Let's see how Sales is distributed:

hist(data_1$Sales)

We can see that Sales is roughly normally distributed, its values lie between 0 and 16, and the mode is around 8. Let's bifurcate the variable as follows:

# We make a variable "high" which is "yes" when sale is more than or equal to 8, "no" otherwise
data_1$high = as.factor(ifelse(data_1$Sales>=8,"yes","no"))
# We also drop the original Sales variable
data_1$Sales = NULL

# Let's look at the data again
head(data_1)



As a standard modeling practice, we break the data into two parts: training data and testing data.

  • Training data : the data set on which we train/build our model
  • Testing data : also called validation data; on it we check how well our model performs
There is also a third kind of data, for which the actual value of the predicted (Y) variable is unknown and which we want to predict. This is the data the model is ultimately built for, but for learning purposes we focus on these two datasets.

Here we break the data into a 70:30 ratio randomly.

# It is important to set a seed: without one, a different set of random numbers is generated every time you run the program, so the seed fixes the randomization


set.seed(222)
train = sample(1:nrow(data_1), nrow(data_1)*0.7)
Training_data =  data_1[train,]
Testing_data =  data_1[-train,]
rm(data_1,train)
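The effect of set.seed() is easy to verify in isolation: the same seed reproduces the same random draw. A minimal base-R sketch, independent of the Carseats data:

```r
# Standalone check: the same seed reproduces the same sample
set.seed(222)
a = sample(1:400, 5)
set.seed(222)
b = sample(1:400, 5)
identical(a, b)  # TRUE: same seed, same draw
```

This is why the train/test split above is reproducible across runs of the script.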


# So now we would train our model on the Training_data


party_tree = ctree(high~., Training_data)

plot(party_tree)



#  you get a beautiful, visually explained classification tree



Let's try to read the decision tree :
1. The first branching is done on ShelveLoc, with "Good" in one branch and "Bad + Medium" in the other. The p-value shown at each node indicates the goodness of the split: the further it lies below 0.05, the more confident we are in the split.

2. On the left side, the next branching is done on the Price variable. If Price <= 134, then more than 80% of the target variable high is "yes".

3. Node 7 also has a concentration of "yes" above 80%, and you can easily read off the rule: ShelveLoc = "Medium" and Price <= 96.

The tree is automatically quite optimized.
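Beyond the plot, party also lets you query the fitted tree programmatically. As a hedged sketch (assuming the party_tree and Training_data objects built above), predict() with type = "node" returns the terminal node each observation falls into, so you can tabulate how the cases are distributed across the leaves:

```r
# A sketch, assuming party_tree and Training_data exist as built above.
# type = "node" returns the terminal node ID for each observation
node_ids = predict(party_tree, Training_data, type = "node")

# How many training observations land in each terminal node?
table(node_ids)

# Cross-tabulate terminal node against the target to see leaf purity
table(node_ids, Training_data$high)
```

The second table gives you, in numbers, the same leaf purities that the bar charts at the bottom of the plot show visually.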

# Now let's see how to predict the result on the test data

try_pred = predict(party_tree, Testing_data, type = "response") # Will give direct prediction
try_pred_prob = predict(party_tree, Testing_data, type = "prob")# Will give probability

# Check the misclassification rate
mean(try_pred != Testing_data$high)
# it is 28%
# Check the misclassification table
table(try_pred, Testing_data$high)

# Here predicted values are in the rows and actual values in the columns
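The confusion table and the misclassification rate are tied together by simple arithmetic: the diagonal cells are the correct predictions and the off-diagonal cells are the errors. A self-contained base-R sketch with a hypothetical confusion matrix (the numbers below are made up for illustration, not taken from the model above):

```r
# Hypothetical confusion matrix: rows = predicted, columns = actual
conf = matrix(c(55, 14, 20, 31), nrow = 2, byrow = TRUE,
              dimnames = list(predicted = c("no", "yes"),
                              actual    = c("no", "yes")))

# Accuracy = sum of the diagonal / total
accuracy = sum(diag(conf)) / sum(conf)
# Misclassification rate is its complement
misclass = 1 - accuracy
c(accuracy = accuracy, misclass = misclass)
```

Running the same arithmetic on the real table() output above reproduces the 28% figure.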

# You can store probability of response in a data frame

x = t(as.data.frame(try_pred_prob))
row.names(x) = NULL
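predict(..., type = "prob") from party returns a list with one probability vector per observation, which is why the transpose trick above is needed. Here is a self-contained base-R sketch of the same reshaping on a made-up list (the probabilities below are illustrative only, not model output):

```r
# Made-up list mimicking the structure of try_pred_prob:
# one c(P(no), P(yes)) vector per observation
prob_list = list(c(0.82, 0.18), c(0.35, 0.65), c(0.10, 0.90))

# Bind the vectors into rows and label the columns
prob_df = as.data.frame(do.call(rbind, prob_list))
names(prob_df) = c("no", "yes")
prob_df
```

do.call(rbind, ...) stacks one row per observation, so the result has a row for each test case and a labeled column for each class probability.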


.... that's it! I hope you find the package interesting, and that the cool visual result and simple interpretation justify its name.



Humble appeal



Enjoy reading our other articles and stay tuned with us.

Kindly provide your feedback in the 'Comments' section and share as much as possible.