#Importing key libraries to be used in this project
import pandas as pd
import pandasql as ps
import numpy as np
import math as mt
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
from matplotlib import pyplot as plt
# Set the display options to show all columns
pd.set_option('display.max_columns', None)
# This is openly available data on many websites e.g. Kaggle
raw_data = pd.read_csv('input_dataset/titanic.csv')
raw_data.head(2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
# Checked the rows and columns in the data - feeling the data
raw_data.shape
(891, 12)
raw_data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
Module 0.Exploratory Data Analysis (EDA)¶
In our blog, we've taken on the intriguing challenge of predicting survival on the Titanic using the renowned Kaggle 'Titanic dataset.' This dataset comprises details of 891 passengers who embarked on that fateful journey in 1912. As we dive into this historical voyage, we uncover the factors that influenced survival, where gender, age, and social class played pivotal roles. Armed with Python and classification models, we embark on a data-driven journey to develop a predictive model that sheds light on who would have likely survived this iconic disaster. Join us as we navigate the waters of data science to unearth the hidden insights within this Titanic dataset
Following are the fields in the data:
-- Name (str) - Name of the passenger
-- Pclass (int) - Ticket class
-- Sex (str) - Sex of the passenger
-- Age (float) - Age in years
-- SibSp (int) - Number of siblings and spouses aboard
-- Parch (int) - Number of parents and children aboard
-- Ticket (str) - Ticket number
-- Fare (float) - Passenger fare
-- Cabin (str) - Cabin number
-- Embarked (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
# Checking basisc stats of the data
raw_data.describe().T # T transposes the data
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
PassengerId | 891.0 | 446.000000 | 257.353842 | 1.00 | 223.5000 | 446.0000 | 668.5 | 891.0000 |
Survived | 891.0 | 0.383838 | 0.486592 | 0.00 | 0.0000 | 0.0000 | 1.0 | 1.0000 |
Pclass | 891.0 | 2.308642 | 0.836071 | 1.00 | 2.0000 | 3.0000 | 3.0 | 3.0000 |
Age | 714.0 | 29.699118 | 14.526497 | 0.42 | 20.1250 | 28.0000 | 38.0 | 80.0000 |
SibSp | 891.0 | 0.523008 | 1.102743 | 0.00 | 0.0000 | 0.0000 | 1.0 | 8.0000 |
Parch | 891.0 | 0.381594 | 0.806057 | 0.00 | 0.0000 | 0.0000 | 0.0 | 6.0000 |
Fare | 891.0 | 32.204208 | 49.693429 | 0.00 | 7.9104 | 14.4542 | 31.0 | 512.3292 |
# Checking the y-variable distribution
raw_data.Survived.value_counts(), raw_data.Survived.value_counts(normalize = True)
(0 549 1 342 Name: Survived, dtype: int64, 0 0.616162 1 0.383838 Name: Survived, dtype: float64)
plt.figure(figsize=(3,3))
raw_data.Survived.value_counts(normalize = True).plot(kind = 'pie',
autopct = '%.2f%%',
labels = ['0', '1'],
title = 'Distribution of Target Variable');
# Developed a modular function to create bar charts as required
def Bi_variate_analysis_plot(data,x,y):
plt.figure(figsize=(3,3))
xtab = pd.crosstab(index = data[x], columns = raw_data[y], normalize = 'columns')
xtab.plot.bar()
plt.ylabel("Count")
plt.legend()
return plt
Bi_variate_analysis_plot(data = raw_data ,x = 'Pclass',y = 'Survived')
<module 'matplotlib.pyplot' from 'C:\\Users\\ragarwal\\Anaconda3\\lib\\site-packages\\matplotlib\\pyplot.py'>
<Figure size 300x300 with 0 Axes>
Bi_variate_analysis_plot(data = raw_data ,x = 'Parch',y = 'Survived')
<module 'matplotlib.pyplot' from 'C:\\Users\\ragarwal\\Anaconda3\\lib\\site-packages\\matplotlib\\pyplot.py'>
<Figure size 300x300 with 0 Axes>
Bi_variate_analysis_plot(data = raw_data ,x = 'Embarked',y = 'Survived')
<module 'matplotlib.pyplot' from 'C:\\Users\\ragarwal\\Anaconda3\\lib\\site-packages\\matplotlib\\pyplot.py'>
<Figure size 300x300 with 0 Axes>
# Let's see what missing values are we dealing with
raw_data.isna().sum()
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 5 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64
We have missing values in Age, Sex, Embarked and Cabin field. Looks like the "Cabin" field is hardly having anny data.
Next step after EDA is data cleaning, which includes Missing value treatment, Outlier Treatment etc.¶
Module 1. Missing Value Treatment¶
#Also, let' make a copy of data
raw_data_stg1 = raw_data.copy()
There could be several methods to treat missing values:¶
> Simplest method is deleting the records with missing values - Not preferred if too many records are getting deleted¶
> Deleting the columns having too many missing values¶
> Imputing missing values with Mode (categorical variable), Mean/Median (continuous variable)¶
> Defining missing as a separate category itself (generally used for categorical variable)¶
> Imputation using machine learning itself - using KNN [not covered in this blog, will write a detailed blog for this later]¶
Below we have demonstrated most of the methods:
# Let's first treat missing value in 'sex' variable with Mode
raw_data_stg1.groupby('Sex')['Survived'].count()
Sex female 314 male 572 Name: Survived, dtype: int64
# Another way to find Mode
raw_data_stg1['Sex'].mode()[0]
'male'
# Mode of the varaibles in Male, let's impute it with Mode
raw_data_stg1['Sex_imputed_1'] = raw_data_stg1['Sex'].fillna(value = raw_data_stg1['Sex'].mode()[0])
raw_data_stg1.groupby('Sex_imputed_1')['Survived'].count()
Sex_imputed_1 female 314 male 577 Name: Survived, dtype: int64
# We have noticed that Name contains prefix such as "Mr.", "Mrs.","Ms.", "Master" etc.
# We can use these to impute the Sex variable
raw_data_stg1[['Sex','Name']].head(5)
Sex | Name | |
---|---|---|
0 | male | Braund, Mr. Owen Harris |
1 | female | Cumings, Mrs. John Bradley (Florence Briggs Th... |
2 | female | Heikkinen, Miss. Laina |
3 | female | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
4 | male | Allen, Mr. William Henry |
raw_data_stg1['Salutation_temp'] = raw_data_stg1.Name.str.split(", ", expand = True)[1]
raw_data_stg1['Salutation'] = raw_data_stg1.Salutation_temp.str.split(" ", expand = True)[0]
raw_data_stg1[['Name','Salutation_temp','Salutation']].head()
Name | Salutation_temp | Salutation | |
---|---|---|---|
0 | Braund, Mr. Owen Harris | Mr. Owen Harris | Mr. |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | Mrs. John Bradley (Florence Briggs Thayer) | Mrs. |
2 | Heikkinen, Miss. Laina | Miss. Laina | Miss. |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Mrs. Jacques Heath (Lily May Peel) | Mrs. |
4 | Allen, Mr. William Henry | Mr. William Henry | Mr. |
raw_data_stg1.groupby('Salutation')['Salutation'].count()
Salutation Capt. 1 Col. 2 Don. 1 Dr. 7 Jonkheer. 1 Lady. 1 Major. 2 Master. 40 Miss. 182 Mlle. 2 Mme. 1 Mr. 517 Mrs. 125 Ms. 1 Rev. 6 Sir. 1 the 1 Name: Salutation, dtype: int64
The above information, will not just help in imputing the missing values, but also can provide us more variables
raw_data_stg1['check'] = raw_data_stg1['Sex'].isnull()
check_missing = raw_data_stg1[raw_data_stg1.check == True]
check_missing
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_imputed_1 | Salutation_temp | Salutation | check | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | NaN | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S | male | Mr. William Henry | Mr. | True |
21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | NaN | 34.0 | 0 | 0 | 248698 | 13.0000 | D56 | S | male | Mr. Lawrence | Mr. | True |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | NaN | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C | male | Mr. Farred Chehab | Mr. | True |
29 | 30 | 0 | 3 | Todoroff, Mr. Lalio | NaN | NaN | 0 | 0 | 349216 | 7.8958 | NaN | S | male | Mr. Lalio | Mr. | True |
42 | 43 | 0 | 3 | Kraeff, Mr. Theodor | NaN | NaN | 0 | 0 | 349253 | 7.8958 | NaN | C | male | Mr. Theodor | Mr. | True |
raw_data_stg1['Sex_imputed_2'] = raw_data_stg1['Sex']
raw_data_stg1.loc[raw_data_stg1['Salutation'].isin(['Mr.','Master.'])
& raw_data_stg1['Sex'].isnull() , 'Sex_imputed_2'] = 'male'
raw_data_stg1.loc[raw_data_stg1['Salutation'].isin(['Lady.','Miss.','Mrs.','Ms.'])
& raw_data_stg1['Sex'].isnull(), 'Sex_imputed_2'] = 'female'
raw_data_stg1['check'] = raw_data_stg1['Sex'].isnull()
check_missing = raw_data_stg1[raw_data_stg1.check == True]
check_missing[['Sex','Salutation','Sex_imputed_1','Sex_imputed_2']]
Sex | Salutation | Sex_imputed_1 | Sex_imputed_2 | |
---|---|---|---|---|
12 | NaN | Mr. | male | male |
21 | NaN | Mr. | male | male |
26 | NaN | Mr. | male | male |
29 | NaN | Mr. | male | male |
42 | NaN | Mr. | male | male |
# let's now check the data where age is missing
raw_data_stg1['check'] = raw_data_stg1['Age'].isnull()
check_missing = raw_data_stg1[raw_data_stg1.check == True]
check_missing.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_imputed_1 | Salutation_temp | Salutation | check | Sex_imputed_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | male | Mr. James | Mr. | True | male |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S | male | Mr. Charles Eugene | Mr. | True | male |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C | female | Mrs. Fatima | Mrs. | True | female |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | NaN | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C | male | Mr. Farred Chehab | Mr. | True | male |
28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q | female | Miss. Ellen "Nellie" | Miss. | True | female |
# Another usage of salutation variable - We can calculate the bucket wise median age for
# Master, Mr, Ms. Mrs. and can do a better imputation
raw_data_stg1['median_age'] = raw_data_stg1.groupby(['Salutation'])['Age'].transform('median')
raw_data_stg1['Age'] = raw_data_stg1['Age'].fillna(value = raw_data_stg1['median_age'])
raw_data_stg1.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_imputed_1 | Salutation_temp | Salutation | check | Sex_imputed_2 | median_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | male | Mr. Owen Harris | Mr. | False | male | 30.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | female | Mrs. John Bradley (Florence Briggs Thayer) | Mrs. | False | female | 35.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | female | Miss. Laina | Miss. | False | female | 21.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | female | Mrs. Jacques Heath (Lily May Peel) | Mrs. | False | female | 35.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | male | Mr. William Henry | Mr. | False | male | 30.0 |
raw_data_stg1.isna().sum()
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 5 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 Sex_imputed_1 0 Salutation_temp 0 Salutation 0 check 0 Sex_imputed_2 0 median_age 0 dtype: int64
raw_data_stg1['Embarked'].value_counts()
S 644 C 168 Q 77 Name: Embarked, dtype: int64
raw_data_stg1['Embarked'].mode()[0]
'S'
# For categorical variables, let's crete missing as a seperate category
raw_data_stg1['Embarked'] = raw_data_stg1['Embarked'].fillna(value = raw_data_stg1['Embarked'].mode()[0])
For the last variable with missing values - Cabin, we can drop the variable, since there are too many missing values.
However, dropping a variable should be last resort, if We can get some information out of it
# We created a dummy variable - if it is present, calling it 1 else calling it 0
raw_data_stg1['Cabin_ind'] = np.where(raw_data_stg1['Cabin'].isnull(), 0, 1)
raw_data_stg1.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_imputed_1 | Salutation_temp | Salutation | check | Sex_imputed_2 | median_age | Cabin_ind | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | male | Mr. Owen Harris | Mr. | False | male | 30.0 | 0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | female | Mrs. John Bradley (Florence Briggs Thayer) | Mrs. | False | female | 35.0 | 1 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | female | Miss. Laina | Miss. | False | female | 21.0 | 0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | female | Mrs. Jacques Heath (Lily May Peel) | Mrs. | False | female | 35.0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | male | Mr. William Henry | Mr. | False | male | 30.0 | 0 |
# This is just for demo purpose, we can use threshold method to drop a variables which has less populated values
# thresh - threshold for non-missing values
# Suppose, we want to keep varaibles which have at least 95% data
Threshold_cutoff = round(0.95* raw_data_stg1.shape[0],0)
Threshold_cutoff
x = raw_data_stg1.dropna(thresh = Threshold_cutoff, axis = 1)
x.shape
(891, 18)
# Not recommended, but following can be used to delete all the records with any missing values in any column
x = raw_data_stg1.dropna(axis = 0)
# axis = 0 means row wise operation, axis = 1 means column wise operation
raw_data_stg1.isna().sum()
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 5 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 0 Sex_imputed_1 0 Salutation_temp 0 Salutation 0 check 0 Sex_imputed_2 0 median_age 0 Cabin_ind 0 dtype: int64
raw_data_stg1.shape
(891, 19)
Module 2- Outlier Treatment¶
raw_data_stg2 = raw_data_stg1.copy()
# Seaborn is one of the commonly used libraries for data visualization using graphs and charts
import seaborn as sns
sns.set()
sns.set(style="darkgrid")
#generate histogram of all contionous variables
raw_data_stg2[['Age','Fare']].hist(figsize=(15,30),layout=(9,3));
# Let's first check age variable
sns.boxplot(data = raw_data_stg1, x = 'Age', orient='horizontal')
<AxesSubplot:xlabel='Age'>
Although, in the box plot, it is showing data to have outlier (data outside the whiskers), however as per common/business sense age 80 is totally possible and hence using the principle if a data point can be explained, it is not an outlier, we will not treat it.
sns.set(style="darkgrid")
sns.boxplot(data = raw_data_stg1, x = 'Fare', orient='horizontal')
<AxesSubplot:xlabel='Fare'>
Q1 = raw_data_stg2['Fare'].quantile(0.25)
Q3 = raw_data_stg2['Fare'].quantile(0.75)
whisker_1 = Q1 - (1.5*(Q3-Q1))
whisker_2 = Q3 + (1.5*(Q3-Q1))
whisker_1, whisker_2
(-26.724, 65.6344)
Fare_dist = ps.sqldf('''
select Fare
, count(Fare) as N
from raw_data_stg2
group by Fare
order by Fare desc
''')
Fare_dist.head()
Fare | N | |
---|---|---|
0 | 512.3292 | 3 |
1 | 263.0000 | 4 |
2 | 262.3750 | 2 |
3 | 247.5208 | 2 |
4 | 227.5250 | 4 |
# Using upper bound from the whisker plot, we can categorized the fare into "High" and "Low" categories
raw_data_stg2['High_FARE'] = 'No'
raw_data_stg2.loc[raw_data_stg2['Fare'] > 66 , 'High_FARE'] ='Yes'
raw_data_stg2.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_imputed_1 | Salutation_temp | Salutation | check | Sex_imputed_2 | median_age | Cabin_ind | High_FARE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | male | Mr. Owen Harris | Mr. | False | male | 30.0 | 0 | No |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | female | Mrs. John Bradley (Florence Briggs Thayer) | Mrs. | False | female | 35.0 | 1 | Yes |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | female | Miss. Laina | Miss. | False | female | 21.0 | 0 | No |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | female | Mrs. Jacques Heath (Lily May Peel) | Mrs. | False | female | 35.0 | 1 | No |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | male | Mr. William Henry | Mr. | False | male | 30.0 | 0 | No |
cross_tab = pd.crosstab(index=raw_data_stg2['High_FARE'], columns=raw_data_stg2['Survived'])
print(cross_tab)
Survived 0 1 High_FARE No 512 263 Yes 37 79
In this case also, Looks like population with "High fare" is having signifactly higher survival rate, so it is not logical to treat the variable for outlier. We rather need to retain this information.
Looks like rich people got better opportunity for survival!
But suppose, we need to treat the outlier, it can be done following way.¶
This is just for demonstration purpose!
raw_data_stg2['Fare_demo'] =raw_data_stg2.Fare.apply(lambda x: whisker_2 if x > whisker_2
else whisker_1 if x < whisker_1
else x)
sns.set()
sns.set(style="darkgrid")
sns.boxplot(data = raw_data_stg2, x = 'Fare_demo', orient='horizontal')
<AxesSubplot:xlabel='Fare_demo'>
raw_data_stg2.head(2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_imputed_1 | Salutation_temp | Salutation | check | Sex_imputed_2 | median_age | Cabin_ind | High_FARE | Fare_demo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | male | Mr. Owen Harris | Mr. | False | male | 30.0 | 0 | No | 7.2500 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | female | Mrs. John Bradley (Florence Briggs Thayer) | Mrs. | False | female | 35.0 | 1 | Yes | 65.6344 |
Module 3- Variables Transformation¶
The maximum time and efforts of any model building project should be spent in this phase only, one needs to be as creative as possible with feature engineering, to unlock maximum potential of the data.
Golden rule : the better the variables, the better the model
For classification Model, following are the key transformations used:¶
>>> For Nominal categorical variables, we create dummy variables - also called hot encoding, will show here¶
>>> For Ordinal categorical variables, we should create index varaibles - also called label encoding, will show here¶
>>> Continuous variables are transformed with either:¶
Binning method
Scaling method
>>> We should try to derive new variables by:¶
Logical derivations e.g. Age from DOB
Variables interactions
Smart variables with the help of a Deicison Tree
>>> Combinig sparse classes¶
raw_data_stg3 = raw_data_stg2.copy()
A. Combining Sparse Classes¶
Often while creating dummy variables for a categorical variables, we end-up getting too many variables. Suppose there are N classes, ideally we get N-1 dummy variables for it. What if N is 1000, we will get 999 varaibles, which is a very high number.
What can be done in order to control the # of variables?
- Method 1 : Using WOE method, as explained later in this module, we can combine the classes and reduce number of variables
- Method 2 : Combining sparse class logically
# In this data ,we don't have sparse classes so we are skipping it. In another project, we will demo it
# WOE concept and method has been explained a little later in this blog
B : Hot encoding for dummy variables creation for Nominal varaibles e.g. Sex_imputed_2¶
# In this step, you can either use Sex_imputed_2 , Sex_imputed_1 computed from above steps.
# as a matter of best practice, I am prefering Sex_imputed_2 since it is more rational and logical
raw_data_stg3['Sex'] = raw_data_stg3['Sex_imputed_2']
# dropping the variables
raw_data_stg3.drop(['Sex_imputed_2','Sex_imputed_1'], axis=1, inplace=True)
var_to_be_encoded = ['Sex','Embarked']
data_encoded = raw_data_stg3[var_to_be_encoded]
data_encoded = pd.get_dummies(data_encoded)
data_encoded.head(2)
Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 |
1 | 1 | 0 | 1 | 0 | 0 |
##Adding back the encoded dummy variables data to the main dataframe
raw_data_stg3 = pd.concat([raw_data_stg3,data_encoded], axis = 1)
raw_data_stg3.head(2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Salutation_temp | Salutation | check | median_age | Cabin_ind | High_FARE | Fare_demo | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr. Owen Harris | Mr. | False | 30.0 | 0 | No | 7.2500 | 0 | 1 | 0 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs. John Bradley (Florence Briggs Thayer) | Mrs. | False | 35.0 | 1 | Yes | 65.6344 | 1 | 0 | 1 | 0 | 0 |
raw_data_stg3.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Salutation_temp', 'Salutation', 'check', 'median_age', 'Cabin_ind', 'High_FARE', 'Fare_demo', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S'], dtype='object')
C. Label encoding for ordinal varaibles¶
# As such source sould not be a ordinal variables, but for illustration purpose, let's consider it to be so
raw_data_stg3.Embarked.unique()
# In that case, instead of creating dummy varaibles, we should create an "Label encoded ordinal variable"
# If we create dummy for oridinal varaible, we lose out "order" information
# For creating label encoded variables, we need to know the order of the values in the variable,
# Suppose in the variables "Embarked" --> Q > C > S, we can do following:
array(['S', 'C', 'Q'], dtype=object)
# This variables has been created just for the demo purpose
raw_data_stg3['Embarked_DEMO_Label_Encoding'] = raw_data_stg3.Embarked.map({'S' : 1,
'C' : 2,
'Q' : 3})
raw_data_stg3.head(2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Salutation_temp | Salutation | check | median_age | Cabin_ind | High_FARE | Fare_demo | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Embarked_DEMO_Label_Encoding | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr. Owen Harris | Mr. | False | 30.0 | 0 | No | 7.2500 | 0 | 1 | 0 | 0 | 1 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs. John Bradley (Florence Briggs Thayer) | Mrs. | False | 35.0 | 1 | Yes | 65.6344 | 1 | 0 | 1 | 0 | 0 | 2 |
raw_data_stg3.dtypes
PassengerId int64 Survived int64 Pclass int64 Name object Sex object Age float64 SibSp int64 Parch int64 Ticket object Fare float64 Cabin object Embarked object Salutation_temp object Salutation object check bool median_age float64 Cabin_ind int32 High_FARE object Fare_demo float64 Sex_female uint8 Sex_male uint8 Embarked_C uint8 Embarked_Q uint8 Embarked_S uint8 Embarked_DEMO_Label_Encoding int64 dtype: object
D. Logical derivation of the new variables¶
This is one of the most important part of machine learning exercise, where data scientist needs to be creative as well as domain savvy. In-depth knowledge of business domain can help data scientist to come up with powerful and logical variables.
With limited data in this case, we don't have much scope, but then "never say no"!
Let's try to derive at least one variable!
raw_data_stg3['Family'] = raw_data_stg3['SibSp'] + raw_data_stg3['Parch']
to_scale = raw_data_stg3[['Family']]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(to_scale)
scaled_data_M1 = pd.DataFrame(scaled_data, columns=['Family'])
scaled_data_M1 = scaled_data_M1.rename(columns = {'Family':'Family_Scaled'})
# Standard scaler is used to remove scale from the continuous numerical variables
raw_data_stg3 = pd.concat([raw_data_stg3,scaled_data_M1], axis = 1)
raw_data_stg3.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Salutation_temp | Salutation | check | median_age | Cabin_ind | High_FARE | Fare_demo | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Embarked_DEMO_Label_Encoding | Family | Family_Scaled | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr. Owen Harris | Mr. | False | 30.0 | 0 | No | 7.2500 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0.059160 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs. John Bradley (Florence Briggs Thayer) | Mrs. | False | 35.0 | 1 | Yes | 65.6344 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | 0.059160 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Miss. Laina | Miss. | False | 21.0 | 0 | No | 7.9250 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | -0.560975 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Mrs. Jacques Heath (Lily May Peel) | Mrs. | False | 35.0 | 1 | No | 53.1000 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0.059160 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | Mr. William Henry | Mr. | False | 30.0 | 0 | No | 8.0500 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | -0.560975 |
E. Treat continuous varaibles - let's do "BINNING" on age and scaling on purchase value¶
There are multiple methods to create Bin
- Plot and decide based on your judgement
- Make small bin, and then group small bin on the basis of Weigth of Evidence, as explained by Naeem Siddiqui in his book - Credit Risk Scorecard
- Draw a decision tree, which give a better binning based on actual data
Method 1 : Plot and decide based on your judgement¶
max_value = raw_data_stg3['Age'].max()
print(max_value)
80.0
import matplotlib.pyplot as plt
# to enable the displaying of plots in the notebook itself
%matplotlib inline
# define title
plt.title('Age Distribution')
# define xlabel
plt.xlabel('Age')
# define ylabel
plt.ylabel('Frequency')
# plot the histogram
plt.hist(raw_data_stg3['Age'], bins=99, color='orange');
# Method 1
# Since there is not enough data after age 50, it became a "sparse class"
#, hence we clubbed all age above 50 into one cateogry
# Let's make a functions that can be resused to make Bins, we are avoiding using pd.cut method
def bin_formation():
for i in range(0,len(Bin_input)):
if i == len(Bin_input)-1:
pass
else:
start_value = Bin_input[i]
end_value = Bin_input[i+1]
label = str(start_value) + "_" + str(end_value)
Bin_output.append([start_value,end_value,label])
def bin_based_class(var):
bin_formation()
for i in range(0,len(Bin_output)):
start_value = Bin_output[i][0]
end_value = Bin_output[i][1]
label_value = Bin_output[i][2]
if var >= start_value and var <= end_value:
var_output = label_value
return(var_output)
# For using above function, we need to define two objects:
# Bin_input - a list with desired cut values
# Bin_output - a Blank List
#NOTE : This function is a copyright of Ask Analytics
Bin_input = [0,20,40,60,100]
Bin_output =[]
raw_data_stg3['age_bins_M1'] = raw_data_stg3['Age'].apply(bin_based_class)
raw_data_stg3[['Age','age_bins_M1']].head(5)
Age | age_bins_M1 | |
---|---|---|
0 | 22.0 | 20_40 |
1 | 38.0 | 20_40 |
2 | 26.0 | 20_40 |
3 | 35.0 | 20_40 |
4 | 35.0 | 20_40 |
Method 2 - using Weight of evidence method, for which first we define the smaller bins, and then we can club micro- bins into macro-bins¶
# Let's first define micro-bins
Bin_input = []
for i in range (0,100,10):
Bin_input.append(i)
Bin_output =[]
print(Bin_input)
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
raw_data_stg3['age_bins_M2'] = raw_data_stg3['Age'].apply(bin_based_class)
raw_data_stg3[['Age','age_bins_M2']].head(5)
Age | age_bins_M2 | |
---|---|---|
0 | 22.0 | 20_30 |
1 | 38.0 | 30_40 |
2 | 26.0 | 20_30 |
3 | 35.0 | 30_40 |
4 | 35.0 | 30_40 |
# Let's now calculate the WOE table:
WOE_1 = ps.sqldf('''
select age_bins_M2
, sum(case when Survived = 1 then 1 else 0 END) as GOOD_NUM
, sum(case when Survived = 0 then 1 else 0 END) as BAD_NUM
from raw_data_stg3
group by age_bins_M2
''')
WOE_1
age_bins_M2 | GOOD_NUM | BAD_NUM | |
---|---|---|---|
0 | 0_10 | 40 | 28 |
1 | 10_20 | 44 | 71 |
2 | 20_30 | 120 | 265 |
3 | 30_40 | 83 | 89 |
4 | 40_50 | 33 | 54 |
5 | 50_60 | 17 | 25 |
6 | 60_70 | 4 | 13 |
7 | 70_80 | 1 | 4 |
sum_good = WOE_1.GOOD_NUM.sum()
sum_bad = WOE_1.BAD_NUM.sum()
sum_good, sum_bad
(342, 549)
WOE_1['Good_precent'] = WOE_1.GOOD_NUM.apply(lambda x: round(x/sum_good*100,1))
WOE_1['Bad_precent']= WOE_1.BAD_NUM.apply(lambda x: round(x/sum_bad*100,1))
WOE_1
age_bins_M2 | GOOD_NUM | BAD_NUM | Good_precent | Bad_precent | |
---|---|---|---|---|---|
0 | 0_10 | 40 | 28 | 11.7 | 5.1 |
1 | 10_20 | 44 | 71 | 12.9 | 12.9 |
2 | 20_30 | 120 | 265 | 35.1 | 48.3 |
3 | 30_40 | 83 | 89 | 24.3 | 16.2 |
4 | 40_50 | 33 | 54 | 9.6 | 9.8 |
5 | 50_60 | 17 | 25 | 5.0 | 4.6 |
6 | 60_70 | 4 | 13 | 1.2 | 2.4 |
7 | 70_80 | 1 | 4 | 0.3 | 0.7 |
def WOE_CALC(x,y):
if y == 0:
z = 0
else:
z = mt.log(x / y)
return(z)
WOE_CALC(20,2)
2.302585092994046
WOE_1['WOE'] = WOE_1.apply(lambda row: WOE_CALC(row['Good_precent'], row['Bad_precent']), axis=1)
WOE_1
age_bins_M2 | GOOD_NUM | BAD_NUM | Good_precent | Bad_precent | WOE | |
---|---|---|---|---|---|---|
0 | 0_10 | 40 | 28 | 11.7 | 5.1 | 0.830348 |
1 | 10_20 | 44 | 71 | 12.9 | 12.9 | 0.000000 |
2 | 20_30 | 120 | 265 | 35.1 | 48.3 | -0.319230 |
3 | 30_40 | 83 | 89 | 24.3 | 16.2 | 0.405465 |
4 | 40_50 | 33 | 54 | 9.6 | 9.8 | -0.020619 |
5 | 50_60 | 17 | 25 | 5.0 | 4.6 | 0.083382 |
6 | 60_70 | 4 | 13 | 1.2 | 2.4 | -0.693147 |
7 | 70_80 | 1 | 4 | 0.3 | 0.7 | -0.847298 |
WOE_1['Information_Value'] = (WOE_1['Good_precent'] - WOE_1['Bad_precent'])* WOE_1['WOE']/100
WOE_1
age_bins_M2 | GOOD_NUM | BAD_NUM | Good_precent | Bad_precent | WOE | Information_Value | |
---|---|---|---|---|---|---|---|
0 | 0_10 | 40 | 28 | 11.7 | 5.1 | 0.830348 | 0.054803 |
1 | 10_20 | 44 | 71 | 12.9 | 12.9 | 0.000000 | 0.000000 |
2 | 20_30 | 120 | 265 | 35.1 | 48.3 | -0.319230 | 0.042138 |
3 | 30_40 | 83 | 89 | 24.3 | 16.2 | 0.405465 | 0.032843 |
4 | 40_50 | 33 | 54 | 9.6 | 9.8 | -0.020619 | 0.000041 |
5 | 50_60 | 17 | 25 | 5.0 | 4.6 | 0.083382 | 0.000334 |
6 | 60_70 | 4 | 13 | 1.2 | 2.4 | -0.693147 | 0.008318 |
7 | 70_80 | 1 | 4 | 0.3 | 0.7 | -0.847298 | 0.003389 |
We can also calculate the IV here to judge whether is variable is a strong variable or not:
http://www.askanalytics.in/2015/09/concept-of-woe-and-iv.html
WOE_1['Information_Value'].sum()
# value of 0.14 means it is a very strong variable
0.14186580109668284
Here's a general guideline to interpret the strength of correlation based on IV values:
WOE_1 < 0.02: Indicates no practical predictive power. 0.02 ≤ WOE_1 < 0.1: Suggests weak predictive power. 0.1 ≤ WOE_1 < 0.3: Suggests moderate predictive power. 0.3 ≤ WOE_1 < 0.5: Indicates strong predictive power. WOE_1 ≥ 0.5: Suggests very strong predictive power.
So, from above information value, we can take that age has moderate predictive power on survived people
#Let's plot the WOE
sns.barplot(x="age_bins_M2", y="WOE", data=WOE_1)
<AxesSubplot:xlabel='age_bins_M2', ylabel='WOE'>
# Using Above Plot, we can now Bin age into:
# O to 20
# 20 to 30
# 30 to 40
# 40 to 60
# 60 and above
Bin_input = [0,20,30,40,60,90]
Bin_output =[]
raw_data_stg3['age_bins_M2'] = raw_data_stg3['Age'].apply(bin_based_class)
raw_data_stg3[['Age','age_bins_M2']].head(5)
Age | age_bins_M2 | |
---|---|---|
0 | 22.0 | 20_30 |
1 | 38.0 | 30_40 |
2 | 26.0 | 20_30 |
3 | 35.0 | 30_40 |
4 | 35.0 | 30_40 |
raw_data_stg3.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Salutation_temp | Salutation | check | median_age | Cabin_ind | High_FARE | Fare_demo | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Embarked_DEMO_Label_Encoding | Family | Family_Scaled | age_bins_M1 | age_bins_M2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr. Owen Harris | Mr. | False | 30.0 | 0 | No | 7.2500 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0.059160 | 20_40 | 20_30 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs. John Bradley (Florence Briggs Thayer) | Mrs. | False | 35.0 | 1 | Yes | 65.6344 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | 0.059160 | 20_40 | 30_40 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Miss. Laina | Miss. | False | 21.0 | 0 | No | 7.9250 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | -0.560975 | 20_40 | 20_30 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Mrs. Jacques Heath (Lily May Peel) | Mrs. | False | 35.0 | 1 | No | 53.1000 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0.059160 | 20_40 | 30_40 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | Mr. William Henry | Mr. | False | 30.0 | 0 | No | 8.0500 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | -0.560975 | 20_40 | 30_40 |
Method 3 - Binning with the help of decision tree model¶
#seperating independent and dependent variables
y = raw_data_stg3['Survived']
x = raw_data_stg3['Age'].values.reshape(-1,1)
#importing decision tree classifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
#creating the decision tree function
dt_model = DecisionTreeClassifier(random_state=10, max_depth=4)
#fitting the model
model = dt_model.fit(x, y)
plt.figure(figsize=(12,12))
tree.plot_tree(dt_model)
plt.show()
From Above Tree, we can identify the optimal bucket for age to be <=10, 30 to 50 and More than 50¶
Bin_input = [0,10,30,50,100]
Bin_output =[]
raw_data_stg3['age_bins_M3'] = raw_data_stg3['Age'].apply(bin_based_class)
raw_data_stg3[['Age','age_bins_M3']].head(5)
Age | age_bins_M3 | |
---|---|---|
0 | 22.0 | 10_30 |
1 | 38.0 | 30_50 |
2 | 26.0 | 10_30 |
3 | 35.0 | 30_50 |
4 | 35.0 | 30_50 |
raw_data_stg3.age_bins_M3.unique()
array(['10_30', '30_50', '50_100', '0_10'], dtype=object)
# Having learnt 3 methods for Binning, let use method 3 Bins. We now need to label encode the bin
# since variable is ordinal in nature
raw_data_stg3['age_encoded'] = raw_data_stg3.age_bins_M3.map({'0_10' : 1,
'10_30' : 2,
'30_50' : 3,
'50_100' : 4})
raw_data_stg3.head(2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Salutation_temp | Salutation | check | median_age | Cabin_ind | High_FARE | Fare_demo | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Embarked_DEMO_Label_Encoding | Family | Family_Scaled | age_bins_M1 | age_bins_M2 | age_bins_M3 | age_encoded | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr. Owen Harris | Mr. | False | 30.0 | 0 | No | 7.2500 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0.05916 | 20_40 | 20_30 | 10_30 | 2 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs. John Bradley (Florence Briggs Thayer) | Mrs. | False | 35.0 | 1 | Yes | 65.6344 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | 0.05916 | 20_40 | 30_40 | 30_50 | 3 |
F. Variable Scaling¶
There are two popular methods for scaling
- Standard Scaling (using Mean and Standard deviation) - In this method output values can be both -ve and +ve
- Min-Max Scaling - value in this method comes in range of 0 to 1
Maths behind Standard Scaling Z = (X - Mean) / Standard Deviation
Maths behind Min-Max Scaling K = (X - Min) / (Max - min)
Method 1 - Standard Scaling¶
to_scale = raw_data_stg3[['Fare']]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(to_scale)
scaled_data_M1 = pd.DataFrame(scaled_data, columns=['Fare_M1'])
scaled_data_M1.head()
Fare_M1 | |
---|---|
0 | -0.502445 |
1 | 0.786845 |
2 | -0.488854 |
3 | 0.420730 |
4 | -0.486337 |
Method 2 - Min-Max Scaling¶
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(to_scale)
scaled_data_M2 = pd.DataFrame(scaled_data, columns=['Fare_M2'])
scaled_data_M2.head()
Fare_M2 | |
---|---|
0 | 0.014151 |
1 | 0.139136 |
2 | 0.015469 |
3 | 0.103644 |
4 | 0.015713 |
# Scaling doesnt change the distribution of the variables, we can visualize it
both_scaled_vars = pd.concat([scaled_data_M1,scaled_data_M2],axis = 1)
both_scaled_vars.shape
(891, 2)
sns.relplot(x="Fare_M1", y="Fare_M2", data=both_scaled_vars, kind="scatter")
# You will find the variables are 100% correlated
<seaborn.axisgrid.FacetGrid at 0x251f155bac0>
raw_data_stg3.shape, both_scaled_vars.shape
((891, 31), (891, 2))
raw_data_stg3 = raw_data_stg3.reset_index(drop=True)
both_scaled_vars = both_scaled_vars.reset_index(drop=True)
raw_data_stg3 = pd.concat([raw_data_stg3,both_scaled_vars], axis = 1)
temp = raw_data_stg3[['Survived','Pclass','age_encoded','Family_Scaled','Fare_M2','Cabin_ind','Sex_female','Embarked_C','Embarked_Q']]
temp.head()
Survived | Pclass | age_encoded | Family_Scaled | Fare_M2 | Cabin_ind | Sex_female | Embarked_C | Embarked_Q | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 2 | 0.059160 | 0.014151 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 3 | 0.059160 | 0.139136 | 1 | 1 | 1 | 0 |
2 | 1 | 3 | 2 | -0.560975 | 0.015469 | 0 | 1 | 0 | 0 |
3 | 1 | 1 | 3 | 0.059160 | 0.103644 | 1 | 1 | 0 | 0 |
4 | 0 | 3 | 3 | -0.560975 | 0.015713 | 0 | 0 | 0 | 0 |
corr_matrix = temp.corr()
# Plot the correlation matrix using seaborn
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm',annot_kws={"size": 10})
<AxesSubplot:>
# Since Pclass and Cabin_Ind are highly correlated, it is suggested not to take both variables in the model
# As it would lead to multicollinearity
F. Variables interaction¶
Often variables inherently are not strong explainers, but become very stronger varriables when combined with each other.
# This we will cover in some other blog
Module 4 - Model Creation¶
- Let's first build a logistic regression model as our first classification model and discuss in details
- We will build other models - Decision Tree, Random Forest, XGBoost, SVM etc.
The Logistic Regression Model¶
raw_data_stg3.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Salutation_temp', 'Salutation', 'check', 'median_age', 'Cabin_ind', 'High_FARE', 'Fare_demo', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Embarked_DEMO_Label_Encoding', 'Family', 'Family_Scaled', 'age_bins_M1', 'age_bins_M2', 'age_bins_M3', 'age_encoded', 'Fare_M1', 'Fare_M2'], dtype='object')
# First let's select the variables to be used, removing extra varaibles that we creted for demonstration purpuse only
raw_data_stg4 = raw_data_stg3[['Pclass', 'Sex_female','Family_Scaled',
'age_encoded','Fare_M2','Survived']]
raw_data_stg4.head()
Pclass | Sex_female | Family_Scaled | age_encoded | Fare_M2 | Survived | |
---|---|---|---|---|---|---|
0 | 3 | 0 | 0.059160 | 2 | 0.014151 | 0 |
1 | 1 | 1 | 0.059160 | 3 | 0.139136 | 1 |
2 | 3 | 1 | -0.560975 | 2 | 0.015469 | 1 |
3 | 1 | 1 | 0.059160 | 3 | 0.103644 | 1 |
4 | 3 | 0 | -0.560975 | 3 | 0.015713 | 0 |
As a first step of modeling exercise, we need to create separate data for¶
- y variable
- x variables
x = raw_data_stg4.drop(['Survived'], axis =1)
y = raw_data_stg4['Survived']
x.shape, y.shape
((891, 5), (891,))
# Importing from sklearn, train test split function
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(x,y, test_size = 0.3, random_state = 45,stratify = y)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, random_state= 1234, test_size = 0.5)
- Train data - The data on which we will build the model
Test data - The data on which we will check the performance of model
test_size defines the proportion of data you want to keep for testing/validation
- random_state - If not defined, everytime you run the code, since selection is random, you get a new sample everytime. To make the sampling static, we use it
- stratify - Maintains the proportion of 1 and 0 same across the traing and test data
In the above code, we broke data into 3 portions:
- Train data - 70%
- Test data = 15%
- Validation data = 15%
Validation is to test FINAL model on completely unseen data
X_train.shape, X_test.shape, X_val.shape
((623, 5), (134, 5), (134, 5))
y_train.value_counts(), y_test.value_counts(), y_val.value_counts()
(0 384 1 239 Name: Survived, dtype: int64, 0 87 1 47 Name: Survived, dtype: int64, 0 78 1 56 Name: Survived, dtype: int64)
Before running the model, it is useful to remove the variables which have "Multicollinearity"¶
- If independent variables have relation(s) among each other, it is best practice to remove such variables
- In case of having Multicollinearity, the beta coefficients are not true representative of the relationship between y and X
#importing the model building and model evaluation libaries
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, confusion_matrix, roc_curve
from sklearn.metrics import precision_score, recall_score, precision_recall_curve
from sklearn.feature_selection import RFE
# Creating an instance of Logistic Regresssion
Logistic_M = LogisticRegression()
# Fitting the Logistic Regresssion model
Logistic_model = Logistic_M.fit(X_train, y_train)
# Dumping the model in a pickle file so that it can be reuse it without the need to re-train it,
# and also can be deployed for production.
import joblib
joblib.dump(Logistic_model, 'Logistic_model.pkl')
['Logistic_model.pkl']
# To predict the model output, in term of 1-0
train_prediction_1_0 = Logistic_model.predict(X_train)
train_prediction_1_0[1:10]
array([1, 0, 1, 0, 1, 1, 0, 0, 1], dtype=int64)
# To predict the model output, in term of probability
train_prediction_p = Logistic_model.predict_proba(X_train)
train_prediction_p[1:10]
# Above gives probability of 0 and 1 respectively
array([[0.39238269, 0.60761731], [0.62347228, 0.37652772], [0.15249208, 0.84750792], [0.97047798, 0.02952202], [0.26806558, 0.73193442], [0.33167873, 0.66832127], [0.87780636, 0.12219364], [0.71050905, 0.28949095], [0.33153475, 0.66846525]])
# To see probability of 1:
train_prediction_1 = train_prediction_p[:,1]
train_prediction_1[1:10]
array([0.60761731, 0.37652772, 0.84750792, 0.02952202, 0.73193442, 0.66832127, 0.12219364, 0.28949095, 0.66846525])
# Accuracy of the model on training, test and validation data
print(Logistic_model.score(X_train, y_train)*100, Logistic_model.score(X_test, y_test)*100 )
80.73836276083468 76.86567164179104
# Another method to calculate the accuracy score
accuracy_score(y_train,Logistic_model.predict(X_train)), accuracy_score(y_test,Logistic_model.predict(X_test))
(0.8073836276083467, 0.7686567164179104)
# Create the RFE (Recursive Feature Elimination) object and rank each feature
rfe = RFE(estimator=Logistic_M, n_features_to_select=1, step = 1)
rfe.fit(X_train, y_train)
ranking_df = pd.DataFrame()
ranking_df['Feature_name'] = X_train.columns
ranking_df['Rank'] = rfe.ranking_
ranked = ranking_df.sort_values(by=['Rank'])
ranked
Feature_name | Rank | |
---|---|---|
1 | Sex_female | 1 |
0 | Pclass | 2 |
4 | Fare_M2 | 3 |
3 | age_encoded | 4 |
2 | Family_Scaled | 5 |
# We have built a modular function for printing the variables coefficient in the model
def print_variables_coeff(model,X):
df = pd.DataFrame(columns=['Variable','Coefficient'])
intercept = round(model.intercept_[0],3)
new_row = {'Variable' : 'Intercept'
, 'Coefficient' : intercept
}
df = df.append(new_row, ignore_index=True)
for i in range(len(X.columns)):
varaible_name = X.columns[i]
coefficient = round(model.coef_[0,i],3)
new_row = {'Variable' : varaible_name
, 'Coefficient' : coefficient
}
df = df.append(new_row, ignore_index=True)
return(df)
print_variables_coeff(Logistic_model,X_train)
Variable | Coefficient | |
---|---|---|
0 | Intercept | 1.863 |
1 | Pclass | -1.007 |
2 | Sex_female | 2.671 |
3 | Family_Scaled | -0.345 |
4 | age_encoded | -0.510 |
5 | Fare_M2 | 0.918 |
plt.figure(figsize=(9, 5), dpi=120, facecolor='w', edgecolor='b')
x = range(len(X_train.columns))
c = Logistic_model.coef_.reshape(-1)
plt.bar( x, c )
plt.xlabel( "Variables")
plt.ylabel('Coefficients')
plt.title('Coefficient plot')
plt.xticks(x, X_train.columns)
([<matplotlib.axis.XTick at 0x251f1ccd8b0>, <matplotlib.axis.XTick at 0x251f1ccd910>, <matplotlib.axis.XTick at 0x251f1cd83d0>, <matplotlib.axis.XTick at 0x251f1d1b0a0>, <matplotlib.axis.XTick at 0x251f1d1b760>], [Text(0, 0, 'Pclass'), Text(1, 0, 'Sex_female'), Text(2, 0, 'Family_Scaled'), Text(3, 0, 'age_encoded'), Text(4, 0, 'Fare_M2')])
# Calculating F1-score:
# The F1-score is a widely used metric in machine learning and statistics that combines
# both precision and recall into a single value.
# The F1-score ranges between 0 and 1, where a higher value indicates better model
print(f1_score(Logistic_model.predict(X_train), y_train)
, f1_score(Logistic_model.predict(X_test), y_test)
)
0.7413793103448277 0.651685393258427
from sklearn.metrics import confusion_matrix
cf = confusion_matrix(y_test, Logistic_model.predict(X_test))
print(cf)
[[74 13] [18 29]]
from sklearn.metrics import classification_report as rep
print(rep( y_test , Logistic_model.predict(X_test) ))
precision recall f1-score support 0 0.80 0.85 0.83 87 1 0.69 0.62 0.65 47 accuracy 0.77 134 macro avg 0.75 0.73 0.74 134 weighted avg 0.76 0.77 0.77 134
# Calculating accuracy at particular probability cut-off level
model = LogisticRegression()
model.fit(X_train, y_train)
train_probs = model.predict_proba(X_train)[:, 1]
train_preds = (train_probs >= 0.1).astype(int)
train_preds
cm = confusion_matrix(y_train,train_preds)
train_accuracy = accuracy_score(y_train, train_preds)
train_accuracy
0.49759229534510435
We have built a cool function that Iterates probability cut-off value from 0 to 1 by 0.1 and at every cut-off value calculates - precision, recall, f1 score, TPR, TNR, FPR, FNR, Accuracy, for train and test¶
def classification_table_AA(model,X, y):
df = pd.DataFrame(columns=['Threshold','TP', 'FP','FN','TN','Accuracy','Precision','TPR_Recall','F1_Score'])
# Create an array of probability thresholds from 0 to 1 by 0.1
thresholds = np.arange(0, 1.1, 0.1)
for threshold in thresholds:
# Predict probabilities for train, test, and validation sets
probabilities = model.predict_proba(X)[:, 1]
# Convert probabilities to binary predictions based on the threshold
predicted_values = (probabilities >= threshold).astype(int)
# Compute evaluation metrics
precision = precision_score(y, predicted_values)
recall = recall_score(y, predicted_values)
f1 = f1_score(y, predicted_values)
confusion = confusion_matrix(y, predicted_values)
# Compute additional metrics
tpr = recall
tnr = confusion[0, 0] / (confusion[0, 0] + confusion[0, 1])
fpr = 1 - tnr
fnr = 1 - tpr
accuracy = accuracy_score(y, predicted_values)
# Saving the information in a dataframe
new_row = {'Threshold' : threshold
, 'TP' : confusion[0][0]
, 'FP' : confusion[0][1]
, 'FN' : confusion[1][0]
, 'TN' : confusion[1][1]
, 'Accuracy' : accuracy
, 'Precision' : precision
, 'TPR_Recall' : recall
, 'F1_Score' : f1
}
df = df.append(new_row, ignore_index=True)
return df
train_result = classification_table_AA(Logistic_model,X_train, y_train)
train_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 384.0 | 0.0 | 239.0 | 0.383628 | 0.383628 | 1.000000 | 0.554524 |
1 | 0.1 | 80.0 | 304.0 | 9.0 | 230.0 | 0.497592 | 0.430712 | 0.962343 | 0.595084 |
2 | 0.2 | 242.0 | 142.0 | 36.0 | 203.0 | 0.714286 | 0.588406 | 0.849372 | 0.695205 |
3 | 0.3 | 291.0 | 93.0 | 41.0 | 198.0 | 0.784912 | 0.680412 | 0.828452 | 0.747170 |
4 | 0.4 | 315.0 | 69.0 | 56.0 | 183.0 | 0.799358 | 0.726190 | 0.765690 | 0.745418 |
5 | 0.5 | 331.0 | 53.0 | 67.0 | 172.0 | 0.807384 | 0.764444 | 0.719665 | 0.741379 |
6 | 0.6 | 352.0 | 32.0 | 85.0 | 154.0 | 0.812199 | 0.827957 | 0.644351 | 0.724706 |
7 | 0.7 | 376.0 | 8.0 | 130.0 | 109.0 | 0.778491 | 0.931624 | 0.456067 | 0.612360 |
8 | 0.8 | 379.0 | 5.0 | 150.0 | 89.0 | 0.751204 | 0.946809 | 0.372385 | 0.534535 |
9 | 0.9 | 381.0 | 3.0 | 201.0 | 38.0 | 0.672552 | 0.926829 | 0.158996 | 0.271429 |
10 | 1.0 | 384.0 | 0.0 | 239.0 | 0.0 | 0.616372 | 0.000000 | 0.000000 | 0.000000 |
How to read and interpret above table:¶
- If data is imbalanced - We either check F1 Score (Higher is better), or we try to make the data balanced by methods such as under-sampling, over-sampling
- Accuracy = (correctly predicted)/ total observations
- Precision - Precision is important, when
- Recall - This is also called TPR (True Positive rate)
We can decide the probability cut off value as per our need by analyzing the above table.
test_result = classification_table_AA(Logistic_model,X_test, y_test)
test_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 87.0 | 0.0 | 47.0 | 0.350746 | 0.350746 | 1.000000 | 0.519337 |
1 | 0.1 | 13.0 | 74.0 | 3.0 | 44.0 | 0.425373 | 0.372881 | 0.936170 | 0.533333 |
2 | 0.2 | 46.0 | 41.0 | 9.0 | 38.0 | 0.626866 | 0.481013 | 0.808511 | 0.603175 |
3 | 0.3 | 58.0 | 29.0 | 14.0 | 33.0 | 0.679104 | 0.532258 | 0.702128 | 0.605505 |
4 | 0.4 | 68.0 | 19.0 | 15.0 | 32.0 | 0.746269 | 0.627451 | 0.680851 | 0.653061 |
5 | 0.5 | 74.0 | 13.0 | 18.0 | 29.0 | 0.768657 | 0.690476 | 0.617021 | 0.651685 |
6 | 0.6 | 82.0 | 5.0 | 19.0 | 28.0 | 0.820896 | 0.848485 | 0.595745 | 0.700000 |
7 | 0.7 | 86.0 | 1.0 | 22.0 | 25.0 | 0.828358 | 0.961538 | 0.531915 | 0.684932 |
8 | 0.8 | 87.0 | 0.0 | 28.0 | 19.0 | 0.791045 | 1.000000 | 0.404255 | 0.575758 |
9 | 0.9 | 87.0 | 0.0 | 37.0 | 10.0 | 0.723881 | 1.000000 | 0.212766 | 0.350877 |
10 | 1.0 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
# Ask Analytics designed a proprietary function for ploting ROC curve - it's our patent!
from sklearn.metrics import roc_curve
def ROC_Curve (model,y_train,X_train,y_test,X_test):
probabilities_train = model.predict_proba(X_train)[:, 1]
probabilities_test = model.predict_proba(X_test)[:, 1]
fpr_train, tpr_train, _ = roc_curve(y_train, probabilities_train)
auc_train = roc_auc_score(y_train, probabilities_train)
fpr_test, tpr_test, _ = roc_curve(y_test, probabilities_test)
auc_test = roc_auc_score(y_test, probabilities_test)
plt.figure(figsize=(9,5))
plt.plot(fpr_train,tpr_train,label = "Train AUC-ROC="+str(auc_train))
plt.plot(fpr_test,tpr_test,label = "Test AUC-ROC="+str(auc_test))
x = np.linspace(0, 1, 1000)
plt.plot(x, x, linestyle='-')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4)
plt.show()
# You can plot the ROC curve for any model
ROC_Curve (Logistic_model,y_train,X_train,y_test,X_test)
Perform Hyper-parameter Tuning on Logistic Regression¶
As a matter of practice, we first try using multiple methods of machine learning. Once we finalize the model, then we do hyper-parameter tuning on the final model. Here for demo purpose, we have done it on all models.
# Function to print the cross validation output in readable format
def CV_Print(results):
df = pd.DataFrame(columns=['Params','Means','Stds'])
x = results.best_params_
means = results.cv_results_['mean_test_score']
stds = results.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, results.cv_results_['params']):
new_row = {'Params' : params
, 'Means' : mean
, 'Stds' : std
}
df = df.append(new_row, ignore_index=True)
return df
parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 20, 30,40, 50, 100, 1000, 10000]}
cv_logistic = GridSearchCV(model, parameters, cv = 5)
cv_logistic.fit(X_train, y_train.values.ravel())
CV_Print(cv_logistic)
Params | Means | Stds | |
---|---|---|---|
0 | {'C': 0.001} | 0.616374 | 0.002591 |
1 | {'C': 0.01} | 0.723884 | 0.011665 |
2 | {'C': 0.1} | 0.800865 | 0.030033 |
3 | {'C': 1} | 0.800852 | 0.031379 |
4 | {'C': 10} | 0.800865 | 0.026854 |
5 | {'C': 20} | 0.800865 | 0.026854 |
6 | {'C': 30} | 0.800865 | 0.026854 |
7 | {'C': 40} | 0.802477 | 0.024361 |
8 | {'C': 50} | 0.802477 | 0.024361 |
9 | {'C': 100} | 0.802477 | 0.024361 |
10 | {'C': 1000} | 0.802477 | 0.024361 |
11 | {'C': 10000} | 0.802477 | 0.024361 |
Grid search within K-fold cross-validation is used in order to find the optimal hyper parameter settings for logistic regression that generates the best model on our data.
cv_logistic.best_estimator_
LogisticRegression(C=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(C=40)
Logistic_model_cv = cv_logistic.fit(X_train, y_train.values.ravel())
Logistic_model_cv.score(X_train, y_train), Logistic_model_cv.score(X_test, y_test)
(0.8073836276083467, 0.753731343283582)
# Let's use the function we built already to check the variables coefficients
# In CV model, we just need to add "best_estimator_" after model name in this
print_variables_coeff(Logistic_model_cv.best_estimator_,X_train)
Variable | Coefficient | |
---|---|---|
0 | Intercept | 1.696 |
1 | Pclass | -0.970 |
2 | Sex_female | 2.820 |
3 | Family_Scaled | -0.410 |
4 | age_encoded | -0.536 |
5 | Fare_M2 | 2.240 |
train_result = classification_table_AA(Logistic_model_cv,X_train, y_train)
train_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 384.0 | 0.0 | 239.0 | 0.383628 | 0.383628 | 1.000000 | 0.554524 |
1 | 0.1 | 93.0 | 291.0 | 12.0 | 227.0 | 0.513644 | 0.438224 | 0.949791 | 0.599736 |
2 | 0.2 | 243.0 | 141.0 | 36.0 | 203.0 | 0.715891 | 0.590116 | 0.849372 | 0.696398 |
3 | 0.3 | 293.0 | 91.0 | 44.0 | 195.0 | 0.783307 | 0.681818 | 0.815900 | 0.742857 |
4 | 0.4 | 314.0 | 70.0 | 56.0 | 183.0 | 0.797753 | 0.723320 | 0.765690 | 0.743902 |
5 | 0.5 | 330.0 | 54.0 | 66.0 | 173.0 | 0.807384 | 0.762115 | 0.723849 | 0.742489 |
6 | 0.6 | 350.0 | 34.0 | 84.0 | 155.0 | 0.810594 | 0.820106 | 0.648536 | 0.724299 |
7 | 0.7 | 373.0 | 11.0 | 125.0 | 114.0 | 0.781701 | 0.912000 | 0.476987 | 0.626374 |
8 | 0.8 | 379.0 | 5.0 | 149.0 | 90.0 | 0.752809 | 0.947368 | 0.376569 | 0.538922 |
9 | 0.9 | 381.0 | 3.0 | 190.0 | 49.0 | 0.690209 | 0.942308 | 0.205021 | 0.336770 |
10 | 1.0 | 384.0 | 0.0 | 239.0 | 0.0 | 0.616372 | 0.000000 | 0.000000 | 0.000000 |
# You can plot the ROC curve at any probability cut-off
ROC_Curve (Logistic_model_cv,y_train,X_train,y_test,X_test)
train_probs = Logistic_model_cv.predict_proba(X_train)[:, 1]
train_probs[1:10]
array([0.83039557, 0.39157563, 0.8564838 , 0.02282676, 0.75015313, 0.68953242, 0.1166457 , 0.27666506, 0.68987168])
Decision Tree and Prune the tree with Hyperparameter Tunning¶
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier(random_state = 1234)
DT_model = DT.fit(X_train, y_train)
DT_model.score(X_train, y_train), DT_model.score(X_test, y_test)
(0.9454253611556982, 0.8208955223880597)
# Let's visualize the Tree Model, it's always relaxing to watch natural scenery
plt.figure(figsize = (50,40))
text_params = {'fontsize': 10}
tree.plot_tree(DT_model,
feature_names = list(X_train.columns),
class_names = ['0','1'],
filled = True);
plt.figure(figsize = (15,15))
text_params = {'fontsize': 12}
tree.plot_tree(DT_model,
feature_names = list(X_train.columns),
class_names = ['0','1'],
filled = True
, max_depth = 2);
# Check the variables which are coming as most important variables in the model
importance = DT_model.feature_importances_
feature_importance = pd.Series(importance, index = X_train.columns)
feature_importance = feature_importance.sort_values(ascending= [False])
feature_importance.plot(kind = 'bar')
plt.ylabel('Importance')
Text(0, 0.5, 'Importance')
Prune the classification tree with Hyperparameter Tunning¶
grid = {'max_depth': [2, 3, 4, 5],
'min_samples_split': [2, 3, 4],
'min_samples_leaf': range(5, 10)}
from sklearn.model_selection import GridSearchCV
classifier = DecisionTreeClassifier(random_state = 1234)
DT_CV = GridSearchCV(estimator = classifier, param_grid = grid)
DT_CV.fit(X_train, y_train)
GridSearchCV(estimator=DecisionTreeClassifier(random_state=1234), param_grid={'max_depth': [2, 3, 4, 5], 'min_samples_leaf': range(5, 10), 'min_samples_split': [2, 3, 4]})In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(estimator=DecisionTreeClassifier(random_state=1234), param_grid={'max_depth': [2, 3, 4, 5], 'min_samples_leaf': range(5, 10), 'min_samples_split': [2, 3, 4]})
DecisionTreeClassifier(random_state=1234)
DecisionTreeClassifier(random_state=1234)
DT_CV.best_estimator_
DecisionTreeClassifier(max_depth=5, min_samples_leaf=8, random_state=1234)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=5, min_samples_leaf=8, random_state=1234)
DT_model_CV = DT_CV.fit(X_train, y_train.values.ravel())
DT_model_CV.score(X_train, y_train), DT_model_CV.score(X_test, y_test)
(0.8491171749598716, 0.8059701492537313)
train_result = classification_table_AA(DT_model_CV,X_train, y_train)
train_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 384.0 | 0.0 | 239.0 | 0.383628 | 0.383628 | 1.000000 | 0.554524 |
1 | 0.1 | 105.0 | 279.0 | 4.0 | 235.0 | 0.545746 | 0.457198 | 0.983264 | 0.624170 |
2 | 0.2 | 303.0 | 81.0 | 31.0 | 208.0 | 0.820225 | 0.719723 | 0.870293 | 0.787879 |
3 | 0.3 | 315.0 | 69.0 | 35.0 | 204.0 | 0.833066 | 0.747253 | 0.853556 | 0.796875 |
4 | 0.4 | 315.0 | 69.0 | 35.0 | 204.0 | 0.833066 | 0.747253 | 0.853556 | 0.796875 |
5 | 0.5 | 345.0 | 39.0 | 55.0 | 184.0 | 0.849117 | 0.825112 | 0.769874 | 0.796537 |
6 | 0.6 | 345.0 | 39.0 | 55.0 | 184.0 | 0.849117 | 0.825112 | 0.769874 | 0.796537 |
7 | 0.7 | 369.0 | 15.0 | 97.0 | 142.0 | 0.820225 | 0.904459 | 0.594142 | 0.717172 |
8 | 0.8 | 373.0 | 11.0 | 107.0 | 132.0 | 0.810594 | 0.923077 | 0.552301 | 0.691099 |
9 | 0.9 | 383.0 | 1.0 | 155.0 | 84.0 | 0.749599 | 0.988235 | 0.351464 | 0.518519 |
10 | 1.0 | 384.0 | 0.0 | 168.0 | 71.0 | 0.730337 | 1.000000 | 0.297071 | 0.458065 |
# You can plot the ROC curve at any probability cut-off
ROC_Curve (DT_model_CV,y_train,X_train,y_test,X_test)
Random Forest Model and perform Hyperparameter Tunning¶
from sklearn.ensemble import RandomForestClassifier
# let's build a forst with 500 trees
RF_model = RandomForestClassifier(n_estimators = 500, random_state = 42)
# Train the model on training data
RF_model.fit(X_train, y_train);
train_RF = RF_model.predict(X_train)
train_RF[1:10]
array([1, 0, 1, 0, 1, 0, 0, 0, 0], dtype=int64)
train_probs_RF = RF_model.predict_proba(X_train)[:,1]
train_probs_RF[1:10]
array([0.77 , 0.098 , 0.89440238, 0.041 , 0.976 , 0.36282056, 0.11171319, 0.01733333, 0.1383619 ])
train_result = classification_table_AA(RF_model,X_train, y_train)
train_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 384.0 | 0.0 | 239.0 | 0.383628 | 0.383628 | 1.000000 | 0.554524 |
1 | 0.1 | 264.0 | 120.0 | 3.0 | 236.0 | 0.802568 | 0.662921 | 0.987448 | 0.793277 |
2 | 0.2 | 306.0 | 78.0 | 6.0 | 233.0 | 0.865169 | 0.749196 | 0.974895 | 0.847273 |
3 | 0.3 | 345.0 | 39.0 | 13.0 | 226.0 | 0.916533 | 0.852830 | 0.945607 | 0.896825 |
4 | 0.4 | 369.0 | 15.0 | 19.0 | 220.0 | 0.945425 | 0.936170 | 0.920502 | 0.928270 |
5 | 0.5 | 372.0 | 12.0 | 22.0 | 217.0 | 0.945425 | 0.947598 | 0.907950 | 0.927350 |
6 | 0.6 | 376.0 | 8.0 | 26.0 | 213.0 | 0.945425 | 0.963801 | 0.891213 | 0.926087 |
7 | 0.7 | 382.0 | 2.0 | 51.0 | 188.0 | 0.914928 | 0.989474 | 0.786611 | 0.876457 |
8 | 0.8 | 382.0 | 2.0 | 69.0 | 170.0 | 0.886035 | 0.988372 | 0.711297 | 0.827251 |
9 | 0.9 | 384.0 | 0.0 | 105.0 | 134.0 | 0.831461 | 1.000000 | 0.560669 | 0.718499 |
10 | 1.0 | 384.0 | 0.0 | 210.0 | 29.0 | 0.662921 | 1.000000 | 0.121339 | 0.216418 |
train_result = classification_table_AA(RF_model,X_test, y_test)
train_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 87.0 | 0.0 | 47.0 | 0.350746 | 0.350746 | 1.000000 | 0.519337 |
1 | 0.1 | 46.0 | 41.0 | 9.0 | 38.0 | 0.626866 | 0.481013 | 0.808511 | 0.603175 |
2 | 0.2 | 59.0 | 28.0 | 10.0 | 37.0 | 0.716418 | 0.569231 | 0.787234 | 0.660714 |
3 | 0.3 | 68.0 | 19.0 | 11.0 | 36.0 | 0.776119 | 0.654545 | 0.765957 | 0.705882 |
4 | 0.4 | 74.0 | 13.0 | 11.0 | 36.0 | 0.820896 | 0.734694 | 0.765957 | 0.750000 |
5 | 0.5 | 76.0 | 11.0 | 11.0 | 36.0 | 0.835821 | 0.765957 | 0.765957 | 0.765957 |
6 | 0.6 | 79.0 | 8.0 | 16.0 | 31.0 | 0.820896 | 0.794872 | 0.659574 | 0.720930 |
7 | 0.7 | 81.0 | 6.0 | 18.0 | 29.0 | 0.820896 | 0.828571 | 0.617021 | 0.707317 |
8 | 0.8 | 83.0 | 4.0 | 21.0 | 26.0 | 0.813433 | 0.866667 | 0.553191 | 0.675325 |
9 | 0.9 | 84.0 | 3.0 | 29.0 | 18.0 | 0.761194 | 0.857143 | 0.382979 | 0.529412 |
10 | 1.0 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
# Get numerical feature importances
def feature_imp_RF(model,x):
df = pd.DataFrame(columns=['Variable','Importance'])
importances = list(model.feature_importances_)
col_list = list(x.columns)
for i in range(len(col_list)):
new_row = {'Variable' : col_list[i]
, 'Importance' : importances[i]
}
df = df.append(new_row, ignore_index=True)
return df
X = feature_imp_RF(RF_model,X_train)
X
Variable | Importance | |
---|---|---|
0 | Pclass | 0.093017 |
1 | Sex_female | 0.308676 |
2 | Family_Scaled | 0.095520 |
3 | age_encoded | 0.091763 |
4 | Fare_M2 | 0.411024 |
# Let's plot ROC curve for RF Model
ROC_Curve(RF_model,y_train,X_train,y_test,X_test)
This is a classical case of over-fitting model
- Model is fitting on train data quite pefectly, but not as good on test data
- Here "Bias" is low, but "Variance" is high
RF_CV = RandomForestClassifier()
parameters = {
'n_estimators': [5, 10, 25, 50, 250, 500],
'max_depth': [2, 4, 8, 16, 32, None]
}
RF_CV = GridSearchCV(RF_CV, parameters, cv = 5)
RF_CV.fit(X_train, y_train.values.ravel())
CV_Print(RF_CV)
Params | Means | Stds | |
---|---|---|---|
0 | {'max_depth': 2, 'n_estimators': 5} | 0.775277 | 0.024315 |
1 | {'max_depth': 2, 'n_estimators': 10} | 0.796090 | 0.017231 |
2 | {'max_depth': 2, 'n_estimators': 25} | 0.789703 | 0.019460 |
3 | {'max_depth': 2, 'n_estimators': 50} | 0.780103 | 0.010678 |
4 | {'max_depth': 2, 'n_estimators': 250} | 0.781639 | 0.019288 |
5 | {'max_depth': 2, 'n_estimators': 500} | 0.780026 | 0.019526 |
6 | {'max_depth': 4, 'n_estimators': 5} | 0.800955 | 0.008008 |
7 | {'max_depth': 4, 'n_estimators': 10} | 0.808968 | 0.006605 |
8 | {'max_depth': 4, 'n_estimators': 25} | 0.821832 | 0.011771 |
9 | {'max_depth': 4, 'n_estimators': 50} | 0.807355 | 0.017104 |
10 | {'max_depth': 4, 'n_estimators': 250} | 0.813819 | 0.013578 |
11 | {'max_depth': 4, 'n_estimators': 500} | 0.820206 | 0.014170 |
12 | {'max_depth': 8, 'n_estimators': 5} | 0.829819 | 0.019591 |
13 | {'max_depth': 8, 'n_estimators': 10} | 0.828271 | 0.011673 |
14 | {'max_depth': 8, 'n_estimators': 25} | 0.841123 | 0.013446 |
15 | {'max_depth': 8, 'n_estimators': 50} | 0.834684 | 0.016448 |
16 | {'max_depth': 8, 'n_estimators': 250} | 0.845935 | 0.013509 |
17 | {'max_depth': 8, 'n_estimators': 500} | 0.844310 | 0.009476 |
18 | {'max_depth': 16, 'n_estimators': 5} | 0.800929 | 0.020204 |
19 | {'max_depth': 16, 'n_estimators': 10} | 0.802529 | 0.016170 |
20 | {'max_depth': 16, 'n_estimators': 25} | 0.818619 | 0.003878 |
21 | {'max_depth': 16, 'n_estimators': 50} | 0.820206 | 0.009919 |
22 | {'max_depth': 16, 'n_estimators': 250} | 0.829858 | 0.011720 |
23 | {'max_depth': 16, 'n_estimators': 500} | 0.829819 | 0.017474 |
24 | {'max_depth': 32, 'n_estimators': 5} | 0.808981 | 0.017170 |
25 | {'max_depth': 32, 'n_estimators': 10} | 0.820245 | 0.027074 |
26 | {'max_depth': 32, 'n_estimators': 25} | 0.813794 | 0.012971 |
27 | {'max_depth': 32, 'n_estimators': 50} | 0.818606 | 0.019444 |
28 | {'max_depth': 32, 'n_estimators': 250} | 0.831458 | 0.005103 |
29 | {'max_depth': 32, 'n_estimators': 500} | 0.833071 | 0.009221 |
30 | {'max_depth': None, 'n_estimators': 5} | 0.805794 | 0.011494 |
31 | {'max_depth': None, 'n_estimators': 10} | 0.815419 | 0.014198 |
32 | {'max_depth': None, 'n_estimators': 25} | 0.815355 | 0.032781 |
33 | {'max_depth': None, 'n_estimators': 50} | 0.823419 | 0.013581 |
34 | {'max_depth': None, 'n_estimators': 250} | 0.828219 | 0.014331 |
35 | {'max_depth': None, 'n_estimators': 500} | 0.831445 | 0.009045 |
RF_CV.best_estimator_
RandomForestClassifier(max_depth=8, n_estimators=250)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_depth=8, n_estimators=250)
RF_model_CV = RF_CV.fit(X_train, y_train.values.ravel())
RF_model_CV.score(X_train, y_train), RF_model_CV.score(X_test, y_test)
(0.913322632423756, 0.7985074626865671)
train_result = classification_table_AA(RF_model_CV,X_test, y_test)
train_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 87.0 | 0.0 | 47.0 | 0.350746 | 0.350746 | 1.000000 | 0.519337 |
1 | 0.1 | 37.0 | 50.0 | 7.0 | 40.0 | 0.574627 | 0.444444 | 0.851064 | 0.583942 |
2 | 0.2 | 63.0 | 24.0 | 9.0 | 38.0 | 0.753731 | 0.612903 | 0.808511 | 0.697248 |
3 | 0.3 | 68.0 | 19.0 | 13.0 | 34.0 | 0.761194 | 0.641509 | 0.723404 | 0.680000 |
4 | 0.4 | 72.0 | 15.0 | 13.0 | 34.0 | 0.791045 | 0.693878 | 0.723404 | 0.708333 |
5 | 0.5 | 76.0 | 11.0 | 16.0 | 31.0 | 0.798507 | 0.738095 | 0.659574 | 0.696629 |
6 | 0.6 | 80.0 | 7.0 | 16.0 | 31.0 | 0.828358 | 0.815789 | 0.659574 | 0.729412 |
7 | 0.7 | 84.0 | 3.0 | 18.0 | 29.0 | 0.843284 | 0.906250 | 0.617021 | 0.734177 |
8 | 0.8 | 87.0 | 0.0 | 21.0 | 26.0 | 0.843284 | 1.000000 | 0.553191 | 0.712329 |
9 | 0.9 | 87.0 | 0.0 | 28.0 | 19.0 | 0.791045 | 1.000000 | 0.404255 | 0.575758 |
10 | 1.0 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
# Let's plot ROC curve for RF Model
ROC_Curve(RF_model_CV,y_train,X_train,y_test,X_test)
Plotting the 2D and 3D contour to visualize - This is jazzy, but not very useful¶
grid_results = pd.concat([pd.DataFrame(RF_CV.cv_results_["params"])
, pd.DataFrame(RF_CV.cv_results_["mean_test_score"]
, columns=["Accuracy"])],axis=1)
grid_results.head()
max_depth | n_estimators | Accuracy | |
---|---|---|---|
0 | 2.0 | 5 | 0.762477 |
1 | 2.0 | 10 | 0.794542 |
2 | 2.0 | 25 | 0.788026 |
3 | 2.0 | 50 | 0.776865 |
4 | 2.0 | 250 | 0.773613 |
grid_contour = grid_results.groupby(['max_depth','n_estimators']).mean()
grid_contour.head()
Accuracy | ||
---|---|---|
max_depth | n_estimators | |
2.0 | 5 | 0.762477 |
10 | 0.794542 | |
25 | 0.788026 | |
50 | 0.776865 | |
250 | 0.773613 |
### Pivoting the data:
grid_reset = grid_contour.reset_index()
grid_reset.columns = ['max_depth', 'n_estimators', 'Accuracy']
grid_pivot = grid_reset.pivot('max_depth', 'n_estimators')
grid_pivot
Accuracy | ||||||
---|---|---|---|---|---|---|
n_estimators | 5 | 10 | 25 | 50 | 250 | 500 |
max_depth | ||||||
2.0 | 0.762477 | 0.794542 | 0.788026 | 0.776865 | 0.773613 | 0.778465 |
4.0 | 0.808955 | 0.815342 | 0.821819 | 0.813794 | 0.808994 | 0.823432 |
8.0 | 0.825045 | 0.831471 | 0.841097 | 0.833071 | 0.841110 | 0.839497 |
16.0 | 0.808968 | 0.812116 | 0.820194 | 0.818594 | 0.831445 | 0.831419 |
32.0 | 0.813781 | 0.816994 | 0.821819 | 0.821871 | 0.831458 | 0.825006 |
x = grid_pivot.columns.levels[1].values
y = grid_pivot.index.values
z = grid_pivot.values
import plotly.graph_objects as gb
layout = gb.Layout(
xaxis=gb.layout.XAxis(
title=gb.layout.xaxis.Title(
text='n_estimators')
),
yaxis=gb.layout.YAxis(
title=gb.layout.yaxis.Title(
text='max_depth')
) )
fig = gb.Figure(data = [gb.Contour(z=z, x=x, y=y)], layout=layout )
fig.update_layout(title='Graph showing n_estimators v/s max_depth', autosize=False,
width=700, height=700,
margin=dict(l=65, r=50, b=65, t=90))
fig.show()
fig = gb.Figure(data= [gb.Surface(z=z, y=y, x=x)], layout=layout )
fig.update_layout(title='Hyperparameter Tuning',
scene = dict(
xaxis_title='n_estimators',
yaxis_title='max_depth',
zaxis_title='Accuracy'),
autosize=False,
width=800, height=800,
margin=dict(l=65, r=50, b=65, t=90))
fig.show()
Extreme Gradient Boosting (XGBoost) Model and perform Hyperparameter Tunning¶
# conda install -c conda-forge xgboost
# !pip install xgboost
from xgboost import XGBClassifier
XGB_cl = XGBClassifier()
model_XGB = XGB_cl.fit(X_train, y_train)
model_XGB.score(X_train, y_train), model_XGB.score(X_test, y_test)
(0.9309791332263242, 0.8208955223880597)
preds = model_XGB.predict(X_test)
preds[1:10]
array([0, 0, 0, 1, 0, 1, 0, 0, 1])
preds = model_XGB.predict_proba(X_test)[:,1]
preds[1:10]
array([8.8950247e-03, 4.6509650e-01, 1.9177219e-01, 9.9429333e-01, 5.5087014e-04, 9.8206317e-01, 4.4684451e-02, 1.3533361e-01, 9.2851740e-01], dtype=float32)
train_result = classification_table_AA(model_XGB,X_train, y_train)
train_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 384.0 | 0.0 | 239.0 | 0.383628 | 0.383628 | 1.000000 | 0.554524 |
1 | 0.1 | 255.0 | 129.0 | 5.0 | 234.0 | 0.784912 | 0.644628 | 0.979079 | 0.777409 |
2 | 0.2 | 320.0 | 64.0 | 14.0 | 225.0 | 0.874799 | 0.778547 | 0.941423 | 0.852273 |
3 | 0.3 | 352.0 | 32.0 | 19.0 | 220.0 | 0.918138 | 0.873016 | 0.920502 | 0.896130 |
4 | 0.4 | 363.0 | 21.0 | 25.0 | 214.0 | 0.926164 | 0.910638 | 0.895397 | 0.902954 |
5 | 0.5 | 370.0 | 14.0 | 29.0 | 210.0 | 0.930979 | 0.937500 | 0.878661 | 0.907127 |
6 | 0.6 | 372.0 | 12.0 | 32.0 | 207.0 | 0.929374 | 0.945205 | 0.866109 | 0.903930 |
7 | 0.7 | 378.0 | 6.0 | 49.0 | 190.0 | 0.911717 | 0.969388 | 0.794979 | 0.873563 |
8 | 0.8 | 381.0 | 3.0 | 68.0 | 171.0 | 0.886035 | 0.982759 | 0.715481 | 0.828087 |
9 | 0.9 | 384.0 | 0.0 | 109.0 | 130.0 | 0.825040 | 1.000000 | 0.543933 | 0.704607 |
10 | 1.0 | 384.0 | 0.0 | 239.0 | 0.0 | 0.616372 | 0.000000 | 0.000000 | 0.000000 |
# Let's plot ROC curve for RF Model
ROC_Curve(model_XGB,y_train,X_train,y_test,X_test)
importance = model_XGB.feature_importances_
feature_importance = pd.Series(importance, index = X_train.columns)
feature_importance = feature_importance.sort_values(ascending = False)
feature_importance.plot(kind = 'bar')
plt.ylabel('Importance');
- n_estimators: Total number of trees [Dealut value is 100]
- learning_rate:This determines the impact of each tree on the final outcome [Default value is 0.1, range 0 to 1]
- random_state: The random number seed so that same random numbers are generated every time
- max_depth: Maximum depth to which tree can grow (stopping criteria) [Default value is 6, more the value more overfitting it may lead to]
- subsample: The fraction of observations to be selected for each tree. Selection is done by random sampling [Default is 1, if we set it to 0.5, XGBoost would randomly sample 50% of training data prior to growing trees]
- objective: Defines Loss function (binary:logistic is for classification using probability, reg:logistic is for classification, reg:linear is for regression)
- colsample_bylevel: Random feature selection at levels [Default is 1 i.e. all varaibles, we can define 0 to 1]
- colsample_bytree: Subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed [Default is 1 i.e. all varaibles, we can define 0 to 1]
Gradient Descent tries to bring a balance between Bias and variance (minimize both) and brings the model to global minima of error.
XGB_model_CV = XGBClassifier()
parameters = {
'n_estimators': [5, 10, 25, 50, 250],
'max_depth': [2, 4, 8, 16, 32, None],
'colsample_bytree' : [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6] ,
'objective' : ['binary:logistic']
}
XGB_model_CV = GridSearchCV(XGB_model_CV, parameters, cv=5)
XGB_model_CV.fit(X_train, y_train.values.ravel())
CV_Print(XGB_model_CV)
Params | Means | Stds | |
---|---|---|---|
0 | {'colsample_bytree': 0.1, 'max_depth': 2, 'n_e... | 0.703097 | 0.033965 |
1 | {'colsample_bytree': 0.1, 'max_depth': 2, 'n_e... | 0.714297 | 0.026990 |
2 | {'colsample_bytree': 0.1, 'max_depth': 2, 'n_e... | 0.786568 | 0.021365 |
3 | {'colsample_bytree': 0.1, 'max_depth': 2, 'n_e... | 0.804155 | 0.012322 |
4 | {'colsample_bytree': 0.1, 'max_depth': 2, 'n_e... | 0.824994 | 0.019001 |
... | ... | ... | ... |
265 | {'colsample_bytree': 0.6, 'max_depth': None, '... | 0.796194 | 0.013296 |
266 | {'colsample_bytree': 0.6, 'max_depth': None, '... | 0.812206 | 0.011853 |
267 | {'colsample_bytree': 0.6, 'max_depth': None, '... | 0.831445 | 0.012594 |
268 | {'colsample_bytree': 0.6, 'max_depth': None, '... | 0.821781 | 0.013500 |
269 | {'colsample_bytree': 0.6, 'max_depth': None, '... | 0.815368 | 0.014855 |
270 rows × 3 columns
XGB_model_CV.best_estimator_
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.4, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=8, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=25, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.4, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=8, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=25, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...)
XGB_model_CV.score(X_train, y_train), XGB_model_CV.score(X_test, y_test)
(0.8940609951845907, 0.8208955223880597)
train_result = classification_table_AA(XGB_model_CV,X_train, y_train)
train_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 384.0 | 0.0 | 239.0 | 0.383628 | 0.383628 | 1.000000 | 0.554524 |
1 | 0.1 | 150.0 | 234.0 | 3.0 | 236.0 | 0.619583 | 0.502128 | 0.987448 | 0.665726 |
2 | 0.2 | 278.0 | 106.0 | 17.0 | 222.0 | 0.802568 | 0.676829 | 0.928870 | 0.783069 |
3 | 0.3 | 331.0 | 53.0 | 30.0 | 209.0 | 0.866774 | 0.797710 | 0.874477 | 0.834331 |
4 | 0.4 | 345.0 | 39.0 | 33.0 | 206.0 | 0.884430 | 0.840816 | 0.861925 | 0.851240 |
5 | 0.5 | 355.0 | 29.0 | 37.0 | 202.0 | 0.894061 | 0.874459 | 0.845188 | 0.859574 |
6 | 0.6 | 363.0 | 21.0 | 56.0 | 183.0 | 0.876404 | 0.897059 | 0.765690 | 0.826185 |
7 | 0.7 | 375.0 | 9.0 | 84.0 | 155.0 | 0.850722 | 0.945122 | 0.648536 | 0.769231 |
8 | 0.8 | 382.0 | 2.0 | 123.0 | 116.0 | 0.799358 | 0.983051 | 0.485356 | 0.649860 |
9 | 0.9 | 384.0 | 0.0 | 173.0 | 66.0 | 0.722311 | 1.000000 | 0.276151 | 0.432787 |
10 | 1.0 | 384.0 | 0.0 | 239.0 | 0.0 | 0.616372 | 0.000000 | 0.000000 | 0.000000 |
test_result = classification_table_AA(XGB_model_CV,X_test, y_test)
test_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 87.0 | 0.0 | 47.0 | 0.350746 | 0.350746 | 1.000000 | 0.519337 |
1 | 0.1 | 27.0 | 60.0 | 5.0 | 42.0 | 0.514925 | 0.411765 | 0.893617 | 0.563758 |
2 | 0.2 | 57.0 | 30.0 | 8.0 | 39.0 | 0.716418 | 0.565217 | 0.829787 | 0.672414 |
3 | 0.3 | 66.0 | 21.0 | 12.0 | 35.0 | 0.753731 | 0.625000 | 0.744681 | 0.679612 |
4 | 0.4 | 74.0 | 13.0 | 14.0 | 33.0 | 0.798507 | 0.717391 | 0.702128 | 0.709677 |
5 | 0.5 | 77.0 | 10.0 | 14.0 | 33.0 | 0.820896 | 0.767442 | 0.702128 | 0.733333 |
6 | 0.6 | 80.0 | 7.0 | 17.0 | 30.0 | 0.820896 | 0.810811 | 0.638298 | 0.714286 |
7 | 0.7 | 80.0 | 7.0 | 21.0 | 26.0 | 0.791045 | 0.787879 | 0.553191 | 0.650000 |
8 | 0.8 | 85.0 | 2.0 | 27.0 | 20.0 | 0.783582 | 0.909091 | 0.425532 | 0.579710 |
9 | 0.9 | 87.0 | 0.0 | 39.0 | 8.0 | 0.708955 | 1.000000 | 0.170213 | 0.290909 |
10 | 1.0 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
ROC_Curve(XGB_model_CV,y_train,X_train,y_test,X_test)
Support Vector Machine (SVM) and perform Hyperparameter Tunning¶¶
from sklearn.svm import SVC
SVM = SVC()
#SVM.probability=True # Not recommended, as it makes it extremely slow
SVM_model = SVM.fit(X_train, y_train)
SVM_model.score(X_train, y_train), SVM_model.score(X_test, y_test)
(0.8170144462279294, 0.8432835820895522)
SVM = SVC() # probability=True
parameters = {
'kernel': ['linear', 'rbf'], #can also add 'sigmoid','poly'
'C': [0.01, 0.1, 1, 10]
}
SVM_CV = GridSearchCV(SVM, parameters, cv=5)
SVM_CV.fit(X_train, y_train.values.ravel())
CV_Print(SVM_CV)
Params | Means | Stds | |
---|---|---|---|
0 | {'C': 0.01, 'kernel': 'linear'} | 0.754465 | 0.022440 |
1 | {'C': 0.01, 'kernel': 'rbf'} | 0.616374 | 0.002591 |
2 | {'C': 0.1, 'kernel': 'linear'} | 0.791265 | 0.018372 |
3 | {'C': 0.1, 'kernel': 'rbf'} | 0.804116 | 0.015619 |
4 | {'C': 1, 'kernel': 'linear'} | 0.791265 | 0.018372 |
5 | {'C': 1, 'kernel': 'rbf'} | 0.808955 | 0.011228 |
6 | {'C': 10, 'kernel': 'linear'} | 0.791265 | 0.018372 |
7 | {'C': 10, 'kernel': 'rbf'} | 0.818568 | 0.015521 |
SVM_CV.best_estimator_
SVC(C=10)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SVC(C=10)
SVM_model_CV = SVM_CV.fit(X_train, y_train.values.ravel())
SVM_model_CV.score(X_train, y_train), SVM_model_CV.score(X_test, y_test)
(0.8250401284109149, 0.8283582089552238)
preds = SVM_model_CV.predict(X_test)
preds[1:10]
array([0, 0, 0, 1, 0, 1, 0, 0, 1], dtype=int64)
K-Nearest Neighbors and perform Hyperparameter Tunning¶¶
from sklearn.neighbors import KNeighborsClassifier
# Create a KNN classifier object
knn = KNeighborsClassifier()
model_KNN = knn.fit(X_train, y_train)
model_KNN.score(X_train, y_train), model_KNN.score(X_test, y_test)
(0.8635634028892456, 0.7910447761194029)
preds = model_KNN.predict(X_test)
preds[1:10]
array([0, 0, 1, 1, 0, 1, 0, 0, 1], dtype=int64)
preds = model_KNN.predict_proba(X_test)[:,1]
preds[1:10]
array([0. , 0.4, 0.6, 1. , 0. , 1. , 0.2, 0. , 0.8])
train_result = classification_table_AA(model_KNN,X_train, y_train)
train_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 384.0 | 0.0 | 239.0 | 0.383628 | 0.383628 | 1.000000 | 0.554524 |
1 | 0.1 | 214.0 | 170.0 | 6.0 | 233.0 | 0.717496 | 0.578164 | 0.974895 | 0.725857 |
2 | 0.2 | 214.0 | 170.0 | 6.0 | 233.0 | 0.717496 | 0.578164 | 0.974895 | 0.725857 |
3 | 0.3 | 295.0 | 89.0 | 22.0 | 217.0 | 0.821830 | 0.709150 | 0.907950 | 0.796330 |
4 | 0.4 | 295.0 | 89.0 | 22.0 | 217.0 | 0.821830 | 0.709150 | 0.907950 | 0.796330 |
5 | 0.5 | 350.0 | 34.0 | 51.0 | 188.0 | 0.863563 | 0.846847 | 0.786611 | 0.815618 |
6 | 0.6 | 370.0 | 14.0 | 79.0 | 160.0 | 0.850722 | 0.919540 | 0.669456 | 0.774818 |
7 | 0.7 | 370.0 | 14.0 | 79.0 | 160.0 | 0.850722 | 0.919540 | 0.669456 | 0.774818 |
8 | 0.8 | 370.0 | 14.0 | 79.0 | 160.0 | 0.850722 | 0.919540 | 0.669456 | 0.774818 |
9 | 0.9 | 382.0 | 2.0 | 135.0 | 104.0 | 0.780096 | 0.981132 | 0.435146 | 0.602899 |
10 | 1.0 | 382.0 | 2.0 | 135.0 | 104.0 | 0.780096 | 0.981132 | 0.435146 | 0.602899 |
# Let's plot ROC curve for KNN Model
ROC_Curve(model_KNN,y_train,X_train,y_test,X_test)
KNN_model_CV = KNeighborsClassifier()
parameters = {
'n_neighbors': [3,5,7,9,11],
'weights': ['uniform', 'distance'],
'metric' : ['minkowski','euclidean','manhattan']
}
KNN_model_CV = GridSearchCV(KNN_model_CV, parameters, cv=5)
KNN_model_CV.fit(X_train, y_train.values.ravel())
CV_Print(KNN_model_CV)
Params | Means | Stds | |
---|---|---|---|
0 | {'metric': 'minkowski', 'n_neighbors': 3, 'wei... | 0.804168 | 0.016600 |
1 | {'metric': 'minkowski', 'n_neighbors': 3, 'wei... | 0.786439 | 0.022656 |
2 | {'metric': 'minkowski', 'n_neighbors': 5, 'wei... | 0.823497 | 0.022813 |
3 | {'metric': 'minkowski', 'n_neighbors': 5, 'wei... | 0.797716 | 0.017589 |
4 | {'metric': 'minkowski', 'n_neighbors': 7, 'wei... | 0.817058 | 0.023160 |
5 | {'metric': 'minkowski', 'n_neighbors': 7, 'wei... | 0.797703 | 0.016182 |
6 | {'metric': 'minkowski', 'n_neighbors': 9, 'wei... | 0.823419 | 0.023946 |
7 | {'metric': 'minkowski', 'n_neighbors': 9, 'wei... | 0.797703 | 0.016954 |
8 | {'metric': 'minkowski', 'n_neighbors': 11, 'we... | 0.821781 | 0.019055 |
9 | {'metric': 'minkowski', 'n_neighbors': 11, 'we... | 0.797703 | 0.016954 |
10 | {'metric': 'euclidean', 'n_neighbors': 3, 'wei... | 0.804168 | 0.016600 |
11 | {'metric': 'euclidean', 'n_neighbors': 3, 'wei... | 0.786439 | 0.022656 |
12 | {'metric': 'euclidean', 'n_neighbors': 5, 'wei... | 0.823497 | 0.022813 |
13 | {'metric': 'euclidean', 'n_neighbors': 5, 'wei... | 0.797716 | 0.017589 |
14 | {'metric': 'euclidean', 'n_neighbors': 7, 'wei... | 0.817058 | 0.023160 |
15 | {'metric': 'euclidean', 'n_neighbors': 7, 'wei... | 0.797703 | 0.016182 |
16 | {'metric': 'euclidean', 'n_neighbors': 9, 'wei... | 0.823419 | 0.023946 |
17 | {'metric': 'euclidean', 'n_neighbors': 9, 'wei... | 0.797703 | 0.016954 |
18 | {'metric': 'euclidean', 'n_neighbors': 11, 'we... | 0.821781 | 0.019055 |
19 | {'metric': 'euclidean', 'n_neighbors': 11, 'we... | 0.797703 | 0.016954 |
20 | {'metric': 'manhattan', 'n_neighbors': 3, 'wei... | 0.807381 | 0.013470 |
21 | {'metric': 'manhattan', 'n_neighbors': 3, 'wei... | 0.789652 | 0.022518 |
22 | {'metric': 'manhattan', 'n_neighbors': 5, 'wei... | 0.826710 | 0.025280 |
23 | {'metric': 'manhattan', 'n_neighbors': 5, 'wei... | 0.800929 | 0.015996 |
24 | {'metric': 'manhattan', 'n_neighbors': 7, 'wei... | 0.818658 | 0.020997 |
25 | {'metric': 'manhattan', 'n_neighbors': 7, 'wei... | 0.797703 | 0.016182 |
26 | {'metric': 'manhattan', 'n_neighbors': 9, 'wei... | 0.826619 | 0.026361 |
27 | {'metric': 'manhattan', 'n_neighbors': 9, 'wei... | 0.799303 | 0.015932 |
28 | {'metric': 'manhattan', 'n_neighbors': 11, 'we... | 0.823381 | 0.024715 |
29 | {'metric': 'manhattan', 'n_neighbors': 11, 'we... | 0.800903 | 0.019857 |
KNN_model_CV.best_estimator_
KNeighborsClassifier(metric='manhattan')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KNeighborsClassifier(metric='manhattan')
accuracy = KNN_model_CV.score(X_test, y_test)
print("Test Accuracy: ", accuracy)
Test Accuracy: 0.7910447761194029
train_result = classification_table_AA(KNN_model_CV,X_train, y_train)
train_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 384.0 | 0.0 | 239.0 | 0.383628 | 0.383628 | 1.000000 | 0.554524 |
1 | 0.1 | 214.0 | 170.0 | 6.0 | 233.0 | 0.717496 | 0.578164 | 0.974895 | 0.725857 |
2 | 0.2 | 214.0 | 170.0 | 6.0 | 233.0 | 0.717496 | 0.578164 | 0.974895 | 0.725857 |
3 | 0.3 | 295.0 | 89.0 | 22.0 | 217.0 | 0.821830 | 0.709150 | 0.907950 | 0.796330 |
4 | 0.4 | 295.0 | 89.0 | 22.0 | 217.0 | 0.821830 | 0.709150 | 0.907950 | 0.796330 |
5 | 0.5 | 350.0 | 34.0 | 51.0 | 188.0 | 0.863563 | 0.846847 | 0.786611 | 0.815618 |
6 | 0.6 | 370.0 | 14.0 | 77.0 | 162.0 | 0.853933 | 0.920455 | 0.677824 | 0.780723 |
7 | 0.7 | 370.0 | 14.0 | 77.0 | 162.0 | 0.853933 | 0.920455 | 0.677824 | 0.780723 |
8 | 0.8 | 370.0 | 14.0 | 77.0 | 162.0 | 0.853933 | 0.920455 | 0.677824 | 0.780723 |
9 | 0.9 | 382.0 | 2.0 | 135.0 | 104.0 | 0.780096 | 0.981132 | 0.435146 | 0.602899 |
10 | 1.0 | 382.0 | 2.0 | 135.0 | 104.0 | 0.780096 | 0.981132 | 0.435146 | 0.602899 |
test_result = classification_table_AA(KNN_model_CV,X_test, y_test)
test_result
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 87.0 | 0.0 | 47.0 | 0.350746 | 0.350746 | 1.000000 | 0.519337 |
1 | 0.1 | 44.0 | 43.0 | 8.0 | 39.0 | 0.619403 | 0.475610 | 0.829787 | 0.604651 |
2 | 0.2 | 44.0 | 43.0 | 8.0 | 39.0 | 0.619403 | 0.475610 | 0.829787 | 0.604651 |
3 | 0.3 | 62.0 | 25.0 | 8.0 | 39.0 | 0.753731 | 0.609375 | 0.829787 | 0.702703 |
4 | 0.4 | 62.0 | 25.0 | 8.0 | 39.0 | 0.753731 | 0.609375 | 0.829787 | 0.702703 |
5 | 0.5 | 72.0 | 15.0 | 13.0 | 34.0 | 0.791045 | 0.693878 | 0.723404 | 0.708333 |
6 | 0.6 | 80.0 | 7.0 | 18.0 | 29.0 | 0.813433 | 0.805556 | 0.617021 | 0.698795 |
7 | 0.7 | 80.0 | 7.0 | 18.0 | 29.0 | 0.813433 | 0.805556 | 0.617021 | 0.698795 |
8 | 0.8 | 80.0 | 7.0 | 18.0 | 29.0 | 0.813433 | 0.805556 | 0.617021 | 0.698795 |
9 | 0.9 | 86.0 | 1.0 | 26.0 | 21.0 | 0.798507 | 0.954545 | 0.446809 | 0.608696 |
10 | 1.0 | 86.0 | 1.0 | 26.0 | 21.0 | 0.798507 | 0.954545 | 0.446809 | 0.608696 |
ROC_Curve(KNN_model_CV,y_train,X_train,y_test,X_test)
Neural Network Model and perform Hyperparameter Tunning¶¶
#!pip install tensorflow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create a TensorFlow model
def create_model(units=16, activation='relu', optimizer='adam'):
model = Sequential()
model.add(Dense(units, activation=activation, input_dim=5))
model.add(Dense(units, activation=activation))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
return model
model = Sequential(): This line creates a new sequential model object. The sequential model is a linear stack of layers, where you can add layers one by one.
model.add(Dense(units, activation=activation, input_dim=5)): This line adds a dense layer to the model. A dense layer is a fully connected layer, where each neuron is connected to all the neurons in the previous layer. The units parameter specifies the number of neurons in the layer. The activation parameter defines the activation function to be used in the layer. The input_dim parameter is used only for the first layer and specifies the input shape.
model.add(Dense(units, activation=activation)): This line adds another dense layer to the model. It has the same number of units as the previous layer and uses the same activation function.
model.add(Dense(3, activation='softmax')): This line adds the output layer to the model. It is a dense layer with 3 neurons, representing the number of classes in the classification problem. The activation function 'softmax' is used to compute the probability distribution over the classes.
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy']): This line compiles the model. It specifies the optimizer to be used during training, the loss function to optimize (sparse categorical cross-entropy for multi-class classification problems), and the metrics to evaluate during training and testing (in this case, accuracy).
NN_model = tf.keras.wrappers.scikit_learn.KerasClassifier(build_fn=create_model, epochs=10, verbose=0)
model_tf = NN_model.fit(X_train_scaled, y_train)
model_history = NN_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=150)
simple method to optimize the model¶
# summarize history for loss
plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
model_tf = NN_model.fit(X_train_scaled, y_train, epochs=18)
print(accuracy_score(NN_model.predict(X_train), y_train), "\n",
accuracy_score(NN_model.predict(X_test), y_test))
20/20 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 1ms/step 0.6211878009630819 0.6492537313432836
test_result = classification_table_AA(NN_model,X_test, y_test)
test_result
5/5 [==============================] - 0s 3ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 3ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 87.0 | 0.0 | 47.0 | 0.350746 | 0.350746 | 1.000000 | 0.519337 |
1 | 0.1 | 61.0 | 26.0 | 11.0 | 36.0 | 0.723881 | 0.580645 | 0.765957 | 0.660550 |
2 | 0.2 | 76.0 | 11.0 | 23.0 | 24.0 | 0.746269 | 0.685714 | 0.510638 | 0.585366 |
3 | 0.3 | 87.0 | 0.0 | 37.0 | 10.0 | 0.723881 | 1.000000 | 0.212766 | 0.350877 |
4 | 0.4 | 87.0 | 0.0 | 45.0 | 2.0 | 0.664179 | 1.000000 | 0.042553 | 0.081633 |
5 | 0.5 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
6 | 0.6 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
7 | 0.7 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
8 | 0.8 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
9 | 0.9 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
10 | 1.0 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
ROC_Curve(NN_model,y_train,X_train_scaled,y_test,X_test_scaled)
20/20 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step
Hyperparamter Tuning of the Neural network Model¶
model.get_params().keys()
dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])
parameters = {
'units': [8, 16, 32],
'activation': ['relu', 'sigmoid'],
'optimizer': ['adam', 'rmsprop']
}
model_tf_CV = GridSearchCV(NN_model, parameters, cv=5)
model_tf_CV.fit(X_train_scaled, y_train.values.ravel())
CV_Print(model_tf_CV)
Params | Means | Stds | |
---|---|---|---|
0 | {'activation': 'relu', 'optimizer': 'adam', 'u... | 0.739987 | 0.039263 |
1 | {'activation': 'relu', 'optimizer': 'adam', 'u... | 0.781729 | 0.042328 |
2 | {'activation': 'relu', 'optimizer': 'adam', 'u... | 0.810490 | 0.029660 |
3 | {'activation': 'relu', 'optimizer': 'rmsprop',... | 0.776929 | 0.038429 |
4 | {'activation': 'relu', 'optimizer': 'rmsprop',... | 0.805742 | 0.018218 |
5 | {'activation': 'relu', 'optimizer': 'rmsprop',... | 0.797716 | 0.026300 |
6 | {'activation': 'sigmoid', 'optimizer': 'adam',... | 0.616477 | 0.043670 |
7 | {'activation': 'sigmoid', 'optimizer': 'adam',... | 0.626103 | 0.047453 |
8 | {'activation': 'sigmoid', 'optimizer': 'adam',... | 0.711084 | 0.042661 |
9 | {'activation': 'sigmoid', 'optimizer': 'rmspro... | 0.616477 | 0.043670 |
10 | {'activation': 'sigmoid', 'optimizer': 'rmspro... | 0.621277 | 0.047170 |
11 | {'activation': 'sigmoid', 'optimizer': 'rmspro... | 0.736916 | 0.057874 |
# Print the best hyperparameters and the corresponding mean cross-validated score
print("Best Hyperparameters: ", model_tf_CV.best_params_)
print("Best Score: ", model_tf_CV.best_score_)
# Evaluate the model on the test set using the best hyperparameters
best_model = model_tf_CV.best_estimator_
print("Best_Model: ", best_model)
Best Hyperparameters: {'activation': 'relu', 'optimizer': 'adam', 'units': 32} Best Score: 0.8104903221130371 Best_Model: <keras.wrappers.scikit_learn.KerasClassifier object at 0x000002518B6AFD60>
train_result = classification_table_AA(model_tf_CV,X_test, y_test)
train_result
5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 3ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 3ms/step 5/5 [==============================] - 0s 1ms/step 5/5 [==============================] - 0s 3ms/step
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 87.0 | 0.0 | 47.0 | 0.350746 | 0.350746 | 1.000000 | 0.519337 |
1 | 0.1 | 67.0 | 20.0 | 18.0 | 29.0 | 0.716418 | 0.591837 | 0.617021 | 0.604167 |
2 | 0.2 | 87.0 | 0.0 | 24.0 | 23.0 | 0.820896 | 1.000000 | 0.489362 | 0.657143 |
3 | 0.3 | 87.0 | 0.0 | 36.0 | 11.0 | 0.731343 | 1.000000 | 0.234043 | 0.379310 |
4 | 0.4 | 87.0 | 0.0 | 45.0 | 2.0 | 0.664179 | 1.000000 | 0.042553 | 0.081633 |
5 | 0.5 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
6 | 0.6 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
7 | 0.7 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
8 | 0.8 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
9 | 0.9 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
10 | 1.0 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
test_result = classification_table_AA(model_tf_CV,X_test, y_test)
test_result
5/5 [==============================] - 0s 3ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 3ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step
Threshold | TP | FP | FN | TN | Accuracy | Precision | TPR_Recall | F1_Score | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 87.0 | 0.0 | 47.0 | 0.350746 | 0.350746 | 1.000000 | 0.519337 |
1 | 0.1 | 67.0 | 20.0 | 18.0 | 29.0 | 0.716418 | 0.591837 | 0.617021 | 0.604167 |
2 | 0.2 | 87.0 | 0.0 | 24.0 | 23.0 | 0.820896 | 1.000000 | 0.489362 | 0.657143 |
3 | 0.3 | 87.0 | 0.0 | 36.0 | 11.0 | 0.731343 | 1.000000 | 0.234043 | 0.379310 |
4 | 0.4 | 87.0 | 0.0 | 45.0 | 2.0 | 0.664179 | 1.000000 | 0.042553 | 0.081633 |
5 | 0.5 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
6 | 0.6 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
7 | 0.7 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
8 | 0.8 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
9 | 0.9 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
10 | 1.0 | 87.0 | 0.0 | 47.0 | 0.0 | 0.649254 | 0.000000 | 0.000000 | 0.000000 |
ROC_Curve(model_tf_CV,y_train,X_train_scaled,y_test,X_test_scaled)
20/20 [==============================] - 0s 2ms/step 5/5 [==============================] - 0s 2ms/step
Neural network model is not working properly here because of Insufficient data, it need high volume of data. Other cases, where it might not work well are:
Imbalanced classes: If you have imbalanced classes in your dataset, where the positive class is rare compared to the negative class, it might be challenging for the model to correctly predict the positive class. In such cases, the model might be biased towards predicting the negative class, leading to low precision for the positive class.
Insufficient training data: If you have limited training data or an insufficient representation of the positive class, the model might struggle to learn patterns specific to the positive class. This can result in lower precision when predicting the positive class, especially when the model's predictions are based on probabilities.
nappropriate probability threshold: The probability threshold you are using to calculate precision might not be optimal for your specific problem. By default, the threshold for classification is usually set at 0.5, but you can adjust this threshold to achieve a desired precision-recall trade-off based on your problem requirements.
Compare model results and Champion model selection¶
In this section, we will:
- Evaluate all of our saved models on the validation set
- Select the best model based on performance on the validation set
- Evaluate that model on the holdout test set
import joblib
from time import time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
joblib.dump(Logistic_model_cv.best_estimator_, 'Logistic_model_cv.pkl')
joblib.dump(DT_model_CV.best_estimator_, 'DT_model_CV.pkl')
joblib.dump(RF_model_CV.best_estimator_, 'RF_model_CV.pkl')
joblib.dump(XGB_model_CV.best_estimator_, 'XGB_model_CV.pkl')
joblib.dump(SVM_model_CV.best_estimator_, 'SVM_model_CV.pkl')
joblib.dump(KNN_model_CV.best_estimator_, 'KNN_model_CV.pkl')
joblib.dump(model_tf_CV.best_estimator_, 'model_tf_CV.pkl')
['model_tf_CV.pkl']
list_model = ['Logistic_model_cv', 'DT_model_CV', 'RF_model_CV', 'XGB_model_CV', 'SVM_model_CV','KNN_model_CV','model_tf_CV']
len(list_model)
7
models = {}
for mdl in list_model:
models[mdl] = joblib.load('{}.pkl'.format(mdl))
models.values()
dict_values([LogisticRegression(C=40), DecisionTreeClassifier(max_depth=5, min_samples_leaf=8, random_state=1234), RandomForestClassifier(max_depth=8, n_estimators=250), XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.4, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=8, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=25, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, ...), SVC(C=10), KNeighborsClassifier(metric='manhattan'), <keras.wrappers.scikit_learn.KerasClassifier object at 0x000002518C92C070>])
def evaluate_model_AA(list_model, X, y):
df = pd.DataFrame(columns=['Model','Accuracy','Precision','Recall','F1_score','Latency'])
for n in range(len(list_model)):
name = list(models.keys())[n]
mdl = list(models.values())[n]
start = time()
pred = mdl.predict(X)
end = time()
accuracy = round(accuracy_score(y, pred), 3)
precision = round(precision_score(y, pred), 3)
recall = round(recall_score(y, pred), 3)
f1 = round(f1_score(y, pred), 3)
new_row = {'Model' : name
, 'Accuracy' : accuracy
, 'Precision' : precision
, 'Recall' : recall
, 'F1_score' : f1
, 'Latency' : round((end - start)*1000, 1)
}
df = df.append(new_row, ignore_index=True)
return(df)
evaluate_model_AA(list_model, X_train, y_train)
20/20 [==============================] - 0s 2ms/step
Model | Accuracy | Precision | Recall | F1_score | Latency | |
---|---|---|---|---|---|---|
0 | Logistic_model_cv | 0.807 | 0.762 | 0.724 | 0.742 | 2.1 |
1 | DT_model_CV | 0.849 | 0.825 | 0.770 | 0.797 | 1.0 |
2 | RF_model_CV | 0.913 | 0.915 | 0.854 | 0.883 | 72.7 |
3 | XGB_model_CV | 0.894 | 0.874 | 0.845 | 0.860 | 4.3 |
4 | SVM_model_CV | 0.825 | 0.801 | 0.724 | 0.760 | 32.9 |
5 | KNN_model_CV | 0.864 | 0.847 | 0.787 | 0.816 | 33.0 |
6 | model_tf_CV | 0.616 | 0.000 | 0.000 | 0.000 | 180.6 |
evaluate_model_AA(list_model, X_test, y_test)
5/5 [==============================] - 0s 2ms/step
Model | Accuracy | Precision | Recall | F1_score | Latency | |
---|---|---|---|---|---|---|
0 | Logistic_model_cv | 0.754 | 0.659 | 0.617 | 0.637 | 2.0 |
1 | DT_model_CV | 0.806 | 0.744 | 0.681 | 0.711 | 3.0 |
2 | RF_model_CV | 0.799 | 0.738 | 0.660 | 0.697 | 56.8 |
3 | XGB_model_CV | 0.821 | 0.767 | 0.702 | 0.733 | 7.2 |
4 | SVM_model_CV | 0.828 | 0.800 | 0.681 | 0.736 | 9.3 |
5 | KNN_model_CV | 0.791 | 0.694 | 0.723 | 0.708 | 8.0 |
6 | model_tf_CV | 0.649 | 0.000 | 0.000 | 0.000 | 87.7 |
evaluate_model_AA(list_model, X_val, y_val)
5/5 [==============================] - 0s 2ms/step
Model | Accuracy | Precision | Recall | F1_score | Latency | |
---|---|---|---|---|---|---|
0 | Logistic_model_cv | 0.784 | 0.814 | 0.625 | 0.707 | 1.9 |
1 | DT_model_CV | 0.843 | 0.872 | 0.732 | 0.796 | 3.2 |
2 | RF_model_CV | 0.881 | 0.955 | 0.750 | 0.840 | 41.8 |
3 | XGB_model_CV | 0.873 | 0.915 | 0.768 | 0.835 | 2.0 |
4 | SVM_model_CV | 0.843 | 0.889 | 0.714 | 0.792 | 9.7 |
5 | KNN_model_CV | 0.821 | 0.833 | 0.714 | 0.769 | 6.5 |
6 | model_tf_CV | 0.582 | 0.000 | 0.000 | 0.000 | 89.2 |
No comments:
Post a Comment
Do provide us your feedback, it would help us serve your better.