Classification_Model-Part 1_0606

Classification models are a vital tool in data science, and Python offers a versatile ecosystem for building them. Leveraging libraries such as scikit-learn and TensorFlow, you can create powerful classifiers to categorize data into predefined groups. From traditional algorithms like Logistic Regression to cutting-edge neural networks, Python empowers data scientists to craft accurate and efficient classification models for a wide range of applications. Whether it's spam detection, sentiment analysis, or disease diagnosis, Python's rich resources make it the go-to choice for tackling classification challenges. In this blog, we delve into the fundamentals of various classification modeling techniques using Python. We not only explore the essential concepts but also guide you through the art of fine-tuning these models. Whether you're a beginner seeking a solid foundation or an experienced data scientist aiming to enhance your skills, our blog equips you with the knowledge and tools to build robust classifiers for diverse applications.

In [1]:

#Importing key libraries to be used in this project

import pandas as pd
import pandasql as ps
import numpy as np
import math as mt
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
from matplotlib import pyplot as plt

In [2]:

# Set the display options to show all columns

pd.set_option('display.max_columns', None)

In [3]:

# This is openly available data on many websites e.g. Kaggle
raw_data = pd.read_csv('input_dataset/titanic.csv')

In [4]:

raw_data.head(2)

Out[4]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C

In [5]:

# Checked the rows and columns in the data - feeling the data
raw_data.shape

Out[5]:

(891, 12)

In [6]:

raw_data.columns

Out[6]:

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Module 0.Exploratory Data Analysis (EDA)¶

In our blog, we've taken on the intriguing challenge of predicting survival on the Titanic using the renowned Kaggle 'Titanic dataset.' This dataset comprises details of 891 passengers who embarked on that fateful journey in 1912. As we dive into this historical voyage, we uncover the factors that influenced survival, where gender, age, and social class played pivotal roles. Armed with Python and classification models, we embark on a data-driven journey to develop a predictive model that sheds light on who would have likely survived this iconic disaster. Join us as we navigate the waters of data science to unearth the hidden insights within this Titanic dataset

Following are the fields in the data:

-- Name (str) - Name of the passenger 

-- Pclass (int) - Ticket class

-- Sex (str) - Sex of the passenger

-- Age (float) - Age in years

-- SibSp (int) - Number of siblings and spouses aboard

-- Parch (int) - Number of parents and children aboard

-- Ticket (str) - Ticket number

-- Fare (float) - Passenger fare

-- Cabin (str) - Cabin number

-- Embarked (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [7]:

# Checking basisc stats of the data
raw_data.describe().T  # T transposes the data

Out[7]:

	count	mean	std	min	25%	50%	75%	max
PassengerId	891.0	446.000000	257.353842	1.00	223.5000	446.0000	668.5	891.0000
Survived	891.0	0.383838	0.486592	0.00	0.0000	0.0000	1.0	1.0000
Pclass	891.0	2.308642	0.836071	1.00	2.0000	3.0000	3.0	3.0000
Age	714.0	29.699118	14.526497	0.42	20.1250	28.0000	38.0	80.0000
SibSp	891.0	0.523008	1.102743	0.00	0.0000	0.0000	1.0	8.0000
Parch	891.0	0.381594	0.806057	0.00	0.0000	0.0000	0.0	6.0000
Fare	891.0	32.204208	49.693429	0.00	7.9104	14.4542	31.0	512.3292

In [8]:

# Checking the y-variable distribution
raw_data.Survived.value_counts(), raw_data.Survived.value_counts(normalize = True)

Out[8]:

(0    549
 1    342
 Name: Survived, dtype: int64,
 0    0.616162
 1    0.383838
 Name: Survived, dtype: float64)

In [9]:

plt.figure(figsize=(3,3))
raw_data.Survived.value_counts(normalize = True).plot(kind = 'pie',
                            autopct = '%.2f%%',
                            labels = ['0', '1'],
                            title  = 'Distribution of Target Variable');

In [10]:

# Developed a modular function to create bar charts as required

def Bi_variate_analysis_plot(data,x,y):
    plt.figure(figsize=(3,3))
    xtab = pd.crosstab(index = data[x], columns = raw_data[y], normalize = 'columns')
    xtab.plot.bar()
    plt.ylabel("Count")
    plt.legend()
    return plt

In [11]:

Bi_variate_analysis_plot(data = raw_data ,x = 'Pclass',y = 'Survived')

Out[11]:

<module 'matplotlib.pyplot' from 'C:\\Users\\ragarwal\\Anaconda3\\lib\\site-packages\\matplotlib\\pyplot.py'>

<Figure size 300x300 with 0 Axes>

In [12]:

Bi_variate_analysis_plot(data = raw_data ,x = 'Parch',y = 'Survived')

Out[12]:

<module 'matplotlib.pyplot' from 'C:\\Users\\ragarwal\\Anaconda3\\lib\\site-packages\\matplotlib\\pyplot.py'>

<Figure size 300x300 with 0 Axes>

In [13]:

Bi_variate_analysis_plot(data = raw_data ,x = 'Embarked',y = 'Survived')

Out[13]:

<module 'matplotlib.pyplot' from 'C:\\Users\\ragarwal\\Anaconda3\\lib\\site-packages\\matplotlib\\pyplot.py'>

<Figure size 300x300 with 0 Axes>

In [14]:

# Let's see what missing values are we dealing with
raw_data.isna().sum()

Out[14]:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              5
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We have missing values in Age, Sex, Embarked and Cabin field. Looks like the "Cabin" field is hardly having anny data.

Next step after EDA is data cleaning, which includes Missing value treatment, Outlier Treatment etc.¶

Module 1. Missing Value Treatment¶

In [15]:

#Also, let' make a copy of data
raw_data_stg1 = raw_data.copy()

There could be several methods to treat missing values:¶

> Simplest method is deleting the records with missing values - Not preferred if too many records are getting deleted¶

> Deleting the columns having too many missing values¶

> Imputing missing values with Mode (categorical variable), Mean/Median (continuous variable)¶

> Defining missing as a separate category itself (generally used for categorical variable)¶

> Imputation using machine learning itself - using KNN [not covered in this blog, will write a detailed blog for this later]¶

Below we have demonstrated most of the methods:

In [16]:

# Let's first treat missing value in 'sex' variable with Mode
raw_data_stg1.groupby('Sex')['Survived'].count()

Out[16]:

Sex
female    314
male      572
Name: Survived, dtype: int64

In [17]:

# Another way to find Mode
raw_data_stg1['Sex'].mode()[0]

Out[17]:

'male'

In [18]:

# Mode of the varaibles in Male, let's impute it with Mode
raw_data_stg1['Sex_imputed_1'] = raw_data_stg1['Sex'].fillna(value = raw_data_stg1['Sex'].mode()[0])
raw_data_stg1.groupby('Sex_imputed_1')['Survived'].count()

Out[18]:

Sex_imputed_1
female    314
male      577
Name: Survived, dtype: int64

There is no set-rule to impute the missing values.¶

It is not a pure science, it is an "art", and you need to be as creative as an "artist"¶

In [19]:

# We have noticed that Name contains prefix such as "Mr.", "Mrs.","Ms.", "Master" etc.
# We can use these to impute the Sex variable
raw_data_stg1[['Sex','Name']].head(5)

Out[19]:

	Sex	Name
0	male	Braund, Mr. Owen Harris
1	female	Cumings, Mrs. John Bradley (Florence Briggs Th...
2	female	Heikkinen, Miss. Laina
3	female	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	male	Allen, Mr. William Henry

In [20]:

raw_data_stg1['Salutation_temp'] = raw_data_stg1.Name.str.split(", ", expand = True)[1]

raw_data_stg1['Salutation'] = raw_data_stg1.Salutation_temp.str.split(" ", expand = True)[0]

raw_data_stg1[['Name','Salutation_temp','Salutation']].head()

Out[20]:

	Name	Salutation_temp	Salutation
0	Braund, Mr. Owen Harris	Mr. Owen Harris	Mr.
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	Mrs. John Bradley (Florence Briggs Thayer)	Mrs.
2	Heikkinen, Miss. Laina	Miss. Laina	Miss.
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	Mrs. Jacques Heath (Lily May Peel)	Mrs.
4	Allen, Mr. William Henry	Mr. William Henry	Mr.

In [21]:

raw_data_stg1.groupby('Salutation')['Salutation'].count()

Out[21]:

Salutation
Capt.          1
Col.           2
Don.           1
Dr.            7
Jonkheer.      1
Lady.          1
Major.         2
Master.       40
Miss.        182
Mlle.          2
Mme.           1
Mr.          517
Mrs.         125
Ms.            1
Rev.           6
Sir.           1
the            1
Name: Salutation, dtype: int64

The above information, will not just help in imputing the missing values, but also can provide us more variables

In [22]:

raw_data_stg1['check'] = raw_data_stg1['Sex'].isnull()
check_missing = raw_data_stg1[raw_data_stg1.check == True]
check_missing

Out[22]:

	PassengerId	Survived	Pclass	Name	Sex	Age	Ticket	Fare	Cabin	Embarked	Sex_imputed_1	Salutation_temp	Salutation	check
12	13	0	3	Saundercock, Mr. William Henry	NaN	20.0	A/5. 2151	8.0500	NaN	S	male	Mr. William Henry	Mr.	True
21	22	1	2	Beesley, Mr. Lawrence	NaN	34.0	248698	13.0000	D56	S	male	Mr. Lawrence	Mr.	True
26	27	0	3	Emir, Mr. Farred Chehab	NaN	NaN	2631	7.2250	NaN	C	male	Mr. Farred Chehab	Mr.	True
29	30	0	3	Todoroff, Mr. Lalio	NaN	NaN	349216	7.8958	NaN	S	male	Mr. Lalio	Mr.	True
42	43	0	3	Kraeff, Mr. Theodor	NaN	NaN	349253	7.8958	NaN	C	male	Mr. Theodor	Mr.	True

In [23]:

raw_data_stg1['Sex_imputed_2'] = raw_data_stg1['Sex'] 

raw_data_stg1.loc[raw_data_stg1['Salutation'].isin(['Mr.','Master.']) 
                  &  raw_data_stg1['Sex'].isnull() , 'Sex_imputed_2'] = 'male'

raw_data_stg1.loc[raw_data_stg1['Salutation'].isin(['Lady.','Miss.','Mrs.','Ms.']) 
                  &  raw_data_stg1['Sex'].isnull(), 'Sex_imputed_2'] = 'female'

In [24]:

raw_data_stg1['check'] = raw_data_stg1['Sex'].isnull()

check_missing = raw_data_stg1[raw_data_stg1.check == True]

check_missing[['Sex','Salutation','Sex_imputed_1','Sex_imputed_2']]

Out[24]:

	Sex	Salutation	Sex_imputed_1	Sex_imputed_2
12	NaN	Mr.	male	male
21	NaN	Mr.	male	male
26	NaN	Mr.	male	male
29	NaN	Mr.	male	male
42	NaN	Mr.	male	male

In [25]:

# let's now check the data where age is missing
raw_data_stg1['check'] = raw_data_stg1['Age'].isnull()

check_missing = raw_data_stg1[raw_data_stg1.check == True]

check_missing.head()

Out[25]:

	PassengerId	Survived	Pclass	Name	Sex	Age	Ticket	Fare	Cabin	Embarked	Sex_imputed_1	Salutation_temp	Salutation	check	Sex_imputed_2
5	6	0	3	Moran, Mr. James	male	NaN	330877	8.4583	NaN	Q	male	Mr. James	Mr.	True	male
17	18	1	2	Williams, Mr. Charles Eugene	male	NaN	244373	13.0000	NaN	S	male	Mr. Charles Eugene	Mr.	True	male
19	20	1	3	Masselmani, Mrs. Fatima	female	NaN	2649	7.2250	NaN	C	female	Mrs. Fatima	Mrs.	True	female
26	27	0	3	Emir, Mr. Farred Chehab	NaN	NaN	2631	7.2250	NaN	C	male	Mr. Farred Chehab	Mr.	True	male
28	29	1	3	O'Dwyer, Miss. Ellen "Nellie"	female	NaN	330959	7.8792	NaN	Q	female	Miss. Ellen "Nellie"	Miss.	True	female

In [26]:

# Another usage of salutation variable - We can calculate the bucket wise median age for
# Master, Mr, Ms. Mrs. and can do a better imputation
raw_data_stg1['median_age'] = raw_data_stg1.groupby(['Salutation'])['Age'].transform('median')

raw_data_stg1['Age'] = raw_data_stg1['Age'].fillna(value = raw_data_stg1['median_age'])

raw_data_stg1.head()

Out[26]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Sex_imputed_1	Salutation_temp	Salutation	check	Sex_imputed_2	median_age
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	male	Mr. Owen Harris	Mr.	False	male	30.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	female	Mrs. John Bradley (Florence Briggs Thayer)	Mrs.	False	female	35.0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	female	Miss. Laina	Miss.	False	female	21.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	female	Mrs. Jacques Heath (Lily May Peel)	Mrs.	False	female	35.0
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	male	Mr. William Henry	Mr.	False	male	30.0

In [27]:

raw_data_stg1.isna().sum()

Out[27]:

PassengerId          0
Survived             0
Pclass               0
Name                 0
Sex                  5
Age                  0
SibSp                0
Parch                0
Ticket               0
Fare                 0
Cabin              687
Embarked             2
Sex_imputed_1        0
Salutation_temp      0
Salutation           0
check                0
Sex_imputed_2        0
median_age           0
dtype: int64

In [28]:

raw_data_stg1['Embarked'].value_counts()

Out[28]:

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [29]:

raw_data_stg1['Embarked'].mode()[0]

Out[29]:

'S'

In [30]:

# For categorical variables, let's crete missing as a seperate category
raw_data_stg1['Embarked'] = raw_data_stg1['Embarked'].fillna(value = raw_data_stg1['Embarked'].mode()[0])

For the last variable with missing values - Cabin, we can drop the variable, since there are too many missing values.

However, dropping a variable should be last resort, if We can get some information out of it

In [31]:

# We created a dummy variable - if it is present, calling it 1 else calling it 0
raw_data_stg1['Cabin_ind'] = np.where(raw_data_stg1['Cabin'].isnull(), 0, 1)
raw_data_stg1.head()

Out[31]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Sex_imputed_1	Salutation_temp	Salutation	check	Sex_imputed_2	median_age	Cabin_ind
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	male	Mr. Owen Harris	Mr.	False	male	30.0	0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	female	Mrs. John Bradley (Florence Briggs Thayer)	Mrs.	False	female	35.0	1
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	female	Miss. Laina	Miss.	False	female	21.0	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	female	Mrs. Jacques Heath (Lily May Peel)	Mrs.	False	female	35.0	1
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	male	Mr. William Henry	Mr.	False	male	30.0	0

In [32]:

# This is just for demo purpose, we can use threshold method to drop a variables which has less populated values
# thresh - threshold for non-missing values
# Suppose, we want to keep varaibles which have at least 95% data

Threshold_cutoff = round(0.95* raw_data_stg1.shape[0],0)
Threshold_cutoff

x = raw_data_stg1.dropna(thresh =  Threshold_cutoff, axis = 1) 
x.shape

Out[32]:

(891, 18)

In [33]:

# Not recommended, but following can be used to delete all the records with any missing values in any column
x = raw_data_stg1.dropna(axis = 0)

# axis = 0 means row wise operation, axis = 1 means column wise operation

In [34]:

raw_data_stg1.isna().sum()

Out[34]:

PassengerId          0
Survived             0
Pclass               0
Name                 0
Sex                  5
Age                  0
SibSp                0
Parch                0
Ticket               0
Fare                 0
Cabin              687
Embarked             0
Sex_imputed_1        0
Salutation_temp      0
Salutation           0
check                0
Sex_imputed_2        0
median_age           0
Cabin_ind            0
dtype: int64

In [35]:

raw_data_stg1.shape

Out[35]:

(891, 19)

Module 2- Outlier Treatment¶

In [36]:

raw_data_stg2 = raw_data_stg1.copy()

In [37]:

# Seaborn is one of the commonly used libraries for data visualization using graphs and charts

import seaborn as sns
sns.set()
sns.set(style="darkgrid")

In [38]:

#generate histogram of all contionous variables
raw_data_stg2[['Age','Fare']].hist(figsize=(15,30),layout=(9,3));

In [39]:

# Let's first check age variable
sns.boxplot(data = raw_data_stg1, x = 'Age', orient='horizontal')

Out[39]:

<AxesSubplot:xlabel='Age'>

Although, in the box plot, it is showing data to have outlier (data outside the whiskers), however as per common/business sense age 80 is totally possible and hence using the principle if a data point can be explained, it is not an outlier, we will not treat it.

In [40]:

sns.set(style="darkgrid")
sns.boxplot(data = raw_data_stg1, x = 'Fare', orient='horizontal')

Out[40]:

<AxesSubplot:xlabel='Fare'>

In [41]:

Q1 = raw_data_stg2['Fare'].quantile(0.25)
Q3 = raw_data_stg2['Fare'].quantile(0.75)

whisker_1 = Q1 - (1.5*(Q3-Q1))
whisker_2 = Q3 + (1.5*(Q3-Q1))

whisker_1, whisker_2

Out[41]:

(-26.724, 65.6344)

In [42]:

Fare_dist = ps.sqldf('''
            select Fare 
            , count(Fare) as N
            from raw_data_stg2
            group by Fare
            order by Fare desc
        ''')
Fare_dist.head()

Out[42]:

	Fare	N
0	512.3292	3
1	263.0000	4
2	262.3750	2
3	247.5208	2
4	227.5250	4

In [43]:

# Using upper bound from the whisker plot, we can categorized the fare into "High" and "Low" categories
raw_data_stg2['High_FARE'] = 'No'
raw_data_stg2.loc[raw_data_stg2['Fare'] > 66 , 'High_FARE'] ='Yes'
raw_data_stg2.head()

Out[43]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Sex_imputed_1	Salutation_temp	Salutation	check	Sex_imputed_2	median_age	Cabin_ind	High_FARE
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	male	Mr. Owen Harris	Mr.	False	male	30.0	0	No
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	female	Mrs. John Bradley (Florence Briggs Thayer)	Mrs.	False	female	35.0	1	Yes
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	female	Miss. Laina	Miss.	False	female	21.0	0	No
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	female	Mrs. Jacques Heath (Lily May Peel)	Mrs.	False	female	35.0	1	No
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	male	Mr. William Henry	Mr.	False	male	30.0	0	No

In [44]:

cross_tab = pd.crosstab(index=raw_data_stg2['High_FARE'], columns=raw_data_stg2['Survived'])
print(cross_tab)

Survived     0    1
High_FARE          
No         512  263
Yes         37   79

In this case also, Looks like population with "High fare" is having signifactly higher survival rate, so it is not logical to treat the variable for outlier. We rather need to retain this information.

Looks like rich people got better opportunity for survival!

But suppose, we need to treat the outlier, it can be done following way.¶

This is just for demonstration purpose!

In [45]:

raw_data_stg2['Fare_demo'] =raw_data_stg2.Fare.apply(lambda x: whisker_2 if x > whisker_2
                                                     else whisker_1 if x < whisker_1
                                                     else x)

In [46]:

sns.set()
sns.set(style="darkgrid")
sns.boxplot(data = raw_data_stg2, x = 'Fare_demo', orient='horizontal')

Out[46]:

<AxesSubplot:xlabel='Fare_demo'>

In [47]:

raw_data_stg2.head(2)

Out[47]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Sex_imputed_1	Salutation_temp	Salutation	check	Sex_imputed_2	median_age	Cabin_ind	High_FARE	Fare_demo
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	male	Mr. Owen Harris	Mr.	False	male	30.0	0	No	7.2500
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	female	Mrs. John Bradley (Florence Briggs Thayer)	Mrs.	False	female	35.0	1	Yes	65.6344

Module 3- Variables Transformation¶

The maximum time and efforts of any model building project should be spent in this phase only, one needs to be as creative as possible with feature engineering, to unlock maximum potential of the data.

Golden rule : the better the variables, the better the model

For classification Model, following are the key transformations used:¶

>>> For Nominal categorical variables, we create dummy variables - also called hot encoding, will show here¶

>>> For Ordinal categorical variables, we should create index varaibles - also called label encoding, will show here¶

>>> Continuous variables are transformed with either:¶

     Binning method
     Scaling method

>>> We should try to derive new variables by:¶

    Logical derivations e.g. Age from DOB
    Variables interactions
    Smart variables with the help of a Deicison Tree

>>> Combinig sparse classes¶

In [48]:

raw_data_stg3 = raw_data_stg2.copy()

A. Combining Sparse Classes¶

Often while creating dummy variables for a categorical variables, we end-up getting too many variables. Suppose there are N classes, ideally we get N-1 dummy variables for it. What if N is 1000, we will get 999 varaibles, which is a very high number.

What can be done in order to control the # of variables?

Method 1 : Using WOE method, as explained later in this module, we can combine the classes and reduce number of variables
Method 2 : Combining sparse class logically

In [49]:

# In this data ,we don't have sparse classes so we are skipping it. In another project, we will demo it
# WOE concept and method has been explained a little later in this blog

B : Hot encoding for dummy variables creation for Nominal varaibles e.g. Sex_imputed_2¶

In [50]:

# In this step, you can either use Sex_imputed_2 , Sex_imputed_1 computed from above steps.

# as a matter of best practice, I am prefering Sex_imputed_2 since it is more rational and logical

In [51]:

raw_data_stg3['Sex'] = raw_data_stg3['Sex_imputed_2']

In [52]:

# dropping the variables
raw_data_stg3.drop(['Sex_imputed_2','Sex_imputed_1'], axis=1, inplace=True)

In [53]:

var_to_be_encoded = ['Sex','Embarked']

data_encoded = raw_data_stg3[var_to_be_encoded]
data_encoded = pd.get_dummies(data_encoded)
data_encoded.head(2)

Out[53]:

	Sex_female	Sex_male	Embarked_C	Embarked_Q	Embarked_S
0	0	1	0	0	1
1	1	0	1	0	0

In [54]:

##Adding back the encoded dummy variables data to the main dataframe

raw_data_stg3 = pd.concat([raw_data_stg3,data_encoded], axis = 1)
raw_data_stg3.head(2)

Out[54]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Salutation_temp	Salutation	check	median_age	Cabin_ind	High_FARE	Fare_demo	Sex_female	Sex_male	Embarked_C	Embarked_Q	Embarked_S
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	Mr. Owen Harris	Mr.	False	30.0	0	No	7.2500	0	1	0	0	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	Mrs. John Bradley (Florence Briggs Thayer)	Mrs.	False	35.0	1	Yes	65.6344	1	0	1	0	0

In [55]:

raw_data_stg3.columns

Out[55]:

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Salutation_temp',
       'Salutation', 'check', 'median_age', 'Cabin_ind', 'High_FARE',
       'Fare_demo', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q',
       'Embarked_S'],
      dtype='object')

C. Label encoding for ordinal varaibles¶

In [56]:

# As such source sould not be a ordinal variables, but for illustration purpose, let's consider it to be so
raw_data_stg3.Embarked.unique()

# In that case, instead of creating dummy varaibles, we should create an "Label encoded ordinal variable"
# If we create dummy for oridinal varaible, we lose out "order" information
# For creating label encoded variables, we need to know the order of the values in the variable,
# Suppose in the variables "Embarked"  -->   Q > C > S, we can do following:

Out[56]:

array(['S', 'C', 'Q'], dtype=object)

In [57]:

# This variables has been created just for the demo purpose
raw_data_stg3['Embarked_DEMO_Label_Encoding'] = raw_data_stg3.Embarked.map({'S'   : 1,
                                                              'C'  : 2,
                                                              'Q' :  3})
raw_data_stg3.head(2)

Out[57]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Salutation_temp	Salutation	check	median_age	Cabin_ind	High_FARE	Fare_demo	Sex_female	Sex_male	Embarked_C	Embarked_Q	Embarked_S	Embarked_DEMO_Label_Encoding
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	Mr. Owen Harris	Mr.	False	30.0	0	No	7.2500	0	1	0	0	1	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	Mrs. John Bradley (Florence Briggs Thayer)	Mrs.	False	35.0	1	Yes	65.6344	1	0	1	0	0	2

In [58]:

raw_data_stg3.dtypes

Out[58]:

PassengerId                       int64
Survived                          int64
Pclass                            int64
Name                             object
Sex                              object
Age                             float64
SibSp                             int64
Parch                             int64
Ticket                           object
Fare                            float64
Cabin                            object
Embarked                         object
Salutation_temp                  object
Salutation                       object
check                              bool
median_age                      float64
Cabin_ind                         int32
High_FARE                        object
Fare_demo                       float64
Sex_female                        uint8
Sex_male                          uint8
Embarked_C                        uint8
Embarked_Q                        uint8
Embarked_S                        uint8
Embarked_DEMO_Label_Encoding      int64
dtype: object

D. Logical derivation of the new variables¶

This is one of the most important part of machine learning exercise, where data scientist needs to be creative as well as domain savvy. In-depth knowledge of business domain can help data scientist to come up with powerful and logical variables.

With limited data in this case, we don't have much scope, but then "never say no"!

Let's try to derive at least one variable!

In [59]:

raw_data_stg3['Family'] = raw_data_stg3['SibSp']  + raw_data_stg3['Parch'] 

In [60]:

to_scale = raw_data_stg3[['Family']]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(to_scale)

scaled_data_M1 = pd.DataFrame(scaled_data, columns=['Family'])
scaled_data_M1 = scaled_data_M1.rename(columns = {'Family':'Family_Scaled'})

# Standard scaler is used to remove scale from the continuous numerical variables

In [61]:

raw_data_stg3 = pd.concat([raw_data_stg3,scaled_data_M1], axis = 1)
raw_data_stg3.head()

Out[61]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Salutation_temp	Salutation	check	median_age	Cabin_ind	High_FARE	Fare_demo	Sex_female	Sex_male	Embarked_C	Embarked_S	Embarked_DEMO_Label_Encoding	Family	Family_Scaled
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	Mr. Owen Harris	Mr.	False	30.0	0	No	7.2500	0	1	0	1	1	1	0.059160
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	Mrs. John Bradley (Florence Briggs Thayer)	Mrs.	False	35.0	1	Yes	65.6344	1	0	1	0	2	1	0.059160
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	Miss. Laina	Miss.	False	21.0	0	No	7.9250	1	0	0	1	1	0	-0.560975
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	Mrs. Jacques Heath (Lily May Peel)	Mrs.	False	35.0	1	No	53.1000	1	0	0	1	1	1	0.059160
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	Mr. William Henry	Mr.	False	30.0	0	No	8.0500	0	1	0	1	1	0	-0.560975

E. Treat continuous varaibles - let's do "BINNING" on age and scaling on purchase value¶

There are multiple methods to create Bin

Plot and decide based on your judgement
Make small bin, and then group small bin on the basis of Weigth of Evidence, as explained by Naeem Siddiqui in his book - Credit Risk Scorecard
Draw a decision tree, which give a better binning based on actual data

Method 1 : Plot and decide based on your judgement¶

In [62]:

max_value = raw_data_stg3['Age'].max()
print(max_value)

80.0

In [63]:

import matplotlib.pyplot as plt
# to enable the displaying of plots in the notebook itself
%matplotlib inline
# define title
plt.title('Age Distribution')
# define xlabel
plt.xlabel('Age')
# define ylabel
plt.ylabel('Frequency')
# plot the histogram
plt.hist(raw_data_stg3['Age'], bins=99, color='orange');

In [64]:

# Method 1
# Since there is not enough data after age 50, it became a "sparse class"
#, hence we clubbed all age above 50 into one cateogry

# Let's make a functions that can be resused to make Bins, we are avoiding using pd.cut method

def bin_formation():
    for i in range(0,len(Bin_input)):
        if i == len(Bin_input)-1:
            pass
        else:
            start_value = Bin_input[i]
            end_value =  Bin_input[i+1]
            label = str(start_value) + "_" + str(end_value)
            Bin_output.append([start_value,end_value,label])

def bin_based_class(var):
    bin_formation()
    for i in range(0,len(Bin_output)):
        start_value = Bin_output[i][0]
        end_value = Bin_output[i][1]
        label_value = Bin_output[i][2]
        if var >= start_value and var <= end_value:
            var_output = label_value
            return(var_output)
        
# For using above function, we need to define two objects:
    # Bin_input - a list with desired cut values
    # Bin_output - a Blank List
    
#NOTE : This function is a copyright of Ask Analytics

In [65]:

Bin_input = [0,20,40,60,100]
Bin_output =[]

raw_data_stg3['age_bins_M1'] =  raw_data_stg3['Age'].apply(bin_based_class)

In [66]:

raw_data_stg3[['Age','age_bins_M1']].head(5)

Out[66]:

	Age	age_bins_M1
0	22.0	20_40
1	38.0	20_40
2	26.0	20_40
3	35.0	20_40
4	35.0	20_40

Method 2 - using Weight of evidence method, for which first we define the smaller bins, and then we can club micro- bins into macro-bins¶

In [67]:

# Let's first define micro-bins
Bin_input = []
for i in range (0,100,10):
    Bin_input.append(i)
Bin_output =[]

In [68]:

print(Bin_input)

[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

In [69]:

raw_data_stg3['age_bins_M2'] = raw_data_stg3['Age'].apply(bin_based_class)
raw_data_stg3[['Age','age_bins_M2']].head(5)

Out[69]:

	Age	age_bins_M2
0	22.0	20_30
1	38.0	30_40
2	26.0	20_30
3	35.0	30_40
4	35.0	30_40

In [70]:

# Let's now calculate the WOE table:

WOE_1 = ps.sqldf('''
            select age_bins_M2 
            , sum(case when Survived = 1 then 1 else 0 END) as GOOD_NUM
            , sum(case when Survived = 0 then 1 else 0 END) as BAD_NUM
            from raw_data_stg3
            group by age_bins_M2
        ''')
WOE_1

Out[70]:

	age_bins_M2	GOOD_NUM	BAD_NUM
0	0_10	40	28
1	10_20	44	71
2	20_30	120	265
3	30_40	83	89
4	40_50	33	54
5	50_60	17	25
6	60_70	4	13
7	70_80	1	4

In [71]:

sum_good =  WOE_1.GOOD_NUM.sum()
sum_bad =  WOE_1.BAD_NUM.sum()
sum_good, sum_bad

Out[71]:

(342, 549)

In [72]:

WOE_1['Good_precent'] = WOE_1.GOOD_NUM.apply(lambda x: round(x/sum_good*100,1))
WOE_1['Bad_precent']= WOE_1.BAD_NUM.apply(lambda x: round(x/sum_bad*100,1))
WOE_1

Out[72]:

	age_bins_M2	GOOD_NUM	BAD_NUM	Good_precent	Bad_precent
0	0_10	40	28	11.7	5.1
1	10_20	44	71	12.9	12.9
2	20_30	120	265	35.1	48.3
3	30_40	83	89	24.3	16.2
4	40_50	33	54	9.6	9.8
5	50_60	17	25	5.0	4.6
6	60_70	4	13	1.2	2.4
7	70_80	1	4	0.3	0.7

In [73]:

def WOE_CALC(x,y):
    if y == 0:
        z = 0
    else:
        z = mt.log(x / y)
    return(z)


WOE_CALC(20,2)

Out[73]:

2.302585092994046

In [74]:

WOE_1['WOE'] = WOE_1.apply(lambda row: WOE_CALC(row['Good_precent'], row['Bad_precent']), axis=1)
WOE_1

Out[74]:

	age_bins_M2	GOOD_NUM	BAD_NUM	Good_precent	Bad_precent	WOE
0	0_10	40	28	11.7	5.1	0.830348
1	10_20	44	71	12.9	12.9	0.000000
2	20_30	120	265	35.1	48.3	-0.319230
3	30_40	83	89	24.3	16.2	0.405465
4	40_50	33	54	9.6	9.8	-0.020619
5	50_60	17	25	5.0	4.6	0.083382
6	60_70	4	13	1.2	2.4	-0.693147
7	70_80	1	4	0.3	0.7	-0.847298

In [75]:

WOE_1['Information_Value'] = (WOE_1['Good_precent'] - WOE_1['Bad_precent'])* WOE_1['WOE']/100
WOE_1

Out[75]:

	age_bins_M2	GOOD_NUM	BAD_NUM	Good_precent	Bad_precent	WOE	Information_Value
0	0_10	40	28	11.7	5.1	0.830348	0.054803
1	10_20	44	71	12.9	12.9	0.000000	0.000000
2	20_30	120	265	35.1	48.3	-0.319230	0.042138
3	30_40	83	89	24.3	16.2	0.405465	0.032843
4	40_50	33	54	9.6	9.8	-0.020619	0.000041
5	50_60	17	25	5.0	4.6	0.083382	0.000334
6	60_70	4	13	1.2	2.4	-0.693147	0.008318
7	70_80	1	4	0.3	0.7	-0.847298	0.003389

We can also calculate the IV here to judge whether is variable is a strong variable or not:

http://www.askanalytics.in/2015/09/concept-of-woe-and-iv.html

In [76]:

WOE_1['Information_Value'].sum()
# value of 0.14 means it is a very strong variable

Out[76]:

0.14186580109668284

Here's a general guideline to interpret the strength of correlation based on IV values:

WOE_1 < 0.02: Indicates no practical predictive power. 0.02 ≤ WOE_1 < 0.1: Suggests weak predictive power. 0.1 ≤ WOE_1 < 0.3: Suggests moderate predictive power. 0.3 ≤ WOE_1 < 0.5: Indicates strong predictive power. WOE_1 ≥ 0.5: Suggests very strong predictive power.

So, from above information value, we can take that age has moderate predictive power on survived people

In [77]:

#Let's plot the WOE
sns.barplot(x="age_bins_M2", y="WOE", data=WOE_1)

Out[77]:

<AxesSubplot:xlabel='age_bins_M2', ylabel='WOE'>

In [78]:

# Using Above Plot, we can now Bin age into:
    # O to 20
    # 20 to 30
    # 30 to 40
    # 40 to 60
    # 60 and above

In [79]:

Bin_input = [0,20,30,40,60,90]
Bin_output =[]
raw_data_stg3['age_bins_M2'] = raw_data_stg3['Age'].apply(bin_based_class)
raw_data_stg3[['Age','age_bins_M2']].head(5)

Out[79]:

	Age	age_bins_M2
0	22.0	20_30
1	38.0	30_40
2	26.0	20_30
3	35.0	30_40
4	35.0	30_40

In [80]:

raw_data_stg3.head()

Out[80]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Salutation_temp	Salutation	check	median_age	Cabin_ind	High_FARE	Fare_demo	Sex_female	Sex_male	Embarked_C	Embarked_S	Embarked_DEMO_Label_Encoding	Family	Family_Scaled	age_bins_M1	age_bins_M2
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	Mr. Owen Harris	Mr.	False	30.0	0	No	7.2500	0	1	0	1	1	1	0.059160	20_40	20_30
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	Mrs. John Bradley (Florence Briggs Thayer)	Mrs.	False	35.0	1	Yes	65.6344	1	0	1	0	2	1	0.059160	20_40	30_40
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	Miss. Laina	Miss.	False	21.0	0	No	7.9250	1	0	0	1	1	0	-0.560975	20_40	20_30
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	Mrs. Jacques Heath (Lily May Peel)	Mrs.	False	35.0	1	No	53.1000	1	0	0	1	1	1	0.059160	20_40	30_40
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	Mr. William Henry	Mr.	False	30.0	0	No	8.0500	0	1	0	1	1	0	-0.560975	20_40	30_40

Method 3 - Binning with the help of decision tree model¶

In [81]:

#seperating independent and dependent variables
y = raw_data_stg3['Survived']
x = raw_data_stg3['Age'].values.reshape(-1,1)

In [82]:

#importing decision tree classifier 
from sklearn import tree 
from sklearn.tree import DecisionTreeClassifier
#creating the decision tree function
dt_model = DecisionTreeClassifier(random_state=10, max_depth=4)
#fitting the model
model = dt_model.fit(x, y)
plt.figure(figsize=(12,12))
tree.plot_tree(dt_model)
plt.show()

From Above Tree, we can identify the optimal bucket for age to be <=10, 30 to 50 and More than 50¶

In [83]:

Bin_input = [0,10,30,50,100]
Bin_output =[]
raw_data_stg3['age_bins_M3'] = raw_data_stg3['Age'].apply(bin_based_class)
raw_data_stg3[['Age','age_bins_M3']].head(5)

Out[83]:

	Age	age_bins_M3
0	22.0	10_30
1	38.0	30_50
2	26.0	10_30
3	35.0	30_50
4	35.0	30_50

In [84]:

raw_data_stg3.age_bins_M3.unique()

Out[84]:

array(['10_30', '30_50', '50_100', '0_10'], dtype=object)

In [85]:

# Having learnt 3 methods for Binning, let use method 3 Bins. We now need to label encode the bin
# since variable is ordinal in nature

raw_data_stg3['age_encoded'] = raw_data_stg3.age_bins_M3.map({'0_10'   : 1,
                                                              '10_30'  : 2,
                                                              '30_50' :  3,
                                                              '50_100' : 4})
raw_data_stg3.head(2)

Out[85]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Salutation_temp	Salutation	check	median_age	Cabin_ind	High_FARE	Fare_demo	Sex_female	Sex_male	Embarked_C	Embarked_Q	Embarked_S	Embarked_DEMO_Label_Encoding	Family	Family_Scaled	age_bins_M1	age_bins_M2	age_bins_M3	age_encoded
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	Mr. Owen Harris	Mr.	False	30.0	0	No	7.2500	0	1	0	0	1	1	1	0.05916	20_40	20_30	10_30	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	Mrs. John Bradley (Florence Briggs Thayer)	Mrs.	False	35.0	1	Yes	65.6344	1	0	1	0	0	2	1	0.05916	20_40	30_40	30_50	3

F. Variable Scaling¶

There are two popular methods for scaling

Standard Scaling (using Mean and Standard deviation) - In this method output values can be both -ve and +ve

Min-Max Scaling - value in this method comes in range of 0 to 1

Maths behind Standard Scaling Z = (X - Mean) / Standard Deviation

Maths behind Min-Max Scaling K = (X - Min) / (Max - min)

Method 1 - Standard Scaling¶

In [86]:

to_scale = raw_data_stg3[['Fare']]

In [87]:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(to_scale)

scaled_data_M1 = pd.DataFrame(scaled_data, columns=['Fare_M1'])
scaled_data_M1.head()

Out[87]:

	Fare_M1
0	-0.502445
1	0.786845
2	-0.488854
3	0.420730
4	-0.486337

Method 2 - Min-Max Scaling¶

In [88]:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(to_scale)
scaled_data_M2 = pd.DataFrame(scaled_data, columns=['Fare_M2'])
scaled_data_M2.head()

Out[88]:

	Fare_M2
0	0.014151
1	0.139136
2	0.015469
3	0.103644
4	0.015713

In [89]:

# Scaling doesnt change the distribution of the variables, we can visualize it
both_scaled_vars = pd.concat([scaled_data_M1,scaled_data_M2],axis = 1)
both_scaled_vars.shape

Out[89]:

(891, 2)

In [90]:

sns.relplot(x="Fare_M1", y="Fare_M2", data=both_scaled_vars, kind="scatter")
# You will find the variables are 100% correlated

Out[90]:

<seaborn.axisgrid.FacetGrid at 0x251f155bac0>

In [91]:

raw_data_stg3.shape, both_scaled_vars.shape

Out[91]:

((891, 31), (891, 2))

In [92]:

raw_data_stg3 = raw_data_stg3.reset_index(drop=True)
both_scaled_vars = both_scaled_vars.reset_index(drop=True)
raw_data_stg3 = pd.concat([raw_data_stg3,both_scaled_vars], axis = 1)

In [93]:

temp = raw_data_stg3[['Survived','Pclass','age_encoded','Family_Scaled','Fare_M2','Cabin_ind','Sex_female','Embarked_C','Embarked_Q']]
temp.head()

Out[93]:

	Survived	Pclass	age_encoded	Family_Scaled	Fare_M2	Cabin_ind	Sex_female	Embarked_C
0	0	3	2	0.059160	0.014151	0	0	0
1	1	1	3	0.059160	0.139136	1	1	1
2	1	3	2	-0.560975	0.015469	0	1	0
3	1	1	3	0.059160	0.103644	1	1	0
4	0	3	3	-0.560975	0.015713	0	0	0

In [94]:

corr_matrix = temp.corr()

In [95]:

# Plot the correlation matrix using seaborn
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm',annot_kws={"size": 10})

Out[95]:

<AxesSubplot:>

In [96]:

# Since Pclass and Cabin_Ind are highly correlated, it is suggested not to take both variables in the model
# As it would lead to multicollinearity

F. Variables interaction¶

Often variables inherently are not strong explainers, but become very stronger varriables when combined with each other.

In [97]:

# This we will cover in some other blog

Module 4 - Model Creation¶

Let's first build a logistic regression model as our first classification model and discuss in details
We will build other models - Decision Tree, Random Forest, XGBoost, SVM etc.

The Logistic Regression Model¶

In [98]:

raw_data_stg3.columns

Out[98]:

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Salutation_temp',
       'Salutation', 'check', 'median_age', 'Cabin_ind', 'High_FARE',
       'Fare_demo', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q',
       'Embarked_S', 'Embarked_DEMO_Label_Encoding', 'Family', 'Family_Scaled',
       'age_bins_M1', 'age_bins_M2', 'age_bins_M3', 'age_encoded', 'Fare_M1',
       'Fare_M2'],
      dtype='object')

In [99]:

# First let's select the variables to be used, removing extra varaibles that we creted for demonstration purpuse only
raw_data_stg4 = raw_data_stg3[['Pclass', 'Sex_female','Family_Scaled',
                                'age_encoded','Fare_M2','Survived']]
raw_data_stg4.head()

Out[99]:

	Pclass	Sex_female	Family_Scaled	age_encoded	Fare_M2	Survived
0	3	0	0.059160	2	0.014151	0
1	1	1	0.059160	3	0.139136	1
2	3	1	-0.560975	2	0.015469	1
3	1	1	0.059160	3	0.103644	1
4	3	0	-0.560975	3	0.015713	0

As a first step of modeling exercise, we need to create separate data for¶

- y variable
- x variables

In [100]:

x = raw_data_stg4.drop(['Survived'], axis =1)
y = raw_data_stg4['Survived']
x.shape, y.shape

Out[100]:

((891, 5), (891,))

In [101]:

# Importing from sklearn, train test split function
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(x,y, test_size = 0.3, random_state = 45,stratify = y)

X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, random_state= 1234, test_size = 0.5)

Train data - The data on which we will build the model
Test data - The data on which we will check the performance of model
test_size defines the proportion of data you want to keep for testing/validation
random_state - If not defined, everytime you run the code, since selection is random, you get a new sample everytime. To make the sampling static, we use it
stratify - Maintains the proportion of 1 and 0 same across the traing and test data

In the above code, we broke data into 3 portions:

Train data - 70%
Test data = 15%
Validation data = 15%

Validation is to test FINAL model on completely unseen data

In [102]:

X_train.shape, X_test.shape, X_val.shape

Out[102]:

((623, 5), (134, 5), (134, 5))

In [103]:

y_train.value_counts(), y_test.value_counts(), y_val.value_counts()

Out[103]:

(0    384
 1    239
 Name: Survived, dtype: int64,
 0    87
 1    47
 Name: Survived, dtype: int64,
 0    78
 1    56
 Name: Survived, dtype: int64)

Before running the model, it is useful to remove the variables which have "Multicollinearity"¶

If independent variables have relation(s) among each other, it is best practice to remove such variables
In case of having Multicollinearity, the beta coefficients are not true representative of the relationship between y and X

In [104]:

#importing the model building and model evaluation libaries
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, confusion_matrix, roc_curve
from sklearn.metrics import precision_score, recall_score, precision_recall_curve
from sklearn.feature_selection import RFE

In [105]:

# Creating an instance of Logistic Regresssion
Logistic_M = LogisticRegression()
# Fitting the Logistic Regresssion model
Logistic_model = Logistic_M.fit(X_train, y_train)

In [106]:

# Dumping the model in a pickle file so that it can be reuse it without the need to re-train it,
# and also can be deployed for production.

import joblib
joblib.dump(Logistic_model, 'Logistic_model.pkl')

Out[106]:

['Logistic_model.pkl']

In [107]:

# To predict the model output, in term of 1-0
train_prediction_1_0 = Logistic_model.predict(X_train)
train_prediction_1_0[1:10]

Out[107]:

array([1, 0, 1, 0, 1, 1, 0, 0, 1], dtype=int64)

In [108]:

# To predict the model output, in term of probability
train_prediction_p = Logistic_model.predict_proba(X_train)
train_prediction_p[1:10]
# Above gives probability of 0 and 1 respectively

Out[108]:

array([[0.39238269, 0.60761731],
       [0.62347228, 0.37652772],
       [0.15249208, 0.84750792],
       [0.97047798, 0.02952202],
       [0.26806558, 0.73193442],
       [0.33167873, 0.66832127],
       [0.87780636, 0.12219364],
       [0.71050905, 0.28949095],
       [0.33153475, 0.66846525]])

In [109]:

# To see probability of 1:
train_prediction_1 = train_prediction_p[:,1]
train_prediction_1[1:10]

Out[109]:

array([0.60761731, 0.37652772, 0.84750792, 0.02952202, 0.73193442,
       0.66832127, 0.12219364, 0.28949095, 0.66846525])

In [110]:

# Accuracy of the model on training, test and validation data
print(Logistic_model.score(X_train, y_train)*100, Logistic_model.score(X_test, y_test)*100 )

80.73836276083468 76.86567164179104

In [111]:

# Another method to calculate the accuracy score
accuracy_score(y_train,Logistic_model.predict(X_train)), accuracy_score(y_test,Logistic_model.predict(X_test))

Out[111]:

(0.8073836276083467, 0.7686567164179104)

In [112]:

# Create the RFE (Recursive Feature Elimination) object and rank each feature
rfe = RFE(estimator=Logistic_M, n_features_to_select=1, step = 1)
rfe.fit(X_train, y_train)

ranking_df = pd.DataFrame()
ranking_df['Feature_name'] = X_train.columns
ranking_df['Rank'] = rfe.ranking_

ranked = ranking_df.sort_values(by=['Rank'])
ranked

Out[112]:

	Feature_name	Rank
1	Sex_female	1
0	Pclass	2
4	Fare_M2	3
3	age_encoded	4
2	Family_Scaled	5

In [113]:

# We have built a modular function for printing the variables coefficient in the model
def print_variables_coeff(model,X):
    df = pd.DataFrame(columns=['Variable','Coefficient'])
    intercept = round(model.intercept_[0],3)
    new_row = {'Variable'            : 'Intercept'
                   , 'Coefficient'        : intercept
                  }
    df = df.append(new_row, ignore_index=True)
    
    for i in range(len(X.columns)):
        varaible_name = X.columns[i]
        coefficient  = round(model.coef_[0,i],3)
        
        new_row = {'Variable'            : varaible_name
                   , 'Coefficient'        : coefficient
                  }
        df = df.append(new_row, ignore_index=True)
    return(df)        

In [114]:

print_variables_coeff(Logistic_model,X_train)

Out[114]:

	Variable	Coefficient
0	Intercept	1.863
1	Pclass	-1.007
2	Sex_female	2.671
3	Family_Scaled	-0.345
4	age_encoded	-0.510
5	Fare_M2	0.918

In [115]:

plt.figure(figsize=(9, 5), dpi=120, facecolor='w', edgecolor='b')
x = range(len(X_train.columns))
c = Logistic_model.coef_.reshape(-1)
plt.bar( x, c )
plt.xlabel( "Variables")
plt.ylabel('Coefficients')
plt.title('Coefficient plot')
plt.xticks(x, X_train.columns)

Out[115]:

([<matplotlib.axis.XTick at 0x251f1ccd8b0>,
  <matplotlib.axis.XTick at 0x251f1ccd910>,
  <matplotlib.axis.XTick at 0x251f1cd83d0>,
  <matplotlib.axis.XTick at 0x251f1d1b0a0>,
  <matplotlib.axis.XTick at 0x251f1d1b760>],
 [Text(0, 0, 'Pclass'),
  Text(1, 0, 'Sex_female'),
  Text(2, 0, 'Family_Scaled'),
  Text(3, 0, 'age_encoded'),
  Text(4, 0, 'Fare_M2')])

In [116]:

# Calculating F1-score:
# The F1-score is a widely used metric in machine learning and statistics that combines 
# both precision and recall into a single value. 
# The F1-score ranges between 0 and 1, where a higher value indicates better model 

print(f1_score(Logistic_model.predict(X_train), y_train)
      , f1_score(Logistic_model.predict(X_test), y_test)
      )

0.7413793103448277 0.651685393258427

In [117]:

from sklearn.metrics import confusion_matrix
cf = confusion_matrix(y_test, Logistic_model.predict(X_test))
print(cf)

[[74 13]
 [18 29]]

In [118]:

from sklearn.metrics import classification_report as rep
print(rep( y_test , Logistic_model.predict(X_test) ))

              precision    recall  f1-score   support

           0       0.80      0.85      0.83        87
           1       0.69      0.62      0.65        47

    accuracy                           0.77       134
   macro avg       0.75      0.73      0.74       134
weighted avg       0.76      0.77      0.77       134

In [119]:

# Calculating accuracy at particular probability cut-off level
model = LogisticRegression()
model.fit(X_train, y_train)
train_probs = model.predict_proba(X_train)[:, 1]
train_preds = (train_probs >= 0.1).astype(int)
train_preds
cm = confusion_matrix(y_train,train_preds)
train_accuracy = accuracy_score(y_train, train_preds)
train_accuracy

Out[119]:

0.49759229534510435

We have built a cool function that Iterates probability cut-off value from 0 to 1 by 0.1 and at every cut-off value calculates - precision, recall, f1 score, TPR, TNR, FPR, FNR, Accuracy, for train and test¶

In [120]:

def classification_table_AA(model,X, y):
    
    df = pd.DataFrame(columns=['Threshold','TP', 'FP','FN','TN','Accuracy','Precision','TPR_Recall','F1_Score'])
    # Create an array of probability thresholds from 0 to 1 by 0.1
    thresholds = np.arange(0, 1.1, 0.1)
    
    for threshold in thresholds:
          
        # Predict probabilities for train, test, and validation sets
        probabilities = model.predict_proba(X)[:, 1]
        
        # Convert probabilities to binary predictions based on the threshold
        predicted_values = (probabilities >= threshold).astype(int)

        # Compute evaluation metrics
        precision = precision_score(y, predicted_values)
        recall = recall_score(y, predicted_values)
        f1 = f1_score(y, predicted_values)
        confusion = confusion_matrix(y, predicted_values)
        
        # Compute additional metrics
        tpr = recall
        tnr = confusion[0, 0] / (confusion[0, 0] + confusion[0, 1])
        fpr = 1 - tnr
        fnr = 1 - tpr
        accuracy = accuracy_score(y, predicted_values)
        
        # Saving the information in a dataframe
        new_row = {'Threshold'    : threshold
                   , 'TP'         : confusion[0][0]
                   , 'FP'         : confusion[0][1]
                   , 'FN'         : confusion[1][0]
                   , 'TN'         : confusion[1][1]  
                   , 'Accuracy'   : accuracy
                   , 'Precision'  : precision
                   , 'TPR_Recall' :  recall
                   , 'F1_Score'   : f1
                  }
        df = df.append(new_row, ignore_index=True)
    return df

In [121]:

train_result = classification_table_AA(Logistic_model,X_train, y_train)
train_result

Out[121]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	384.0	0.0	239.0	0.383628	0.383628	1.000000	0.554524
1	0.1	80.0	304.0	9.0	230.0	0.497592	0.430712	0.962343	0.595084
2	0.2	242.0	142.0	36.0	203.0	0.714286	0.588406	0.849372	0.695205
3	0.3	291.0	93.0	41.0	198.0	0.784912	0.680412	0.828452	0.747170
4	0.4	315.0	69.0	56.0	183.0	0.799358	0.726190	0.765690	0.745418
5	0.5	331.0	53.0	67.0	172.0	0.807384	0.764444	0.719665	0.741379
6	0.6	352.0	32.0	85.0	154.0	0.812199	0.827957	0.644351	0.724706
7	0.7	376.0	8.0	130.0	109.0	0.778491	0.931624	0.456067	0.612360
8	0.8	379.0	5.0	150.0	89.0	0.751204	0.946809	0.372385	0.534535
9	0.9	381.0	3.0	201.0	38.0	0.672552	0.926829	0.158996	0.271429
10	1.0	384.0	0.0	239.0	0.0	0.616372	0.000000	0.000000	0.000000

How to read and interpret above table:¶

If data is imbalanced - We either check F1 Score (Higher is better), or we try to make the data balanced by methods such as under-sampling, over-sampling
Accuracy = (correctly predicted)/ total observations
Precision - Precision is important, when
Recall - This is also called TPR (True Positive rate)

We can decide the probability cut off value as per our need by analyzing the above table.

CM%20Explanation.jpg

In [122]:

test_result = classification_table_AA(Logistic_model,X_test, y_test)
test_result

Out[122]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	87.0	0.0	47.0	0.350746	0.350746	1.000000	0.519337
1	0.1	13.0	74.0	3.0	44.0	0.425373	0.372881	0.936170	0.533333
2	0.2	46.0	41.0	9.0	38.0	0.626866	0.481013	0.808511	0.603175
3	0.3	58.0	29.0	14.0	33.0	0.679104	0.532258	0.702128	0.605505
4	0.4	68.0	19.0	15.0	32.0	0.746269	0.627451	0.680851	0.653061
5	0.5	74.0	13.0	18.0	29.0	0.768657	0.690476	0.617021	0.651685
6	0.6	82.0	5.0	19.0	28.0	0.820896	0.848485	0.595745	0.700000
7	0.7	86.0	1.0	22.0	25.0	0.828358	0.961538	0.531915	0.684932
8	0.8	87.0	0.0	28.0	19.0	0.791045	1.000000	0.404255	0.575758
9	0.9	87.0	0.0	37.0	10.0	0.723881	1.000000	0.212766	0.350877
10	1.0	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000

In [123]:

# Ask Analytics designed a proprietary function for ploting ROC curve - it's our patent!

from sklearn.metrics import roc_curve
def ROC_Curve (model,y_train,X_train,y_test,X_test):
    probabilities_train = model.predict_proba(X_train)[:, 1]
    
    probabilities_test = model.predict_proba(X_test)[:, 1]

    fpr_train, tpr_train, _ = roc_curve(y_train, probabilities_train)
    auc_train = roc_auc_score(y_train, probabilities_train)
    
    fpr_test, tpr_test, _ = roc_curve(y_test, probabilities_test)
    auc_test = roc_auc_score(y_test, probabilities_test)
    
    plt.figure(figsize=(9,5)) 
    plt.plot(fpr_train,tpr_train,label = "Train AUC-ROC="+str(auc_train)) 
    plt.plot(fpr_test,tpr_test,label = "Test AUC-ROC="+str(auc_test)) 
    x = np.linspace(0, 1, 1000)
    plt.plot(x, x, linestyle='-')
    plt.xlabel('False Positive Rate') 
    plt.ylabel('True Positive Rate') 
    plt.legend(loc=4) 
    plt.show()

In [124]:

# You can plot the ROC curve for any model
ROC_Curve (Logistic_model,y_train,X_train,y_test,X_test)

Perform Hyper-parameter Tuning on Logistic Regression¶

As a matter of practice, we first try using multiple methods of machine learning. Once we finalize the model, then we do hyper-parameter tuning on the final model. Here for demo purpose, we have done it on all models.

In [125]:

# Function to print the cross validation output in readable format

def CV_Print(results):
    df = pd.DataFrame(columns=['Params','Means','Stds'])
    
    x = results.best_params_
    
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        new_row = {'Params'      : params
                   , 'Means'         : mean
                   , 'Stds'        : std
                  }
        df = df.append(new_row, ignore_index=True)
    return df

In [126]:

parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 20, 30,40, 50, 100, 1000, 10000]}
cv_logistic = GridSearchCV(model, parameters, cv = 5)
cv_logistic.fit(X_train, y_train.values.ravel())

CV_Print(cv_logistic)

Out[126]:

	Params	Means	Stds
0	{'C': 0.001}	0.616374	0.002591
1	{'C': 0.01}	0.723884	0.011665
2	{'C': 0.1}	0.800865	0.030033
3	{'C': 1}	0.800852	0.031379
4	{'C': 10}	0.800865	0.026854
5	{'C': 20}	0.800865	0.026854
6	{'C': 30}	0.800865	0.026854
7	{'C': 40}	0.802477	0.024361
8	{'C': 50}	0.802477	0.024361
9	{'C': 100}	0.802477	0.024361
10	{'C': 1000}	0.802477	0.024361
11	{'C': 10000}	0.802477	0.024361

Grid search within K-fold cross-validation is used in order to find the optimal hyper parameter settings for logistic regression that generates the best model on our data.

In [127]:

cv_logistic.best_estimator_

Out[127]:

LogisticRegression(C=40)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [128]:

Logistic_model_cv = cv_logistic.fit(X_train, y_train.values.ravel())

In [129]:

Logistic_model_cv.score(X_train, y_train), Logistic_model_cv.score(X_test, y_test)

Out[129]:

(0.8073836276083467, 0.753731343283582)

In [130]:

# Let's use the function we built already to check the variables coefficients
# In CV model, we just need to add "best_estimator_" after model name in this

print_variables_coeff(Logistic_model_cv.best_estimator_,X_train)

Out[130]:

	Variable	Coefficient
0	Intercept	1.696
1	Pclass	-0.970
2	Sex_female	2.820
3	Family_Scaled	-0.410
4	age_encoded	-0.536
5	Fare_M2	2.240

In [131]:

train_result = classification_table_AA(Logistic_model_cv,X_train, y_train)
train_result

Out[131]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	384.0	0.0	239.0	0.383628	0.383628	1.000000	0.554524
1	0.1	93.0	291.0	12.0	227.0	0.513644	0.438224	0.949791	0.599736
2	0.2	243.0	141.0	36.0	203.0	0.715891	0.590116	0.849372	0.696398
3	0.3	293.0	91.0	44.0	195.0	0.783307	0.681818	0.815900	0.742857
4	0.4	314.0	70.0	56.0	183.0	0.797753	0.723320	0.765690	0.743902
5	0.5	330.0	54.0	66.0	173.0	0.807384	0.762115	0.723849	0.742489
6	0.6	350.0	34.0	84.0	155.0	0.810594	0.820106	0.648536	0.724299
7	0.7	373.0	11.0	125.0	114.0	0.781701	0.912000	0.476987	0.626374
8	0.8	379.0	5.0	149.0	90.0	0.752809	0.947368	0.376569	0.538922
9	0.9	381.0	3.0	190.0	49.0	0.690209	0.942308	0.205021	0.336770
10	1.0	384.0	0.0	239.0	0.0	0.616372	0.000000	0.000000	0.000000

In [132]:

# You can plot the ROC curve at any probability cut-off
ROC_Curve (Logistic_model_cv,y_train,X_train,y_test,X_test)

In [133]:

train_probs = Logistic_model_cv.predict_proba(X_train)[:, 1]
train_probs[1:10]

Out[133]:

array([0.83039557, 0.39157563, 0.8564838 , 0.02282676, 0.75015313,
       0.68953242, 0.1166457 , 0.27666506, 0.68987168])

Decision Tree and Prune the tree with Hyperparameter Tunning¶

In [134]:

from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier(random_state = 1234)

In [135]:

DT_model = DT.fit(X_train, y_train)

In [136]:

DT_model.score(X_train, y_train), DT_model.score(X_test, y_test)

Out[136]:

(0.9454253611556982, 0.8208955223880597)

In [137]:

# Let's visualize the Tree Model, it's always relaxing to watch natural scenery

In [138]:

plt.figure(figsize = (50,40))
text_params = {'fontsize': 10} 
tree.plot_tree(DT_model, 
               feature_names = list(X_train.columns), 
               class_names = ['0','1'],
               filled = True);

In [139]:

plt.figure(figsize = (15,15))
text_params = {'fontsize': 12} 
tree.plot_tree(DT_model, 
               feature_names = list(X_train.columns), 
               class_names = ['0','1'],
               filled = True
              , max_depth = 2);

In [140]:

# Check the variables which are coming as most important variables in the model
importance = DT_model.feature_importances_
feature_importance = pd.Series(importance, index = X_train.columns)
feature_importance = feature_importance.sort_values(ascending= [False])
feature_importance.plot(kind = 'bar')
plt.ylabel('Importance')

Out[140]:

Text(0, 0.5, 'Importance')

Prune the classification tree with Hyperparameter Tunning¶

In [141]:

grid = {'max_depth': [2, 3, 4, 5],
         'min_samples_split': [2, 3, 4],
         'min_samples_leaf': range(5, 10)}

In [142]:

from sklearn.model_selection import GridSearchCV
classifier = DecisionTreeClassifier(random_state = 1234)
DT_CV = GridSearchCV(estimator = classifier, param_grid = grid)
DT_CV.fit(X_train, y_train)

Out[142]:

GridSearchCV(estimator=DecisionTreeClassifier(random_state=1234),
             param_grid={'max_depth': [2, 3, 4, 5],
                         'min_samples_leaf': range(5, 10),
                         'min_samples_split': [2, 3, 4]})

In [143]:

DT_CV.best_estimator_

Out[143]:

DecisionTreeClassifier(max_depth=5, min_samples_leaf=8, random_state=1234)

In [144]:

DT_model_CV = DT_CV.fit(X_train, y_train.values.ravel())

In [145]:

DT_model_CV.score(X_train, y_train), DT_model_CV.score(X_test, y_test)

Out[145]:

(0.8491171749598716, 0.8059701492537313)

In [146]:

train_result = classification_table_AA(DT_model_CV,X_train, y_train)
train_result

Out[146]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	384.0	0.0	239.0	0.383628	0.383628	1.000000	0.554524
1	0.1	105.0	279.0	4.0	235.0	0.545746	0.457198	0.983264	0.624170
2	0.2	303.0	81.0	31.0	208.0	0.820225	0.719723	0.870293	0.787879
3	0.3	315.0	69.0	35.0	204.0	0.833066	0.747253	0.853556	0.796875
4	0.4	315.0	69.0	35.0	204.0	0.833066	0.747253	0.853556	0.796875
5	0.5	345.0	39.0	55.0	184.0	0.849117	0.825112	0.769874	0.796537
6	0.6	345.0	39.0	55.0	184.0	0.849117	0.825112	0.769874	0.796537
7	0.7	369.0	15.0	97.0	142.0	0.820225	0.904459	0.594142	0.717172
8	0.8	373.0	11.0	107.0	132.0	0.810594	0.923077	0.552301	0.691099
9	0.9	383.0	1.0	155.0	84.0	0.749599	0.988235	0.351464	0.518519
10	1.0	384.0	0.0	168.0	71.0	0.730337	1.000000	0.297071	0.458065

In [147]:

# You can plot the ROC curve at any probability cut-off
ROC_Curve (DT_model_CV,y_train,X_train,y_test,X_test)

Random Forest Model and perform Hyperparameter Tunning¶

In [148]:

from sklearn.ensemble import RandomForestClassifier
# let's build a forst with 500 trees
RF_model = RandomForestClassifier(n_estimators = 500, random_state = 42)
# Train the model on training data
RF_model.fit(X_train, y_train);

In [149]:

train_RF = RF_model.predict(X_train)
train_RF[1:10]

Out[149]:

array([1, 0, 1, 0, 1, 0, 0, 0, 0], dtype=int64)

In [150]:

train_probs_RF = RF_model.predict_proba(X_train)[:,1]
train_probs_RF[1:10]

Out[150]:

array([0.77      , 0.098     , 0.89440238, 0.041     , 0.976     ,
       0.36282056, 0.11171319, 0.01733333, 0.1383619 ])

In [151]:

train_result = classification_table_AA(RF_model,X_train, y_train)
train_result

Out[151]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	384.0	0.0	239.0	0.383628	0.383628	1.000000	0.554524
1	0.1	264.0	120.0	3.0	236.0	0.802568	0.662921	0.987448	0.793277
2	0.2	306.0	78.0	6.0	233.0	0.865169	0.749196	0.974895	0.847273
3	0.3	345.0	39.0	13.0	226.0	0.916533	0.852830	0.945607	0.896825
4	0.4	369.0	15.0	19.0	220.0	0.945425	0.936170	0.920502	0.928270
5	0.5	372.0	12.0	22.0	217.0	0.945425	0.947598	0.907950	0.927350
6	0.6	376.0	8.0	26.0	213.0	0.945425	0.963801	0.891213	0.926087
7	0.7	382.0	2.0	51.0	188.0	0.914928	0.989474	0.786611	0.876457
8	0.8	382.0	2.0	69.0	170.0	0.886035	0.988372	0.711297	0.827251
9	0.9	384.0	0.0	105.0	134.0	0.831461	1.000000	0.560669	0.718499
10	1.0	384.0	0.0	210.0	29.0	0.662921	1.000000	0.121339	0.216418

In [152]:

train_result = classification_table_AA(RF_model,X_test, y_test)
train_result

Out[152]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	87.0	0.0	47.0	0.350746	0.350746	1.000000	0.519337
1	0.1	46.0	41.0	9.0	38.0	0.626866	0.481013	0.808511	0.603175
2	0.2	59.0	28.0	10.0	37.0	0.716418	0.569231	0.787234	0.660714
3	0.3	68.0	19.0	11.0	36.0	0.776119	0.654545	0.765957	0.705882
4	0.4	74.0	13.0	11.0	36.0	0.820896	0.734694	0.765957	0.750000
5	0.5	76.0	11.0	11.0	36.0	0.835821	0.765957	0.765957	0.765957
6	0.6	79.0	8.0	16.0	31.0	0.820896	0.794872	0.659574	0.720930
7	0.7	81.0	6.0	18.0	29.0	0.820896	0.828571	0.617021	0.707317
8	0.8	83.0	4.0	21.0	26.0	0.813433	0.866667	0.553191	0.675325
9	0.9	84.0	3.0	29.0	18.0	0.761194	0.857143	0.382979	0.529412
10	1.0	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000

In [153]:

# Get numerical feature importances
def feature_imp_RF(model,x):
    df = pd.DataFrame(columns=['Variable','Importance'])
    importances = list(model.feature_importances_)
    col_list = list(x.columns)
    
    for i in range(len(col_list)):
        new_row = {'Variable'            : col_list[i]
                   , 'Importance'        : importances[i]
                  }

        df = df.append(new_row, ignore_index=True)
    return df    

In [154]:

X = feature_imp_RF(RF_model,X_train)
X

Out[154]:

	Variable	Importance
0	Pclass	0.093017
1	Sex_female	0.308676
2	Family_Scaled	0.095520
3	age_encoded	0.091763
4	Fare_M2	0.411024

In [155]:

# Let's plot ROC curve for RF Model
ROC_Curve(RF_model,y_train,X_train,y_test,X_test)

This is a classical case of over-fitting model

Model is fitting on train data quite pefectly, but not as good on test data
Here "Bias" is low, but "Variance" is high

In [156]:

RF_CV = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 10, 25, 50, 250, 500],
    'max_depth': [2, 4, 8, 16, 32, None]
}

RF_CV = GridSearchCV(RF_CV, parameters, cv = 5)
RF_CV.fit(X_train, y_train.values.ravel())

CV_Print(RF_CV)

Out[156]:

	Params	Means	Stds
0	{'max_depth': 2, 'n_estimators': 5}	0.775277	0.024315
1	{'max_depth': 2, 'n_estimators': 10}	0.796090	0.017231
2	{'max_depth': 2, 'n_estimators': 25}	0.789703	0.019460
3	{'max_depth': 2, 'n_estimators': 50}	0.780103	0.010678
4	{'max_depth': 2, 'n_estimators': 250}	0.781639	0.019288
5	{'max_depth': 2, 'n_estimators': 500}	0.780026	0.019526
6	{'max_depth': 4, 'n_estimators': 5}	0.800955	0.008008
7	{'max_depth': 4, 'n_estimators': 10}	0.808968	0.006605
8	{'max_depth': 4, 'n_estimators': 25}	0.821832	0.011771
9	{'max_depth': 4, 'n_estimators': 50}	0.807355	0.017104
10	{'max_depth': 4, 'n_estimators': 250}	0.813819	0.013578
11	{'max_depth': 4, 'n_estimators': 500}	0.820206	0.014170
12	{'max_depth': 8, 'n_estimators': 5}	0.829819	0.019591
13	{'max_depth': 8, 'n_estimators': 10}	0.828271	0.011673
14	{'max_depth': 8, 'n_estimators': 25}	0.841123	0.013446
15	{'max_depth': 8, 'n_estimators': 50}	0.834684	0.016448
16	{'max_depth': 8, 'n_estimators': 250}	0.845935	0.013509
17	{'max_depth': 8, 'n_estimators': 500}	0.844310	0.009476
18	{'max_depth': 16, 'n_estimators': 5}	0.800929	0.020204
19	{'max_depth': 16, 'n_estimators': 10}	0.802529	0.016170
20	{'max_depth': 16, 'n_estimators': 25}	0.818619	0.003878
21	{'max_depth': 16, 'n_estimators': 50}	0.820206	0.009919
22	{'max_depth': 16, 'n_estimators': 250}	0.829858	0.011720
23	{'max_depth': 16, 'n_estimators': 500}	0.829819	0.017474
24	{'max_depth': 32, 'n_estimators': 5}	0.808981	0.017170
25	{'max_depth': 32, 'n_estimators': 10}	0.820245	0.027074
26	{'max_depth': 32, 'n_estimators': 25}	0.813794	0.012971
27	{'max_depth': 32, 'n_estimators': 50}	0.818606	0.019444
28	{'max_depth': 32, 'n_estimators': 250}	0.831458	0.005103
29	{'max_depth': 32, 'n_estimators': 500}	0.833071	0.009221
30	{'max_depth': None, 'n_estimators': 5}	0.805794	0.011494
31	{'max_depth': None, 'n_estimators': 10}	0.815419	0.014198
32	{'max_depth': None, 'n_estimators': 25}	0.815355	0.032781
33	{'max_depth': None, 'n_estimators': 50}	0.823419	0.013581
34	{'max_depth': None, 'n_estimators': 250}	0.828219	0.014331
35	{'max_depth': None, 'n_estimators': 500}	0.831445	0.009045

In [157]:

RF_CV.best_estimator_

Out[157]:

RandomForestClassifier(max_depth=8, n_estimators=250)

In [158]:

RF_model_CV = RF_CV.fit(X_train, y_train.values.ravel())

In [159]:

RF_model_CV.score(X_train, y_train), RF_model_CV.score(X_test, y_test)

Out[159]:

(0.913322632423756, 0.7985074626865671)

In [160]:

train_result = classification_table_AA(RF_model_CV,X_test, y_test)
train_result

Out[160]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	87.0	0.0	47.0	0.350746	0.350746	1.000000	0.519337
1	0.1	37.0	50.0	7.0	40.0	0.574627	0.444444	0.851064	0.583942
2	0.2	63.0	24.0	9.0	38.0	0.753731	0.612903	0.808511	0.697248
3	0.3	68.0	19.0	13.0	34.0	0.761194	0.641509	0.723404	0.680000
4	0.4	72.0	15.0	13.0	34.0	0.791045	0.693878	0.723404	0.708333
5	0.5	76.0	11.0	16.0	31.0	0.798507	0.738095	0.659574	0.696629
6	0.6	80.0	7.0	16.0	31.0	0.828358	0.815789	0.659574	0.729412
7	0.7	84.0	3.0	18.0	29.0	0.843284	0.906250	0.617021	0.734177
8	0.8	87.0	0.0	21.0	26.0	0.843284	1.000000	0.553191	0.712329
9	0.9	87.0	0.0	28.0	19.0	0.791045	1.000000	0.404255	0.575758
10	1.0	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000

In [161]:

# Let's plot ROC curve for RF Model
ROC_Curve(RF_model_CV,y_train,X_train,y_test,X_test)

Plotting the 2D and 3D contour to visualize - This is jazzy, but not very useful¶

In [162]:

grid_results = pd.concat([pd.DataFrame(RF_CV.cv_results_["params"])
                        , pd.DataFrame(RF_CV.cv_results_["mean_test_score"]
                        , columns=["Accuracy"])],axis=1)
grid_results.head()

Out[162]:

	max_depth	n_estimators	Accuracy
0	2.0	5	0.762477
1	2.0	10	0.794542
2	2.0	25	0.788026
3	2.0	50	0.776865
4	2.0	250	0.773613

In [163]:

grid_contour = grid_results.groupby(['max_depth','n_estimators']).mean()
grid_contour.head()

Out[163]:

		Accuracy
max_depth	n_estimators
2.0	5	0.762477
	10	0.794542
	25	0.788026
	50	0.776865
	250	0.773613

In [164]:

### Pivoting the data:
grid_reset = grid_contour.reset_index()
grid_reset.columns = ['max_depth', 'n_estimators', 'Accuracy']
grid_pivot = grid_reset.pivot('max_depth', 'n_estimators')
grid_pivot

Out[164]:

	Accuracy
n_estimators	5	10	25	50	250	500
max_depth
2.0	0.762477	0.794542	0.788026	0.776865	0.773613	0.778465
4.0	0.808955	0.815342	0.821819	0.813794	0.808994	0.823432
8.0	0.825045	0.831471	0.841097	0.833071	0.841110	0.839497
16.0	0.808968	0.812116	0.820194	0.818594	0.831445	0.831419
32.0	0.813781	0.816994	0.821819	0.821871	0.831458	0.825006

In [165]:

x = grid_pivot.columns.levels[1].values
y = grid_pivot.index.values
z = grid_pivot.values

In [166]:

import plotly.graph_objects as gb

layout = gb.Layout(
            xaxis=gb.layout.XAxis(
              title=gb.layout.xaxis.Title(
              text='n_estimators')
             ),
             yaxis=gb.layout.YAxis(
              title=gb.layout.yaxis.Title(
              text='max_depth') 
            ) )

fig = gb.Figure(data = [gb.Contour(z=z, x=x, y=y)], layout=layout )

fig.update_layout(title='Graph showing n_estimators v/s max_depth', autosize=False,
                  width=700, height=700,
                  margin=dict(l=65, r=50, b=65, t=90))

fig.show()

In [167]:

fig = gb.Figure(data= [gb.Surface(z=z, y=y, x=x)], layout=layout )
fig.update_layout(title='Hyperparameter Tuning',
                  scene = dict(
                    xaxis_title='n_estimators',
                    yaxis_title='max_depth',
                    zaxis_title='Accuracy'),
                  autosize=False,
                  width=800, height=800,
                  margin=dict(l=65, r=50, b=65, t=90))
fig.show()

Extreme Gradient Boosting (XGBoost) Model and perform Hyperparameter Tunning¶

In [168]:

# conda install -c conda-forge xgboost
# !pip install xgboost

In [169]:

from xgboost import XGBClassifier

In [170]:

XGB_cl = XGBClassifier()

In [173]:

model_XGB = XGB_cl.fit(X_train, y_train)

In [174]:

model_XGB.score(X_train, y_train), model_XGB.score(X_test, y_test)

Out[174]:

(0.9309791332263242, 0.8208955223880597)

In [175]:

preds = model_XGB.predict(X_test)
preds[1:10]

Out[175]:

array([0, 0, 0, 1, 0, 1, 0, 0, 1])

In [176]:

preds = model_XGB.predict_proba(X_test)[:,1]
preds[1:10]

Out[176]:

array([8.8950247e-03, 4.6509650e-01, 1.9177219e-01, 9.9429333e-01,
       5.5087014e-04, 9.8206317e-01, 4.4684451e-02, 1.3533361e-01,
       9.2851740e-01], dtype=float32)

In [177]:

train_result = classification_table_AA(model_XGB,X_train, y_train)
train_result

Out[177]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	384.0	0.0	239.0	0.383628	0.383628	1.000000	0.554524
1	0.1	255.0	129.0	5.0	234.0	0.784912	0.644628	0.979079	0.777409
2	0.2	320.0	64.0	14.0	225.0	0.874799	0.778547	0.941423	0.852273
3	0.3	352.0	32.0	19.0	220.0	0.918138	0.873016	0.920502	0.896130
4	0.4	363.0	21.0	25.0	214.0	0.926164	0.910638	0.895397	0.902954
5	0.5	370.0	14.0	29.0	210.0	0.930979	0.937500	0.878661	0.907127
6	0.6	372.0	12.0	32.0	207.0	0.929374	0.945205	0.866109	0.903930
7	0.7	378.0	6.0	49.0	190.0	0.911717	0.969388	0.794979	0.873563
8	0.8	381.0	3.0	68.0	171.0	0.886035	0.982759	0.715481	0.828087
9	0.9	384.0	0.0	109.0	130.0	0.825040	1.000000	0.543933	0.704607
10	1.0	384.0	0.0	239.0	0.0	0.616372	0.000000	0.000000	0.000000

In [178]:

# Let's plot ROC curve for RF Model
ROC_Curve(model_XGB,y_train,X_train,y_test,X_test)

In [179]:

importance = model_XGB.feature_importances_
feature_importance = pd.Series(importance, index = X_train.columns)
feature_importance = feature_importance.sort_values(ascending = False)
feature_importance.plot(kind = 'bar')
plt.ylabel('Importance');

n_estimators: Total number of trees [Dealut value is 100]
learning_rate:This determines the impact of each tree on the final outcome [Default value is 0.1, range 0 to 1]
random_state: The random number seed so that same random numbers are generated every time
max_depth: Maximum depth to which tree can grow (stopping criteria) [Default value is 6, more the value more overfitting it may lead to]
subsample: The fraction of observations to be selected for each tree. Selection is done by random sampling [Default is 1, if we set it to 0.5, XGBoost would randomly sample 50% of training data prior to growing trees]
objective: Defines Loss function (binary:logistic is for classification using probability, reg:logistic is for classification, reg:linear is for regression)
colsample_bylevel: Random feature selection at levels [Default is 1 i.e. all varaibles, we can define 0 to 1]
colsample_bytree: Subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed [Default is 1 i.e. all varaibles, we can define 0 to 1]

Gradient Descent tries to bring a balance between Bias and variance (minimize both) and brings the model to global minima of error.

In [180]:

XGB_model_CV = XGBClassifier()
parameters = {
    'n_estimators': [5, 10, 25, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None],
    'colsample_bytree' : [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6] ,
    'objective' : ['binary:logistic']
}

XGB_model_CV = GridSearchCV(XGB_model_CV, parameters, cv=5)
XGB_model_CV.fit(X_train, y_train.values.ravel())

CV_Print(XGB_model_CV)

Out[180]:

	Params	Means	Stds
0	{'colsample_bytree': 0.1, 'max_depth': 2, 'n_e...	0.703097	0.033965
1	{'colsample_bytree': 0.1, 'max_depth': 2, 'n_e...	0.714297	0.026990
2	{'colsample_bytree': 0.1, 'max_depth': 2, 'n_e...	0.786568	0.021365
3	{'colsample_bytree': 0.1, 'max_depth': 2, 'n_e...	0.804155	0.012322
4	{'colsample_bytree': 0.1, 'max_depth': 2, 'n_e...	0.824994	0.019001
...	...	...	...
265	{'colsample_bytree': 0.6, 'max_depth': None, '...	0.796194	0.013296
266	{'colsample_bytree': 0.6, 'max_depth': None, '...	0.812206	0.011853
267	{'colsample_bytree': 0.6, 'max_depth': None, '...	0.831445	0.012594
268	{'colsample_bytree': 0.6, 'max_depth': None, '...	0.821781	0.013500
269	{'colsample_bytree': 0.6, 'max_depth': None, '...	0.815368	0.014855

270 rows × 3 columns

In [181]:

XGB_model_CV.best_estimator_

Out[181]:

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.4, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=8, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=25, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)

In [182]:

XGB_model_CV.score(X_train, y_train), XGB_model_CV.score(X_test, y_test)

Out[182]:

(0.8940609951845907, 0.8208955223880597)

In [183]:

train_result = classification_table_AA(XGB_model_CV,X_train, y_train)
train_result

Out[183]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	384.0	0.0	239.0	0.383628	0.383628	1.000000	0.554524
1	0.1	150.0	234.0	3.0	236.0	0.619583	0.502128	0.987448	0.665726
2	0.2	278.0	106.0	17.0	222.0	0.802568	0.676829	0.928870	0.783069
3	0.3	331.0	53.0	30.0	209.0	0.866774	0.797710	0.874477	0.834331
4	0.4	345.0	39.0	33.0	206.0	0.884430	0.840816	0.861925	0.851240
5	0.5	355.0	29.0	37.0	202.0	0.894061	0.874459	0.845188	0.859574
6	0.6	363.0	21.0	56.0	183.0	0.876404	0.897059	0.765690	0.826185
7	0.7	375.0	9.0	84.0	155.0	0.850722	0.945122	0.648536	0.769231
8	0.8	382.0	2.0	123.0	116.0	0.799358	0.983051	0.485356	0.649860
9	0.9	384.0	0.0	173.0	66.0	0.722311	1.000000	0.276151	0.432787
10	1.0	384.0	0.0	239.0	0.0	0.616372	0.000000	0.000000	0.000000

In [184]:

test_result = classification_table_AA(XGB_model_CV,X_test, y_test)
test_result

Out[184]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	87.0	0.0	47.0	0.350746	0.350746	1.000000	0.519337
1	0.1	27.0	60.0	5.0	42.0	0.514925	0.411765	0.893617	0.563758
2	0.2	57.0	30.0	8.0	39.0	0.716418	0.565217	0.829787	0.672414
3	0.3	66.0	21.0	12.0	35.0	0.753731	0.625000	0.744681	0.679612
4	0.4	74.0	13.0	14.0	33.0	0.798507	0.717391	0.702128	0.709677
5	0.5	77.0	10.0	14.0	33.0	0.820896	0.767442	0.702128	0.733333
6	0.6	80.0	7.0	17.0	30.0	0.820896	0.810811	0.638298	0.714286
7	0.7	80.0	7.0	21.0	26.0	0.791045	0.787879	0.553191	0.650000
8	0.8	85.0	2.0	27.0	20.0	0.783582	0.909091	0.425532	0.579710
9	0.9	87.0	0.0	39.0	8.0	0.708955	1.000000	0.170213	0.290909
10	1.0	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000

In [185]:

ROC_Curve(XGB_model_CV,y_train,X_train,y_test,X_test)

Support Vector Machine (SVM) and perform Hyperparameter Tunning¶¶

In [186]:

from sklearn.svm import SVC
SVM = SVC()
#SVM.probability=True  # Not recommended, as it makes it extremely slow

In [187]:

SVM_model = SVM.fit(X_train, y_train)

In [188]:

SVM_model.score(X_train, y_train), SVM_model.score(X_test, y_test)

Out[188]:

(0.8170144462279294, 0.8432835820895522)

In [189]:

SVM = SVC()  # probability=True
parameters = {
    'kernel': ['linear', 'rbf'],   #can also add 'sigmoid','poly'
    'C': [0.01, 0.1, 1, 10]
            }

SVM_CV = GridSearchCV(SVM, parameters, cv=5)
SVM_CV.fit(X_train, y_train.values.ravel())

CV_Print(SVM_CV)

Out[189]:

	Params	Means	Stds
0	{'C': 0.01, 'kernel': 'linear'}	0.754465	0.022440
1	{'C': 0.01, 'kernel': 'rbf'}	0.616374	0.002591
2	{'C': 0.1, 'kernel': 'linear'}	0.791265	0.018372
3	{'C': 0.1, 'kernel': 'rbf'}	0.804116	0.015619
4	{'C': 1, 'kernel': 'linear'}	0.791265	0.018372
5	{'C': 1, 'kernel': 'rbf'}	0.808955	0.011228
6	{'C': 10, 'kernel': 'linear'}	0.791265	0.018372
7	{'C': 10, 'kernel': 'rbf'}	0.818568	0.015521

In [190]:

SVM_CV.best_estimator_

Out[190]:

SVC(C=10)

In [191]:

SVM_model_CV = SVM_CV.fit(X_train, y_train.values.ravel())

In [192]:

SVM_model_CV.score(X_train, y_train), SVM_model_CV.score(X_test, y_test)

Out[192]:

(0.8250401284109149, 0.8283582089552238)

In [193]:

preds = SVM_model_CV.predict(X_test)
preds[1:10]

Out[193]:

array([0, 0, 0, 1, 0, 1, 0, 0, 1], dtype=int64)

K-Nearest Neighbors and perform Hyperparameter Tunning¶¶

In [194]:

from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier object
knn = KNeighborsClassifier()

model_KNN = knn.fit(X_train, y_train)

model_KNN.score(X_train, y_train), model_KNN.score(X_test, y_test)

Out[194]:

(0.8635634028892456, 0.7910447761194029)

In [195]:

preds = model_KNN.predict(X_test)
preds[1:10]

Out[195]:

array([0, 0, 1, 1, 0, 1, 0, 0, 1], dtype=int64)

In [196]:

preds = model_KNN.predict_proba(X_test)[:,1]
preds[1:10]

Out[196]:

array([0. , 0.4, 0.6, 1. , 0. , 1. , 0.2, 0. , 0.8])

In [197]:

train_result = classification_table_AA(model_KNN,X_train, y_train)
train_result

Out[197]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	384.0	0.0	239.0	0.383628	0.383628	1.000000	0.554524
1	0.1	214.0	170.0	6.0	233.0	0.717496	0.578164	0.974895	0.725857
2	0.2	214.0	170.0	6.0	233.0	0.717496	0.578164	0.974895	0.725857
3	0.3	295.0	89.0	22.0	217.0	0.821830	0.709150	0.907950	0.796330
4	0.4	295.0	89.0	22.0	217.0	0.821830	0.709150	0.907950	0.796330
5	0.5	350.0	34.0	51.0	188.0	0.863563	0.846847	0.786611	0.815618
6	0.6	370.0	14.0	79.0	160.0	0.850722	0.919540	0.669456	0.774818
7	0.7	370.0	14.0	79.0	160.0	0.850722	0.919540	0.669456	0.774818
8	0.8	370.0	14.0	79.0	160.0	0.850722	0.919540	0.669456	0.774818
9	0.9	382.0	2.0	135.0	104.0	0.780096	0.981132	0.435146	0.602899
10	1.0	382.0	2.0	135.0	104.0	0.780096	0.981132	0.435146	0.602899

In [198]:

# Let's plot ROC curve for KNN Model
ROC_Curve(model_KNN,y_train,X_train,y_test,X_test)

In [199]:

KNN_model_CV = KNeighborsClassifier()
parameters = {
    'n_neighbors': [3,5,7,9,11],
    'weights': ['uniform', 'distance'],
    'metric' : ['minkowski','euclidean','manhattan']
}

KNN_model_CV = GridSearchCV(KNN_model_CV, parameters, cv=5)
KNN_model_CV.fit(X_train, y_train.values.ravel())

CV_Print(KNN_model_CV)

Out[199]:

	Params	Means	Stds
0	{'metric': 'minkowski', 'n_neighbors': 3, 'wei...	0.804168	0.016600
1	{'metric': 'minkowski', 'n_neighbors': 3, 'wei...	0.786439	0.022656
2	{'metric': 'minkowski', 'n_neighbors': 5, 'wei...	0.823497	0.022813
3	{'metric': 'minkowski', 'n_neighbors': 5, 'wei...	0.797716	0.017589
4	{'metric': 'minkowski', 'n_neighbors': 7, 'wei...	0.817058	0.023160
5	{'metric': 'minkowski', 'n_neighbors': 7, 'wei...	0.797703	0.016182
6	{'metric': 'minkowski', 'n_neighbors': 9, 'wei...	0.823419	0.023946
7	{'metric': 'minkowski', 'n_neighbors': 9, 'wei...	0.797703	0.016954
8	{'metric': 'minkowski', 'n_neighbors': 11, 'we...	0.821781	0.019055
9	{'metric': 'minkowski', 'n_neighbors': 11, 'we...	0.797703	0.016954
10	{'metric': 'euclidean', 'n_neighbors': 3, 'wei...	0.804168	0.016600
11	{'metric': 'euclidean', 'n_neighbors': 3, 'wei...	0.786439	0.022656
12	{'metric': 'euclidean', 'n_neighbors': 5, 'wei...	0.823497	0.022813
13	{'metric': 'euclidean', 'n_neighbors': 5, 'wei...	0.797716	0.017589
14	{'metric': 'euclidean', 'n_neighbors': 7, 'wei...	0.817058	0.023160
15	{'metric': 'euclidean', 'n_neighbors': 7, 'wei...	0.797703	0.016182
16	{'metric': 'euclidean', 'n_neighbors': 9, 'wei...	0.823419	0.023946
17	{'metric': 'euclidean', 'n_neighbors': 9, 'wei...	0.797703	0.016954
18	{'metric': 'euclidean', 'n_neighbors': 11, 'we...	0.821781	0.019055
19	{'metric': 'euclidean', 'n_neighbors': 11, 'we...	0.797703	0.016954
20	{'metric': 'manhattan', 'n_neighbors': 3, 'wei...	0.807381	0.013470
21	{'metric': 'manhattan', 'n_neighbors': 3, 'wei...	0.789652	0.022518
22	{'metric': 'manhattan', 'n_neighbors': 5, 'wei...	0.826710	0.025280
23	{'metric': 'manhattan', 'n_neighbors': 5, 'wei...	0.800929	0.015996
24	{'metric': 'manhattan', 'n_neighbors': 7, 'wei...	0.818658	0.020997
25	{'metric': 'manhattan', 'n_neighbors': 7, 'wei...	0.797703	0.016182
26	{'metric': 'manhattan', 'n_neighbors': 9, 'wei...	0.826619	0.026361
27	{'metric': 'manhattan', 'n_neighbors': 9, 'wei...	0.799303	0.015932
28	{'metric': 'manhattan', 'n_neighbors': 11, 'we...	0.823381	0.024715
29	{'metric': 'manhattan', 'n_neighbors': 11, 'we...	0.800903	0.019857

In [200]:

KNN_model_CV.best_estimator_

Out[200]:

KNeighborsClassifier(metric='manhattan')

In [201]:

accuracy = KNN_model_CV.score(X_test, y_test)
print("Test Accuracy: ", accuracy)

Test Accuracy:  0.7910447761194029

In [202]:

train_result = classification_table_AA(KNN_model_CV,X_train, y_train)
train_result

Out[202]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	384.0	0.0	239.0	0.383628	0.383628	1.000000	0.554524
1	0.1	214.0	170.0	6.0	233.0	0.717496	0.578164	0.974895	0.725857
2	0.2	214.0	170.0	6.0	233.0	0.717496	0.578164	0.974895	0.725857
3	0.3	295.0	89.0	22.0	217.0	0.821830	0.709150	0.907950	0.796330
4	0.4	295.0	89.0	22.0	217.0	0.821830	0.709150	0.907950	0.796330
5	0.5	350.0	34.0	51.0	188.0	0.863563	0.846847	0.786611	0.815618
6	0.6	370.0	14.0	77.0	162.0	0.853933	0.920455	0.677824	0.780723
7	0.7	370.0	14.0	77.0	162.0	0.853933	0.920455	0.677824	0.780723
8	0.8	370.0	14.0	77.0	162.0	0.853933	0.920455	0.677824	0.780723
9	0.9	382.0	2.0	135.0	104.0	0.780096	0.981132	0.435146	0.602899
10	1.0	382.0	2.0	135.0	104.0	0.780096	0.981132	0.435146	0.602899

In [203]:

test_result = classification_table_AA(KNN_model_CV,X_test, y_test)
test_result

Out[203]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	87.0	0.0	47.0	0.350746	0.350746	1.000000	0.519337
1	0.1	44.0	43.0	8.0	39.0	0.619403	0.475610	0.829787	0.604651
2	0.2	44.0	43.0	8.0	39.0	0.619403	0.475610	0.829787	0.604651
3	0.3	62.0	25.0	8.0	39.0	0.753731	0.609375	0.829787	0.702703
4	0.4	62.0	25.0	8.0	39.0	0.753731	0.609375	0.829787	0.702703
5	0.5	72.0	15.0	13.0	34.0	0.791045	0.693878	0.723404	0.708333
6	0.6	80.0	7.0	18.0	29.0	0.813433	0.805556	0.617021	0.698795
7	0.7	80.0	7.0	18.0	29.0	0.813433	0.805556	0.617021	0.698795
8	0.8	80.0	7.0	18.0	29.0	0.813433	0.805556	0.617021	0.698795
9	0.9	86.0	1.0	26.0	21.0	0.798507	0.954545	0.446809	0.608696
10	1.0	86.0	1.0	26.0	21.0	0.798507	0.954545	0.446809	0.608696

In [204]:

ROC_Curve(KNN_model_CV,y_train,X_train,y_test,X_test)

Neural Network Model and perform Hyperparameter Tunning¶¶

In [ ]:

#!pip install tensorflow

In [205]:

import tensorflow as tf

In [206]:

from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [207]:

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [208]:

# Create a TensorFlow model
def create_model(units=16, activation='relu', optimizer='adam'):
    model = Sequential()
    model.add(Dense(units, activation=activation, input_dim=5))
    model.add(Dense(units, activation=activation))
    model.add(Dense(3, activation='softmax'))
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

model = Sequential(): This line creates a new sequential model object. The sequential model is a linear stack of layers, where you can add layers one by one.
model.add(Dense(units, activation=activation, input_dim=5)): This line adds a dense layer to the model. A dense layer is a fully connected layer, where each neuron is connected to all the neurons in the previous layer. The units parameter specifies the number of neurons in the layer. The activation parameter defines the activation function to be used in the layer. The input_dim parameter is used only for the first layer and specifies the input shape.
model.add(Dense(units, activation=activation)): This line adds another dense layer to the model. It has the same number of units as the previous layer and uses the same activation function.
model.add(Dense(3, activation='softmax')): This line adds the output layer to the model. It is a dense layer with 3 neurons, representing the number of classes in the classification problem. The activation function 'softmax' is used to compute the probability distribution over the classes.
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy']): This line compiles the model. It specifies the optimizer to be used during training, the loss function to optimize (sparse categorical cross-entropy for multi-class classification problems), and the metrics to evaluate during training and testing (in this case, accuracy).

In [209]:

NN_model = tf.keras.wrappers.scikit_learn.KerasClassifier(build_fn=create_model, epochs=10, verbose=0)

In [210]:

model_tf = NN_model.fit(X_train_scaled, y_train)

In [211]:

model_history = NN_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=150)

simple method to optimize the model¶

In [212]:

# summarize history for loss
plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [213]:

model_tf = NN_model.fit(X_train_scaled, y_train, epochs=18)

In [214]:

print(accuracy_score(NN_model.predict(X_train), y_train), "\n",
accuracy_score(NN_model.predict(X_test), y_test))

20/20 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 1ms/step
0.6211878009630819 
 0.6492537313432836

In [215]:

test_result = classification_table_AA(NN_model,X_test, y_test)
test_result

5/5 [==============================] - 0s 3ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 3ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step

Out[215]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	87.0	0.0	47.0	0.350746	0.350746	1.000000	0.519337
1	0.1	61.0	26.0	11.0	36.0	0.723881	0.580645	0.765957	0.660550
2	0.2	76.0	11.0	23.0	24.0	0.746269	0.685714	0.510638	0.585366
3	0.3	87.0	0.0	37.0	10.0	0.723881	1.000000	0.212766	0.350877
4	0.4	87.0	0.0	45.0	2.0	0.664179	1.000000	0.042553	0.081633
5	0.5	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
6	0.6	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
7	0.7	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
8	0.8	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
9	0.9	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
10	1.0	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000

In [216]:

ROC_Curve(NN_model,y_train,X_train_scaled,y_test,X_test_scaled)

20/20 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step

Hyperparamter Tuning of the Neural network Model¶

In [217]:

model.get_params().keys()

Out[217]:

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

In [218]:

parameters = {
    'units': [8, 16, 32],
    'activation': ['relu', 'sigmoid'],
    'optimizer': ['adam', 'rmsprop']
}

model_tf_CV = GridSearchCV(NN_model, parameters, cv=5)
model_tf_CV.fit(X_train_scaled, y_train.values.ravel())

CV_Print(model_tf_CV)

Out[218]:

	Params	Means	Stds
0	{'activation': 'relu', 'optimizer': 'adam', 'u...	0.739987	0.039263
1	{'activation': 'relu', 'optimizer': 'adam', 'u...	0.781729	0.042328
2	{'activation': 'relu', 'optimizer': 'adam', 'u...	0.810490	0.029660
3	{'activation': 'relu', 'optimizer': 'rmsprop',...	0.776929	0.038429
4	{'activation': 'relu', 'optimizer': 'rmsprop',...	0.805742	0.018218
5	{'activation': 'relu', 'optimizer': 'rmsprop',...	0.797716	0.026300
6	{'activation': 'sigmoid', 'optimizer': 'adam',...	0.616477	0.043670
7	{'activation': 'sigmoid', 'optimizer': 'adam',...	0.626103	0.047453
8	{'activation': 'sigmoid', 'optimizer': 'adam',...	0.711084	0.042661
9	{'activation': 'sigmoid', 'optimizer': 'rmspro...	0.616477	0.043670
10	{'activation': 'sigmoid', 'optimizer': 'rmspro...	0.621277	0.047170
11	{'activation': 'sigmoid', 'optimizer': 'rmspro...	0.736916	0.057874

In [219]:

# Print the best hyperparameters and the corresponding mean cross-validated score
print("Best Hyperparameters: ", model_tf_CV.best_params_)
print("Best Score: ", model_tf_CV.best_score_)

# Evaluate the model on the test set using the best hyperparameters
best_model = model_tf_CV.best_estimator_
print("Best_Model: ", best_model)

Best Hyperparameters:  {'activation': 'relu', 'optimizer': 'adam', 'units': 32}
Best Score:  0.8104903221130371
Best_Model:  <keras.wrappers.scikit_learn.KerasClassifier object at 0x000002518B6AFD60>

In [221]:

train_result = classification_table_AA(model_tf_CV,X_test, y_test)
train_result

5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 3ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 3ms/step
5/5 [==============================] - 0s 1ms/step
5/5 [==============================] - 0s 3ms/step

Out[221]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	87.0	0.0	47.0	0.350746	0.350746	1.000000	0.519337
1	0.1	67.0	20.0	18.0	29.0	0.716418	0.591837	0.617021	0.604167
2	0.2	87.0	0.0	24.0	23.0	0.820896	1.000000	0.489362	0.657143
3	0.3	87.0	0.0	36.0	11.0	0.731343	1.000000	0.234043	0.379310
4	0.4	87.0	0.0	45.0	2.0	0.664179	1.000000	0.042553	0.081633
5	0.5	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
6	0.6	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
7	0.7	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
8	0.8	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
9	0.9	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
10	1.0	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000

In [222]:

test_result = classification_table_AA(model_tf_CV,X_test, y_test)
test_result

5/5 [==============================] - 0s 3ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 3ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step

Out[222]:

	Threshold	TP	FP	FN	TN	Accuracy	Precision	TPR_Recall	F1_Score
0	0.0	0.0	87.0	0.0	47.0	0.350746	0.350746	1.000000	0.519337
1	0.1	67.0	20.0	18.0	29.0	0.716418	0.591837	0.617021	0.604167
2	0.2	87.0	0.0	24.0	23.0	0.820896	1.000000	0.489362	0.657143
3	0.3	87.0	0.0	36.0	11.0	0.731343	1.000000	0.234043	0.379310
4	0.4	87.0	0.0	45.0	2.0	0.664179	1.000000	0.042553	0.081633
5	0.5	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
6	0.6	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
7	0.7	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
8	0.8	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
9	0.9	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000
10	1.0	87.0	0.0	47.0	0.0	0.649254	0.000000	0.000000	0.000000

In [223]:

ROC_Curve(model_tf_CV,y_train,X_train_scaled,y_test,X_test_scaled)

20/20 [==============================] - 0s 2ms/step
5/5 [==============================] - 0s 2ms/step

Neural network model is not working properly here because of Insufficient data, it need high volume of data. Other cases, where it might not work well are:

Imbalanced classes: If you have imbalanced classes in your dataset, where the positive class is rare compared to the negative class, it might be challenging for the model to correctly predict the positive class. In such cases, the model might be biased towards predicting the negative class, leading to low precision for the positive class.

Insufficient training data: If you have limited training data or an insufficient representation of the positive class, the model might struggle to learn patterns specific to the positive class. This can result in lower precision when predicting the positive class, especially when the model's predictions are based on probabilities.

nappropriate probability threshold: The probability threshold you are using to calculate precision might not be optimal for your specific problem. By default, the threshold for classification is usually set at 0.5, but you can adjust this threshold to achieve a desired precision-recall trade-off based on your problem requirements.

Compare model results and Champion model selection¶

In this section, we will:

Evaluate all of our saved models on the validation set
Select the best model based on performance on the validation set
Evaluate that model on the holdout test set

In [224]:

import joblib
from time import time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [225]:

joblib.dump(Logistic_model_cv.best_estimator_, 'Logistic_model_cv.pkl')
joblib.dump(DT_model_CV.best_estimator_, 'DT_model_CV.pkl')
joblib.dump(RF_model_CV.best_estimator_, 'RF_model_CV.pkl')
joblib.dump(XGB_model_CV.best_estimator_, 'XGB_model_CV.pkl')
joblib.dump(SVM_model_CV.best_estimator_, 'SVM_model_CV.pkl')
joblib.dump(KNN_model_CV.best_estimator_, 'KNN_model_CV.pkl')
joblib.dump(model_tf_CV.best_estimator_, 'model_tf_CV.pkl')

Out[225]:

['model_tf_CV.pkl']

In [226]:

list_model  = ['Logistic_model_cv', 'DT_model_CV', 'RF_model_CV', 'XGB_model_CV', 'SVM_model_CV','KNN_model_CV','model_tf_CV']

In [227]:

len(list_model)

Out[227]:

In [228]:

models = {}
for mdl in list_model:
    models[mdl] = joblib.load('{}.pkl'.format(mdl))

In [229]:

models.values()

Out[229]:

dict_values([LogisticRegression(C=40), DecisionTreeClassifier(max_depth=5, min_samples_leaf=8, random_state=1234), RandomForestClassifier(max_depth=8, n_estimators=250), XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.4, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=8, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=25, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...), SVC(C=10), KNeighborsClassifier(metric='manhattan'), <keras.wrappers.scikit_learn.KerasClassifier object at 0x000002518C92C070>])

In [230]:

def evaluate_model_AA(list_model, X, y):
    df = pd.DataFrame(columns=['Model','Accuracy','Precision','Recall','F1_score','Latency'])
    
    for n in range(len(list_model)):
        name = list(models.keys())[n]
        mdl = list(models.values())[n]
        
        start = time()
        pred = mdl.predict(X)
        end = time()
        accuracy = round(accuracy_score(y, pred), 3)
        precision = round(precision_score(y, pred), 3)
        recall = round(recall_score(y, pred), 3)
        f1 = round(f1_score(y, pred), 3)
  
        new_row = {'Model'   : name
                , 'Accuracy'     : accuracy
                , 'Precision'    : precision 
                , 'Recall'       : recall
                , 'F1_score'     : f1
                , 'Latency'      : round((end - start)*1000, 1)
                      }
        df = df.append(new_row, ignore_index=True)
    return(df)

In [231]:

evaluate_model_AA(list_model, X_train, y_train)

20/20 [==============================] - 0s 2ms/step

Out[231]:

	Model	Accuracy	Precision	Recall	F1_score	Latency
0	Logistic_model_cv	0.807	0.762	0.724	0.742	2.1
1	DT_model_CV	0.849	0.825	0.770	0.797	1.0
2	RF_model_CV	0.913	0.915	0.854	0.883	72.7
3	XGB_model_CV	0.894	0.874	0.845	0.860	4.3
4	SVM_model_CV	0.825	0.801	0.724	0.760	32.9
5	KNN_model_CV	0.864	0.847	0.787	0.816	33.0
6	model_tf_CV	0.616	0.000	0.000	0.000	180.6

In [232]:

evaluate_model_AA(list_model, X_test, y_test)

5/5 [==============================] - 0s 2ms/step

Out[232]:

	Model	Accuracy	Precision	Recall	F1_score	Latency
0	Logistic_model_cv	0.754	0.659	0.617	0.637	2.0
1	DT_model_CV	0.806	0.744	0.681	0.711	3.0
2	RF_model_CV	0.799	0.738	0.660	0.697	56.8
3	XGB_model_CV	0.821	0.767	0.702	0.733	7.2
4	SVM_model_CV	0.828	0.800	0.681	0.736	9.3
5	KNN_model_CV	0.791	0.694	0.723	0.708	8.0
6	model_tf_CV	0.649	0.000	0.000	0.000	87.7

In [233]:

evaluate_model_AA(list_model, X_val, y_val)

5/5 [==============================] - 0s 2ms/step

Out[233]:

	Model	Accuracy	Precision	Recall	F1_score	Latency
0	Logistic_model_cv	0.784	0.814	0.625	0.707	1.9
1	DT_model_CV	0.843	0.872	0.732	0.796	3.2
2	RF_model_CV	0.881	0.955	0.750	0.840	41.8
3	XGB_model_CV	0.873	0.915	0.768	0.835	2.0
4	SVM_model_CV	0.843	0.889	0.714	0.792	9.7
5	KNN_model_CV	0.821	0.833	0.714	0.769	6.5
6	model_tf_CV	0.582	0.000	0.000	0.000	89.2

Now using tbe table we can conculde which model can be used for our classfication problem. In conclusion, our journey through the world of classification models in Python has been quite exciting. We've explored the significance of feature engineering, fine-tuning, and the art of selecting the right algorithm for the task at hand. From the Titanic dataset's historical narrative to the broader applications of classification in data science, we've gained valuable insights and practical knowledge. Remember that while Python provides a rich ecosystem for building classifiers, the real mastery lies in understanding your data, asking the right questions, understanding business intricacies and iteratively improving your models. Classification models have the potential to solve real-world problems, from fraud detection to medical diagnoses and beyond. Happy Learning!

Pages

Classification Modeling with Python