Data Wrangling with Pandas FINAL-Copy1

Data Wrangling with Pandas¶

Data wrangling is the practice of converting data from a "raw" form into a user-ready form for descriptive analytics and provide feed for other horizons of analytics such as predictive analytics.

*Pandas is a data-centric package(library) of Python eco-system for importing, manipulating, managing and analyzing data. This library was originally built on NumPy, the fundamental library for scientific computation in Python¶

*A DataFrame is a two-dimensional data structure. We can perform many operations on these datasets such as arithmetic operation, columns/rows selection, columns/rows addition etc.¶

Module 0: Before jumping to Pandas, let's cover few basics of Python¶

In [1]:

# There are various type of variables in Python
a =  1                 # integer
b = 1.08               # float
c = "Ask Analytics"    # string
d = True               #boolean

print("a is of type: ", type(a) , ", b is of type: ",type(b), ", c is of type: ",type(c), ", d is of type: ",type(d))

a is of type:  <class 'int'> , b is of type:  <class 'float'> , c is of type:  <class 'str'> , d is of type:  <class 'bool'>

In [2]:

# We can convert the type of variables:
a = 1
b = float(a)
b

Out[2]:

1.0

In [3]:

a = 1
b = str(a)
b

Out[3]:

'1'

In [4]:

#Python name rules:
#Names are case sensitive, can not start with number or any special character except underscore ('_'). 
#Names can not be key words such as :  False, True, None, lambda, etc.

In [5]:

#Like any language, there are basic operators in Python as well. 
#Most of the operators are same as those in SQL or any other languages.

#-------------- Arithmetic Operators ------------------------------------------

#+  Addition          -  Subtraction    *  Multiplication    / Division     
#%  Modulus           ** Power          // Floor Division

#------------ Comparison Operators ------------------------------------------

#== Equal                          != Not equal
#> more than                    < less than
#>= more than or equal to       <= less than or equal to

#------------ Logical Operators ------------------------------------------

#and     (at places we also use '&' for and)
#or      (we can also use '|' for or )
#not     (we can also use '~' for not, at places)

In [6]:

# Arithmetic Operators
print('5 + 4 = ' , 5 + 4)
print('5 - 4 = ' , 5 - 4)
print('5 * 4 = ' , 5 * 4)
print('5 / 4 = ' , 5 / 4)
print('7 % 4 = ' , 7 % 4)
print('7 // 4 = ', 7 // 4)
print('7 ** 4 = ', 7 ** 4)

5 + 4 =  9
5 - 4 =  1
5 * 4 =  20
5 / 4 =  1.25
7 % 4 =  3
7 // 4 =  1
7 ** 4 =  2401

In [7]:

# Comparison Operators
print('5 == 4', 5 == 4)
print('5 != 4', 5 != 4)
print('5 <  4', 5 < 4)
print('5 >  4', 5 > 4)
print('5 <= 4', 5 <= 4)
print('5 >= 4', 5 >= 4)

5 == 4 False
5 != 4 True
5 <  4 False
5 >  4 True
5 <= 4 False
5 >= 4 True

In [8]:

# Logical Operators
print('5 > 4 or 3 < 2 will resolve to:', 5 > 4 or 3 < 2)
print('5 > 4 and 3 < 2 will resolve to:', 5 > 4 and 3 < 2)
print('not(5 > 4 and 3 < 2) will resolve to:', not(5 > 4 and 3 < 2))

5 > 4 or 3 < 2 will resolve to: True
5 > 4 and 3 < 2 will resolve to: False
not(5 > 4 and 3 < 2) will resolve to: True

Loops in Python¶

In [9]:

for number in range(4):
    print(number)

In [10]:

for number in range(1, 10, 2):
    print(number, end= ' ')    # default end is 'end of line'm but you can customize it

1 3 5 7 9

In [11]:

for character in "Ask Analytics":
    print(character, end= '#')

A#s#k# #A#n#a#l#y#t#i#c#s#

In [12]:

for outer_loop_number in range(0, 2):
    # above one is an outer loop, and below would be a inner loop
    for inner_loop_character in "XYZ":
        
        print(outer_loop_number, inner_loop_character)

0 X
0 Y
0 Z
1 X
1 Y
1 Z

In [13]:

#While Loop, it keep on running, until the condition is true
x = 0
while (x <= 5):
    print(x)
    x = x + 2

0
2
4

We can use three methods to jump in the loops : break, continue and pass¶

In [14]:

#Break statement - 
x =  0

for x in range(10):
    if x  == 7:
        break
    print(x)    
# Basiscally, it kept on running until, it encountered 7
# the moment it met the condition for "break", it came out of the loop

In [15]:

#Continue statement - 
x =  0

for x in range(10):
    if x  == 7:
        continue
    print(x)    
# It kept on running till end, but when it met the condition of "continue", it skipped that particular iteration only
# so in simpler english, it is "skip"

In [16]:

#pass statement - 
x =  0

for x in range(10):
    if x  == 7:
        pass
    print(x)    
# With pass, it actually didn't do anything,
# but the pass statement became a "placeholder", where in future you can put some code with action 

Basic data strucures in Python : List, tuples, dictionary and sets¶

In [17]:

# Here is an example of List
list_1 = [1,3,5,7,9]
print(list_1)

[1, 3, 5, 7, 9]

In [18]:

# In python, index of elements start with 0 from left: 0--> 1-->2-->3. 
# From right side index is like:   -4  <--  -3  <--  -2  <-- -1

# That's why in this tutorial, index of modules starts with zero!

In [19]:

list_1[0] # first element

Out[19]:

In [20]:

list_1[-1] # last element

Out[20]:

In [21]:

list_1[1:4] # element with index 1, 2 and 3. so in the range, last one is not included

Out[21]:

[3, 5, 7]

In [22]:

list_1[0:5:2] # element with index 0,2,4

Out[22]:

[1, 5, 9]

In [23]:

list_1[::-1] # reversed list

Out[23]:

[9, 7, 5, 3, 1]

In [24]:

list_1.append(11)   # append can add only one element (individual element or one list) at a time
list_1

Out[24]:

[1, 3, 5, 7, 9, 11]

In [25]:

list_1.append([13,15])
list_1

Out[25]:

[1, 3, 5, 7, 9, 11, [13, 15]]

In [26]:

list_1.extend([13,15])  # extend can add more than one element at a time
list_1

Out[26]:

[1, 3, 5, 7, 9, 11, [13, 15], 13, 15]

In [27]:

list_1.insert(2,'x')  # insert 'x' at 2nd index position
list_1

Out[27]:

[1, 3, 'x', 5, 7, 9, 11, [13, 15], 13, 15]

You can remove, find index, identify if any element is one list or not etc. But for now, we can't to focus more of Pandas

Module 1: Importing data in as pandas DataFrame - with read_csv and read_excel and read_json¶

Let's first read CSV¶

In [28]:

import pandas as pd

In [29]:

#importing a simple csv file with pd.read_csv
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head()

Out[29]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com
3	150084	4/28/2015 21:13	5/4/2015 13:54	43650	ATGTXKYKUDUQN	SEO	Safari	M	3.840542e+09	0	health_care	7/3/1974	Sinead M Carews	sinead.carews@gmail.com
4	221365	7/21/2015 7:09	9/9/2015 18:40	45016	NAUITBZFJKHWW	Ads	Safari	M	4.155831e+08	0	home_essentials	8/25/1970	Laurence H Frosty	laurence.frosty@tntech.edu

In [30]:

#importing a csv file where data starts at line#5, first the incorrect way
df0 = pd.read_csv("input_datsets/Sample_Fraud_Detection _with_comments.csv")
df0.head()

Out[30]:

	This is a sample data	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13
0	it has 14 columns	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	it has 1000 records	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
4	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	732758368.8	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com

In [31]:

#importing a csv file where data starts at line#5, now the correct way
df1 = pd.read_csv("input_datsets/Sample_Fraud_Detection _with_comments.csv", skiprows= 4)
df1.columns = df.columns
df1.head()

Out[31]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com
3	150084	4/28/2015 21:13	5/4/2015 13:54	43650	ATGTXKYKUDUQN	SEO	Safari	M	3.840542e+09	0	health_care	7/3/1974	Sinead M Carews	sinead.carews@gmail.com
4	221365	7/21/2015 7:09	9/9/2015 18:40	45016	NAUITBZFJKHWW	Ads	Safari	M	4.155831e+08	0	home_essentials	8/25/1970	Laurence H Frosty	laurence.frosty@tntech.edu

In [32]:

# Skiprow method can also be used along with nrows option to read specific rows of the data, especially when data is huge
df1 = pd.read_csv("input_datsets/Sample_Fraud_Detection _with_comments.csv", skiprows= 20, nrows = 5)
df1.columns = df.columns
df1.head()

Out[32]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	category	dob	name	email
0	78986	5/15/2015 3:52	8/11/2015 2:29	66143	TGHVAWBMZRDHH	SEO	FireFox	M	3.503883e+09	apparels	4/12/1992	Simon W Jessope	simon.jessope@gmail.com
1	119824	3/20/2015 0:31	4/5/2015 7:31	51106	WFIIFCPIOGMHT	Ads	Safari	M	1.314238e+05	cosmetics	1/16/1977	Herbert P Lashb	herbert.lashb@gmail.com
2	357386	2/3/2015 0:48	3/24/2015 18:27	63268	NWSVDOHYIOBDR	Ads	FireFox	M	3.037372e+09	apparels	9/20/1991	Tim Z Hepherp	tim.hepherp@hotmail.com
3	289172	7/17/2015 5:48	11/12/2015 22:08	46018	KFZGQIWDIRLZB	Direct	FireFox	F	1.044590e+09	cosmetics	8/19/1962	Isabel U Reynoldsb	isabel_reynoldsb@rigginglifting.com
4	154699	7/8/2015 15:35	7/20/2015 0:11	23319	TYZPZLEWNJJRI	Direct	FireFox	F	3.847612e+09	apparels	10/6/1992	Christopher Z Doeringv	christopher_doeringv@gmail.com

In [33]:

#importing selective columns
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv", usecols= ['user_id','purchase_value','purchase_time'])
df.head()

Out[33]:

	user_id	purchase_time	purchase_value
0	22058	4/18/2015 2:47	65278
1	333320	6/8/2015 1:38	96399
2	1359	1/1/2015 18:52	57296
3	150084	5/4/2015 13:54	43650
4	221365	9/9/2015 18:40	45016

Let's now read Excel file¶

In [34]:

#importing an excel file with pd.read_excel
df_read_excel = pd.read_excel("input_datsets/Sample_Fraud_Detection.xlsx") # will import the first tab by default
excel_data_df = pd.read_excel('input_datsets/Sample_Fraud_Detection.xlsx', sheet_name='Purchase')  # will import the specific tab
excel_data_df.head()

Out[34]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2015-02-24 22:55:49	2015-04-18 02:47:11	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	1976-02-22	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	2015-06-07 20:39:50	2015-06-08 01:38:54	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0	apparels	1962-01-02	Rick D Rothackerj	rick_rothackerj@yahoo.com
2	1359	2015-01-01 18:52:44	2015-01-01 18:52:45	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	1962-03-05	Harriet M McLeodd	harriet.mcleodd@gmail.com
3	150084	2015-04-28 21:13:25	2015-05-04 13:54:50	43650	ATGTXKYKUDUQN	SEO	Safari	M	3.840542e+09	0	health_care	1974-07-03	Sinead M Carews	sinead.carews@gmail.com
4	221365	2015-07-21 07:09:52	2015-09-09 18:40:53	45016	NAUITBZFJKHWW	Ads	Safari	M	4.155831e+08	0	home_essentials	1970-08-25	Laurence H Frosty	laurence.frosty@tntech.edu

Read Json (JavaScript Object Notation) files¶

In [35]:

# Recently, Json has become a popular format of data. Let's learn how to read and normalize a Json data file

In [36]:

# read the json file
data1 = pd.read_json('input_datsets/sample1.json')
data1

Out[36]:

	fruit	size	color
0	Apple	Large	Red

In [37]:

data2 = pd.read_json('input_datsets/sample2.json')
data2

Out[37]:

	firstName	lastName	gender	age	address	phoneNumbers
0	Joe	Jackson	male	28	{'streetAddress': '101', 'city': 'San Diego', ...	[{'type': 'home', 'number': '7349282382'}]

In [38]:

#Above data is difficult to digest, let's make it more edible

In [39]:

data2_part_1 = data2[['firstName','lastName','gender','age']]

In [40]:

# Address can be normalized following way, since it is a dictionary
Address = pd.json_normalize(data2.address, max_level=1)
Address

Out[40]:

	streetAddress	city	state
0	101	San Diego	CA

In [41]:

# phoneNumbers can be normalized following way, since it is a dictionary within a List
phoneNumbers = pd.json_normalize(data2.phoneNumbers[0], max_level=1)
phoneNumbers

Out[41]:

	type	number
0	home	7349282382

In [42]:

FINAL_data2 = pd.concat([data2_part_1,Address,phoneNumbers],axis = 1)
FINAL_data2

Out[42]:

	firstName	lastName	gender	age	streetAddress	city	state	type	number
0	Joe	Jackson	male	28	101	San Diego	CA	home	7349282382

In [43]:

data3 = pd.read_json('input_datsets/sample3.json')
data3

Out[43]:

	student_roll_no	details
0	101	{'name': 'Andew', 'age': 12, 'grade': 'A'}
1	102	{'name': 'Bhuvan', 'age': 18, 'grade': 'B'}
2	103	{'name': 'Clinton', 'age': 11, 'grade': 'A'}
3	104	{'name': 'Drake', 'age': 12, 'grade': 'C'}
4	105	{'name': 'Eisha', 'age': 13, 'grade': 'B'}
5	106	{'name': 'Farhan', 'age': 22, 'grade': 'C'}
6	107	{'name': 'Garima', 'age': 11, 'grade': 'A'}
7	108	{'name': 'Himanshu', 'age': 19, 'grade': 'A'}
8	109	{'name': 'Ishaan', 'age': 10, 'grade': 'D'}
9	110	{'name': 'Jason', 'age': 9, 'grade': 'B'}

In [44]:

student_roll_no = data3.student_roll_no
details = pd.json_normalize(data3.details)
FINAL_data3 = pd.concat([student_roll_no,details],axis = 1)
FINAL_data3

Out[44]:

	student_roll_no	name	age	grade
0	101	Andew	12	A
1	102	Bhuvan	18	B
2	103	Clinton	11	A
3	104	Drake	12	C
4	105	Eisha	13	B
5	106	Farhan	22	C
6	107	Garima	11	A
7	108	Himanshu	19	A
8	109	Ishaan	10	D
9	110	Jason	9	B

In [45]:

data4 = pd.read_json('input_datsets/sample4.json')
data4

Out[45]:

	people
0	{'firstName': 'Joe', 'lastName': 'Jackson', 'g...
1	{'firstName': 'James', 'lastName': 'Smith', 'g...
2	{'firstName': 'Emily', 'lastName': 'Jones', 'g...

In [46]:

# phoneNumbers can be normalized following way, since it is a dictionary within a List
data4_1 = pd.json_normalize(data4.people[0], max_level=1)
data4_1

Out[46]:

	firstName	lastName	gender	age	number
0	Joe	Jackson	male	28	7349282382

!! There is also a json module in Python for reading complex json files.

Module 2: Let's do some basic check of data¶

In [47]:

# check top 5 rows
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head()

Out[47]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com
3	150084	4/28/2015 21:13	5/4/2015 13:54	43650	ATGTXKYKUDUQN	SEO	Safari	M	3.840542e+09	0	health_care	7/3/1974	Sinead M Carews	sinead.carews@gmail.com
4	221365	7/21/2015 7:09	9/9/2015 18:40	45016	NAUITBZFJKHWW	Ads	Safari	M	4.155831e+08	0	home_essentials	8/25/1970	Laurence H Frosty	laurence.frosty@tntech.edu

In [48]:

# check bottom 4 rows
df.tail(4)

Out[48]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	category	dob	name	email
995	328579	4/10/2015 1:02	6/25/2015 2:05	54775	UQCGEFBSOAVIJ	SEO	Chrome	M	1.038384e+09	electronics	4/6/1994	Praveen L Menonw	praveen.menonw@gmail.com
996	263174	7/14/2015 2:13	7/24/2015 1:25	84302	HVSGFVZSDBKEH	Ads	Safari	M	4.169080e+09	electronics	9/1/1979	Ra T Alpern	ra.alpern@flossresearch.com
997	117191	2/28/2015 17:14	4/14/2015 15:39	20193	GDHCTKIKPHENW	Direct	IE	M	1.411318e+09	apparels	11/17/1979	Julia D Edwardsj	julia.edwardsj@sutletgroup.com
998	65732	1/11/2015 11:10	1/22/2015 4:16	26225	VSMNAOFPSEQOL	Ads	IE	M	3.765208e+09	apparels	3/23/1988	Conor V Humphriesk	conor.humphriesk@gmail.com

In [49]:

# Basic summarization of a variable  "below \n has been used to mark end of the line"
print(
df['purchase_value'].mean(),"\n",
df['purchase_value'].max(),"\n",
df['purchase_value'].min(),"\n",
df['purchase_value'].sum()
     )

50992.49149149149 
 99743 
 1176 
 50941499

In [50]:

# mean of all the numeric variables
df.mean(numeric_only=True)

Out[50]:

user_id           1.963437e+05
purchase_value    5.099249e+04
ip_address        2.099747e+09
class             1.011011e-01
dtype: float64

In [51]:

df[['class','purchase_value']].sum()

Out[51]:

class                  101
purchase_value    50941499
dtype: int64

In [52]:

#Default is the descriptive statistics of only the numeric columns
df.describe()

Out[52]:

	user_id	purchase_value	ip_address	class
count	999.000000	999.000000	9.990000e+02	999.000000
mean	196343.719720	50992.491491	2.099747e+09	0.101101
std	116166.717945	28681.076653	1.262366e+09	0.301614
min	170.000000	1176.000000	1.314238e+05	0.000000
25%	95017.500000	26038.500000	9.856234e+08	0.000000
50%	194882.000000	52507.000000	2.076032e+09	0.000000
75%	297765.000000	75653.500000	3.196621e+09	0.000000
max	399975.000000	99743.000000	4.294263e+09	1.000000

In [53]:

#Descriptive statistics of all categorical column
df.describe(include=[object])

Out[53]:

	signup_time	purchase_time	device_id	source	browser	sex	category	dob	name	email
count	999	999	999	999	999	974	999	970	999	999
unique	995	998	994	3	5	2	5	917	999	998
top	1/7/2015 16:57	1/11/2015 16:59	BWSMVSLCJXMCM	Ads	Chrome	M	cosmetics	2/1/1993	Aaron P Maashoh	john.odonnell@gmail.com
freq	2	2	2	408	397	560	210	3	1	2

In [54]:

#Descriptive statistics of any column
df["device_id"].describe()

Out[54]:

count               999
unique              994
top       BWSMVSLCJXMCM
freq                  2
Name: device_id, dtype: object

Use following codes to identify how many NULL values are there in each field¶

In [55]:

df.isna().sum()

Out[55]:

user_id            0
signup_time        0
purchase_time      0
purchase_value     0
device_id          0
source             0
browser            0
sex               25
ip_address         0
class              0
category           0
dob               29
name               0
email              0
dtype: int64

In [56]:

df.isnull().sum()  # sames as above

Out[56]:

user_id            0
signup_time        0
purchase_time      0
purchase_value     0
device_id          0
source             0
browser            0
sex               25
ip_address         0
class              0
category           0
dob               29
name               0
email              0
dtype: int64

In [57]:

df.nunique()   # gives count distinct values in each field

Out[57]:

user_id           999
signup_time       995
purchase_time     998
purchase_value    994
device_id         994
source              3
browser             5
sex                 2
ip_address        995
class               2
category            5
dob               917
name              999
email             998
dtype: int64

In [58]:

#Unique values of a single column
df.browser.nunique()

Out[58]:

In [59]:

df.browser.value_counts()

Out[59]:

Chrome     397
IE         260
Safari     166
FireFox    155
Opera       21
Name: browser, dtype: int64

.count_values() - It returns counts of unique values of all categories of a variable. It can also be used to create a cross-frequcncy table.¶

This is equivalent to proc Freq functionality in SAS¶

In [60]:

x = df[['browser','source']].value_counts() # Saving the results in a DataFrame
print(x)

browser  source
Chrome   SEO       165
         Ads       151
IE       Ads       109
         SEO       103
Chrome   Direct     81
Safari   Ads        75
FireFox  Ads        64
         SEO        64
Safari   SEO        58
IE       Direct     48
Safari   Direct     33
FireFox  Direct     27
Opera    Ads         9
         SEO         9
         Direct      3
dtype: int64

In [61]:

# Another way to do frequency analysis is - crosstab. 
# Crosstab is frequency analysis with two variables
pd.crosstab(df['browser'], df['sex'])

Out[61]:

sex	F	M
browser
Chrome	167	218
FireFox	61	89
IE	110	144
Opera	8	11
Safari	68	98

In [62]:

#More on such summarization will be covered in Module #14

Module 3: Useful fuctions to get basic information about the data¶

This is equivalent to Proc Contents of SAS, which gives out the Meta-data of the dataset¶

.info()- will also give information about missing values¶

In [63]:

df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   user_id         999 non-null    int64  
 1   signup_time     999 non-null    object 
 2   purchase_time   999 non-null    object 
 3   purchase_value  999 non-null    int64  
 4   device_id       999 non-null    object 
 5   source          999 non-null    object 
 6   browser         999 non-null    object 
 7   sex             974 non-null    object 
 8   ip_address      999 non-null    float64
 9   class           999 non-null    int64  
 10  category        999 non-null    object 
 11  dob             970 non-null    object 
 12  name            999 non-null    object 
 13  email           999 non-null    object 
dtypes: float64(1), int64(3), object(10)
memory usage: 109.4+ KB

In [64]:

df.columns

Out[64]:

Index(['user_id', 'signup_time', 'purchase_time', 'purchase_value',
       'device_id', 'source', 'browser', 'sex', 'ip_address', 'class',
       'category', 'dob', 'name', 'email'],
      dtype='object')

In [65]:

len(df)  # gives numbers of records

Out[65]:

In [66]:

df.shape  # gives numbers of records and number of fields in data

Out[66]:

(999, 14)

Module 4 : Subsetting data Vertically and Horizontally (selecting subset of a DataFrame)¶

Selecting only few columns from a DataFrame¶

In [67]:

df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
sample = df.name
sample.head(3)

Out[67]:

0      Aaron P Maashoh
1    Rick D Rothackerj
2    Harriet M McLeodd
Name: name, dtype: object

In [68]:

sample = df[["name","dob","sex"]]
sample.head(3)

Out[68]:

	name	dob	sex
0	Aaron P Maashoh	2/22/1976	M
1	Rick D Rothackerj	1/2/1962	F
2	Harriet M McLeodd	3/5/1962	M

In [69]:

#### Selecting few rows from DataFrame

In [70]:

s10 = df[10:13]      # this will give us data from 10th to 12th record
s10

Out[70]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	category	dob	name	email
10	199700	7/11/2015 18:26	10/28/2015 21:59	7603	TEPSJVVXGNTYR	Ads	Safari	F	1.819009e+09	health_care	6/17/1980	Matt U Falloonz	matt.falloonz@brandermill.com
11	73884	5/29/2015 16:22	6/16/2015 5:45	69508	ZTZZJUCRDOCJZ	Direct	Chrome	M	4.038285e+09	cosmetics	6/7/1983	Emily H Flitterl	emily.flitterl@tntech.edu
12	79203	6/16/2015 21:19	6/21/2015 3:29	31999	IBPNKSMCKUZWD	SEO	Safari	M	4.161541e+09	apparels	6/19/1982	Olivia X Oranr	olivia_oranr@sutletgroup.com

Location (index) and label based subsetting of data¶

.loc[] and iloc[] for accessing the elements of a dataframe¶

.loc[] is LABEL-based to select rows and/or columns in pandas¶

.iloc[] is a INDEX-based to select rows and/or columns in pandas.¶

In [71]:

# Dictionary with key:value pair
technologies = {'Courses':["Spark","PySpark","Hadoop","Python","pandas"],
                'Fee' :[20000,25000,26000,22000,24000],
                'Duration':['30day','40days','35days','40days','60days'],
                'Discount':[1000,2300,1200,2500,2000]
              }

simple_data = pd.DataFrame(technologies)
simple_data

Out[71]:

	Courses	Fee	Duration	Discount
0	Spark	20000	30day	1000
1	PySpark	25000	40days	2300
2	Hadoop	26000	35days	1200
3	Python	22000	40days	2500
4	pandas	24000	60days	2000

We can make a column as Index as well¶

In [72]:

new_data = simple_data.set_index('Courses')
new_data

Out[72]:

	Fee	Duration	Discount
Courses
Spark	20000	30day	1000
PySpark	25000	40days	2300
Hadoop	26000	35days	1200
Python	22000	40days	2500
pandas	24000	60days	2000

or can make a custom index¶

In [73]:

index_labels=['row1','row2','row3','row4','row5']

data_with_named_index = pd.DataFrame(technologies,index=index_labels)
data_with_named_index

Out[73]:

	Courses	Fee	Duration	Discount
row1	Spark	20000	30day	1000
row2	PySpark	25000	40days	2300
row3	Hadoop	26000	35days	1200
row4	Python	22000	40days	2500
row5	pandas	24000	60days	2000

In [74]:

# Select Single Row by Index 'Label': Selecting a single column using the ".loc" attribute
simple_data.loc[1]

Out[74]:

Courses     PySpark
Fee           25000
Duration     40days
Discount       2300
Name: 1, dtype: object

In [75]:

# Now let's try to use loc on the dataframe with named Index -- This should not work, You can cry without #
#data_with_named_index.loc[1]

In [76]:

# Select Single Row by Index Label: Selecting a single column using the ".loc" attribute
print(data_with_named_index.loc['row2'])

Courses     PySpark
Fee           25000
Duration     40days
Discount       2300
Name: row2, dtype: object

In [77]:

# Select Single Row by Index
print(data_with_named_index.iloc[1])

Courses     PySpark
Fee           25000
Duration     40days
Discount       2300
Name: row2, dtype: object

In [78]:

#this will give error
# print(data_with_named_index.iloc['row2']) 

In [79]:

 # this will give error
# df.loc[:,1:5]

In [80]:

# Select a set of Columns between two Labels(column-names)
slice0 = df.loc[:,'device_id':'class']
slice0.head(3)

Out[80]:

	device_id	source	browser	sex	ip_address	class
0	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0
1	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0
2	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1

In [81]:

# Select Columns between two Indexes: 
# Includes Index 1 & Excludes 4
slice00 = df.iloc[:,1:4]
slice00.head(3)

Out[81]:

	signup_time	purchase_time	purchase_value
0	2/24/2015 22:55	4/18/2015 2:47	65278
1	6/7/2015 20:39	6/8/2015 1:38	96399
2	1/1/2015 18:52	1/1/2015 18:52	57296

In [82]:

#Access elements with index label 40 to 100 by 10, we make use of .loc[]
slice1 = df.loc[40:70:10]
slice1

Out[82]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
40	81113	8/16/2015 12:56	8/24/2015 5:00	31921	BKQVBSSFGETUQ	Direct	Chrome	F	3.682312e+09	0	home_essentials	11/14/1971	Michael L Ermanl	michael.ermanl@brandermill.com
50	254562	5/12/2015 18:58	7/31/2015 11:36	56245	VYSNKKTTQIXQC	Ads	FireFox	F	2.977122e+09	0	health_care	7/5/1970	Sinead P Carewi	sinead.carewi@yahoo.com
60	324683	1/18/2015 2:51	4/19/2015 22:24	34657	JFVSIORUJBMXS	Ads	IE	M	4.228028e+09	1	home_essentials	NaN	Dhanya E Skariachand	dhanya.skariachand@gmail.com
70	99299	7/22/2015 17:08	8/10/2015 20:44	65902	JTHVCCWTCLEWQ	Ads	IE	F	3.660879e+09	0	home_essentials	10/26/1991	Laura W Noonanp	laura.noonanp@yahoo.com

In [83]:

# selecting 0th, 2th, 4th, and 7th index rows
slice2 = df.loc[[0, 2, 4, 7]]
slice2

Out[83]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com
4	221365	7/21/2015 7:09	9/9/2015 18:40	45016	NAUITBZFJKHWW	Ads	Safari	M	4.155831e+08	0	home_essentials	8/25/1970	Laurence H Frosty	laurence.frosty@tntech.edu
7	360585	4/6/2015 7:35	5/25/2015 17:21	74769	HPUCUYLMJBYFW	Ads	Opera	M	1.692459e+09	0	health_care	11/24/1981	Lisa E Baertleinu	lisa.baertleinu@hotmail.com

In [84]:

#selecting all rows and only the columns from 4 to 7
slice3 = df.iloc[:,4:7]
print(slice3.head(),"\n",slice3.shape)

       device_id source browser
0  QVPSPJUOCKZAR    SEO  Chrome
1  EOGFQPIZPYXFZ    Ads  Chrome
2  YSSKYOSJHPPLJ    SEO   Opera
3  ATGTXKYKUDUQN    SEO  Safari
4  NAUITBZFJKHWW    Ads  Safari 
 (999, 3)

In [85]:

# selecting rows from 40 to 60 by 10 and columns from 4 to 7
slice4 = df.iloc[40:70:10,4:7]
slice4

Out[85]:

	device_id	source	browser
40	BKQVBSSFGETUQ	Direct	Chrome
50	VYSNKKTTQIXQC	Ads	FireFox
60	JFVSIORUJBMXS	Ads	IE

How to drop variables?¶

In [86]:

# option axis= is optional, but default is 0.
# 0 : 'index'
# 1 : 'columns'

In [87]:

# Remove columns by column-names
s1 = df.drop(['dob','name','email'], axis=1)
s1.head(2)    

Out[87]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	732758368.8	0	home_essentials
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	350311387.9	0	apparels

In [88]:

# Remove all columns between column index 1 to 4
s2 = df.drop(df.iloc[:, 1:5], axis=1)
s2.head(2)

Out[88]:

	user_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	SEO	Chrome	M	732758368.8	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	Ads	Chrome	F	350311387.9	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com

How to rename variables?¶

In [89]:

r1 = df.rename(columns = {'dob':'date_of_birth', 'sex':'gender'})
r1.head(2)

Out[89]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	gender	ip_address	class	category	date_of_birth	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	732758368.8	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	350311387.9	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com

In [90]:

# Use option (inplace = True) if willing to make changes to the existing DataFrame 
r1.rename(columns = {'date_of_birth':'Birth_date'}, inplace = True)
r1.head(2)

Out[90]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	gender	ip_address	class	category	Birth_date	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	732758368.8	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	350311387.9	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com

Module 5: Conditional filter and LOGICAL operators similar to where clause in SAS/SQL¶

Extract rows with a single condition using operators¶

In [91]:

above = df[df.purchase_value > 90000]
above.purchase_value.min()

Out[91]:

Extract rows with multiple conditions using operators " | ". It filters data similar to OR operater in where caluse of SAS/SQL.¶

In [92]:

chrm_ie = df[(df.browser == "Chrome") | (df.browser == "IE")]
chrm_ie.browser.value_counts()

Out[92]:

Chrome    397
IE        260
Name: browser, dtype: int64

.isin() : multiple "OR" conditions over a single column¶

In [93]:

chrm_ie = df[df.browser.isin(["Chrome", "IE","FireFox"])]
chrm_ie.browser.value_counts()

Out[93]:

Chrome     397
IE         260
FireFox    155
Name: browser, dtype: int64

use "&" symbol in pandas as an “AND” operator of SQL. df [(condition1) & (condition2)]¶

In [94]:

above_chrm = df[(df.purchase_value > 90000) & (df.browser != "Chrome")]
above_chrm.purchase_value.min()

Out[94]:

Using a NOT operator, to inverse above condition¶

In [95]:

above_chrm = df[~((df.purchase_value > 90000) & (df.browser != "Chrome"))]
above_chrm.purchase_value.max()

Out[95]:

use "between" to filter a numerical column between a range of values¶

In [96]:

btw = df[df['purchase_value'].between(80000, 90000)]

btw.purchase_value.min(), btw.purchase_value.max()

Out[96]:

(80046, 89915)

How to use contains condition in where clause¶

In [97]:

data_subset = df[df['source'].str.contains("SE")]
data_subset.source.value_counts()

Out[97]:

SEO    399
Name: source, dtype: int64

Unlike SAS and SQL, there is no direct LIKE % functionality in Python or Pandas, however we can use it indirectly or using regular expressions. We have covered it later in this tutorial.¶

Module 6 : Derive a new column(s)¶

SELECT *, purchase_value/82.52 as convt FROM tips;

In [98]:

df["Value_in_USD"] = df.purchase_value/82.52
df[["purchase_value","Value_in_USD"]].head()

Out[98]:

	purchase_value	Value_in_USD
0	65278	791.056714
1	96399	1168.189530
2	57296	694.328648
3	43650	528.962676
4	45016	545.516238

map() function for series [one column / field] is used to substitute each value in a series with another value¶

In [99]:

# "f" : This formatter is specifically used for formatting float values
# {,.2f} : Adds a comma and keeps to two decimal points
# {.2f} : keeps to two decimal points
# "$" : used to add a dollar currency sign

# You can mix and match from above option, like the below example,
# we are using a mix of dollar, comma and rounding to two decimal points [acutal value doesn't change]

In [100]:

df["Value_in_USD"] = (df.purchase_value/82.52).map('${:}'.format)
df[["purchase_value","Value_in_USD"]].head(2)

Out[100]:

	purchase_value	Value_in_USD
0	65278	$791.0567135239942
1	96399	$1168.189529810955

In [101]:

df["Value_in_USD"] = (df.purchase_value/82.52).map('${: ,.2f}'.format)
df[["purchase_value","Value_in_USD"]].head(2)

Out[101]:

	purchase_value	Value_in_USD
0	65278	$ 791.06
1	96399	$ 1,168.19

Add counter : column with incremental values from 1 onwards with range(start, stop, step)¶

In [102]:

start = 1
df['increment_by_1'] = range(start, start + df.shape[0])
df[['user_id', 'increment_by_1']].head()

Out[102]:

	user_id	increment_by_1
0	22058	1
1	333320	2
2	1359	3
3	150084	4
4	221365	5

In [103]:

print('Count of row :', df.shape[0], "\n",
      'Count of columns :', df.shape[1])

Count of row : 999 
 Count of columns : 16

In [104]:

# with step size 5
start = 0
df['increment_by_5'] = range(start, start + (5 * df.shape[0]), 5)
df[['user_id', 'increment_by_1', 'increment_by_5']].head()

Out[104]:

	user_id	increment_by_1	increment_by_5
0	22058	1	0
1	333320	2	5
2	1359	3	10
3	150084	4	15
4	221365	5	20

Module 6.1 :Text functions¶

.str has to be prefixed to tell the compiler that a Pandas function is being called.¶

.str.len() - finds the length of a character string¶

In [105]:

df["email_length"] = df.email.str.len()
df[["email","email_length"]].head()

Out[105]:

	email	email_length
0	aaron.maashoh@yahoo.com	23
1	rick_rothackerj@yahoo.com	25
2	harriet.mcleodd@gmail.com	25
3	sinead.carews@gmail.com	23
4	laurence.frosty@tntech.edu	26

.str.find() - finds the first occurenece position of a character in a column of strings, this is CASE Sensitive function¶

In [106]:

df["Position_of_@"] = df.email.str.find('@')
df["Position_of_a"] = df.email.str.find('a')
df[["email", "Position_of_@",'Position_of_a']].head()

Out[106]:

	email	Position_of_@	Position_of_a
0	aaron.maashoh@yahoo.com	13	0
1	rick_rothackerj@yahoo.com	15	9
2	harriet.mcleodd@gmail.com	15	1
3	sinead.carews@gmail.com	13	4
4	laurence.frosty@tntech.edu	15	1

.str.findall() - It is similar to the above mentioned find() function, however, it will return ALL the positions of the character instead¶

In [107]:

df["all_a"] = df.email.str.findall("a")
df[["email","all_a"]].head()

Out[107]:

	email	all_a
0	aaron.maashoh@yahoo.com	[a, a, a, a, a]
1	rick_rothackerj@yahoo.com	[a, a]
2	harriet.mcleodd@gmail.com	[a, a]
3	sinead.carews@gmail.com	[a, a, a]
4	laurence.frosty@tntech.edu	[a]

In [108]:

df["all_a_length"] = df.all_a.str.len()
df[["email","all_a", "all_a_length"]].head()

Out[108]:

	email	all_a	all_a_length
0	aaron.maashoh@yahoo.com	[a, a, a, a, a]	5
1	rick_rothackerj@yahoo.com	[a, a]	2
2	harriet.mcleodd@gmail.com	[a, a]	2
3	sinead.carews@gmail.com	[a, a, a]	3
4	laurence.frosty@tntech.edu	[a]	1

.str.split( )[ ] - extracts a word in a column based on a delimiter(splitter) and then reference the word by index, it is similar to the SCAN function in SAS¶

This is simple version of code, however later all the functions will be explained using Regular Expressions, which are more powerful, versatile and recommended¶

In [109]:

#expand - Expand the splitted strings into separate columns.
          #If True, return DataFrame/MultiIndex expanding dimensionality.
          #If False, return Series/Index, containing lists of strings.
df["email_id"] = df.email.str.split("@", expand = True)[0]
df["email_provider"] = df.email.str.split("@", expand = True)[1]
df[["email","email_id","email_provider"]].head()

Out[109]:

	email	email_id	email_provider
0	aaron.maashoh@yahoo.com	aaron.maashoh	yahoo.com
1	rick_rothackerj@yahoo.com	rick_rothackerj	yahoo.com
2	harriet.mcleodd@gmail.com	harriet.mcleodd	gmail.com
3	sinead.carews@gmail.com	sinead.carews	gmail.com
4	laurence.frosty@tntech.edu	laurence.frosty	tntech.edu

In [110]:

df.email.str.split("@",expand=False)

Out[110]:

0             [aaron.maashoh, yahoo.com]
1           [rick_rothackerj, yahoo.com]
2           [harriet.mcleodd, gmail.com]
3             [sinead.carews, gmail.com]
4          [laurence.frosty, tntech.edu]
                     ...                
994         [edward_krudyl, hotmail.com]
995          [praveen.menonw, gmail.com]
996       [ra.alpern, flossresearch.com]
997    [julia.edwardsj, sutletgroup.com]
998        [conor.humphriesk, gmail.com]
Name: email, Length: 999, dtype: object

In [111]:

df['fname'] = df.name.str.split(expand=True)[0]
df['mname'] = df.name.str.split(expand=True)[1]
df['lname'] = df.name.str.split(expand=True)[2]
df[['name','fname','mname','lname']].head()

# please note, default delimiter in pandas is SPACE

Out[111]:

	name	fname	mname	lname
0	Aaron P Maashoh	Aaron	P	Maashoh
1	Rick D Rothackerj	Rick	D	Rothackerj
2	Harriet M McLeodd	Harriet	M	McLeodd
3	Sinead M Carews	Sinead	M	Carews
4	Laurence H Frosty	Laurence	H	Frosty

.str[ : ] - extracts a substring from a string based on its position¶

please remember, index in Python starts with 0 so 2 means third position and 6th means 7th position, \n however last position is not included, hence it will return from 3rd to 6th position of the string¶

In [112]:

df['positional_part_of_string'] = df.device_id.str[2:6]
df[['device_id','positional_part_of_string']].head()

Out[112]:

	device_id	positional_part_of_string
0	QVPSPJUOCKZAR	PSPJ
1	EOGFQPIZPYXFZ	GFQP
2	YSSKYOSJHPPLJ	SKYO
3	ATGTXKYKUDUQN	GTXK
4	NAUITBZFJKHWW	UITB

.str.upper() .str.lower() .str.title()¶

In [113]:

df['uppercase'] = df.name.str.upper()
df['lowercase'] = df.name.str.lower()
df['propercase'] = df.name.str.title()
df[['uppercase', 'lowercase', 'propercase']].head()

Out[113]:

	uppercase	lowercase	propercase
0	AARON P MAASHOH	aaron p maashoh	Aaron P Maashoh
1	RICK D ROTHACKERJ	rick d rothackerj	Rick D Rothackerj
2	HARRIET M MCLEODD	harriet m mcleodd	Harriet M Mcleodd
3	SINEAD M CAREWS	sinead m carews	Sinead M Carews
4	LAURENCE H FROSTY	laurence h frosty	Laurence H Frosty

.str.contains() - checks if the string contains a particular substring¶

In [114]:

df['cont'] = df.category.str.contains("_")
df[['category','cont']].head()

Out[114]:

	category	cont
0	home_essentials	True
1	apparels	False
2	electronics	False
3	health_care	True
4	home_essentials	True

How do we apply above formula to a data seubseting¶

In [115]:

new_data = df[df.category.str.contains("_")]
new_data.category

Out[115]:

0      home_essentials
3          health_care
4      home_essentials
7          health_care
10         health_care
            ...       
982        health_care
986    home_essentials
988        health_care
991    home_essentials
993        health_care
Name: category, Length: 381, dtype: object

str.replace() - to replace a part of the string with another one¶

In [116]:

df['email_new'] = df.email.str.replace('.com','.org',regex=False)
df['email_new'] = df.email_new.str.replace('.edu','.org',regex=False)
df[["email", "email_new"]].head()

Out[116]:

	email	email_new
0	aaron.maashoh@yahoo.com	aaron.maashoh@yahoo.org
1	rick_rothackerj@yahoo.com	rick_rothackerj@yahoo.org
2	harriet.mcleodd@gmail.com	harriet.mcleodd@gmail.org
3	sinead.carews@gmail.com	sinead.carews@gmail.org
4	laurence.frosty@tntech.edu	laurence.frosty@tntech.org

.replace() : It is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values. It requires a dictionary to define the mapping {from : to}.¶

In [117]:

df["sex_long"] = df["sex"].replace({"M": "Male", "F": "Female"})
df[["sex","sex_long"]].head()

Out[117]:

	sex	sex_long
0	M	Male
1	F	Female
2	M	Male
3	M	Male
4	M	Male

.str.cat() - Concatenating string columns in small datasets¶

In [118]:

df['fname'] = df.name.str.split(expand=True)[0]
df['mname'] = df.name.str.split(expand=True)[1]
df['lname'] = df.name.str.split(expand=True)[2]
df[['name','fname','mname','lname']].head()

df["Full_Name"] = df.fname.str.cat(df.lname)
df[["name","fname", "lname", "Full_Name"]].head()

df["Full_Name_NEW"] = df.fname.str.cat(df.lname, sep=' ')
df[['name',"fname", "lname",  "Full_Name", "Full_Name_NEW"]].head()

df["Full_Name_FULL"] = df.fname.str.cat(df.lname, sep=' ').str.cat(df.mname,sep=' ')
df[['name',"fname", "lname",  "Full_Name", "Full_Name_NEW","Full_Name_FULL"]].head()

Out[118]:

	name	fname	lname	Full_Name	Full_Name_NEW	Full_Name_FULL
0	Aaron P Maashoh	Aaron	Maashoh	AaronMaashoh	Aaron Maashoh	Aaron Maashoh P
1	Rick D Rothackerj	Rick	Rothackerj	RickRothackerj	Rick Rothackerj	Rick Rothackerj D
2	Harriet M McLeodd	Harriet	McLeodd	HarrietMcLeodd	Harriet McLeodd	Harriet McLeodd M
3	Sinead M Carews	Sinead	Carews	SineadCarews	Sinead Carews	Sinead Carews M
4	Laurence H Frosty	Laurence	Frosty	LaurenceFrosty	Laurence Frosty	Laurence Frosty H

"+" - Concatenating string columns in larger datasets¶

In [119]:

df["FULL_NAME"] = df.fname + " " + df.mname + " " + df.lname
df[["fname", "mname", "lname", "FULL_NAME"]].head()

Out[119]:

	fname	mname	lname	FULL_NAME
0	Aaron	P	Maashoh	Aaron P Maashoh
1	Rick	D	Rothackerj	Rick D Rothackerj
2	Harriet	M	McLeodd	Harriet M McLeodd
3	Sinead	M	Carews	Sinead M Carews
4	Laurence	H	Frosty	Laurence H Frosty

str.lstrip() is used to remove spaces from the left side of string¶

str.rstrip() to remove spaces from right side of the string and¶

str.strip() removes spaces from both sides¶

In [120]:

details = {'Name' : [' Christophor', 'Robin ', ' Samantha ', '  Julia   '],
            'Age' : [23, 21, 22, 21],
            'University' : ['BHU', 'JNU', 'DU', 'BHU'],
          }
sample = pd.DataFrame(details)
sample

Out[120]:

	Name	Age	University
0	Christophor	23	BHU
1	Robin	21	JNU
2	Samantha	22	DU
3	Julia	21	BHU

In [121]:

sample["leading_blanks"] = sample.Name.str.lstrip()
sample["trailing_blanks"] = sample.Name.str.rstrip()
sample["both_blanks"] = sample.Name.str.strip()
print(sample)

           Name  Age University leading_blanks trailing_blanks  both_blanks
0   Christophor   23        BHU    Christophor     Christophor  Christophor
1        Robin    21        JNU         Robin            Robin        Robin
2     Samantha    22         DU      Samantha         Samantha     Samantha
3      Julia      21        BHU       Julia              Julia        Julia

.str.[start:stop:step] - This expression extracts all the characters from start to stop and by stip as − 1 is to reverse the order¶

In [122]:

sample['reverse'] = sample.University.str[::-1]
sample[["University", "reverse"]]

Out[122]:

	University	reverse
0	BHU	UHB
1	JNU	UNJ
2	DU	UD
3	BHU	UHB

Module 7 : Regular Expressions (popular by name "regex")¶

Regular Expression is a sequence of characters that make up a pattern to find, replace, and extract textual data.¶

The built-in library in Python for handling regular expression is "re"¶

In [123]:

import re

*************************** Popular functions with re ***************************

1) re.findall(pattern, text) — Returns all the matched strings in a list.¶

In [124]:

text = 'There was further spike in negative sentiments with the outbreak.'
re.findall("the", text)

Out[124]:

['the', 'the']

*Note : Above code searches for all 'the' in the text and returns a list of all matched strings.

Use the parameter flags=re.I or flags=re.IGNORECASE for case-insensitive¶

In [125]:

text = 'There was further spike in negative sentiments with the outbreak.'
re.findall("the", text, flags=re.IGNORECASE)

Out[125]:

['The', 'the', 'the']

2) re.search(pattern, text) — Returns the first occurrence of the match as a match object.¶

In [126]:

text = 'There was further spike in negative sentiments with the outbreak.'
re.search("the", text)

Out[126]:

<re.Match object; span=(13, 16), match='the'>

Here, the match object contains information about the matched string: 1). its span (start and end position in the text), and 2). the match string itself To further extract the details of Match-object use : .group(), .span(), .start(), and .end()

In [127]:

print(re.search("the", text).group(),"\n",
      re.search("the", text).span(), "\n",
      re.search("the", text).start(), "\n",
      re.search("the", text).end() )

the 
 (13, 16) 
 13 
 16

3) re.finditer(pattern, text) — Returns an iterator of match objects that we then wrap with a list.¶

In [128]:

text = 'There was further spike in negative sentiments with the outbreak.'
match = re.finditer('the', text, flags=re.I)
x = list(match)

In [129]:

Out[129]:

[<re.Match object; span=(0, 3), match='The'>,
 <re.Match object; span=(13, 16), match='the'>,
 <re.Match object; span=(52, 55), match='the'>]

In [130]:

x[-1].span()

Out[130]:

(52, 55)

4) re.sub(pattern, repl, text) — Replaces the matched substring(s) with the 'repl' string.¶

This is equivalent to TRANWRD of SAS¶

In [131]:

text = 'my pincode is 201007'
re.sub('10', '**', text)

Out[131]:

'my pincode is 20**07'

5) re.split(pattern, text) — Splits the text at the position of the match(es), into elements in a list.¶

You can use this function as an equivalent of SCAN function of SAS¶

In [132]:

text = "We are excited to launch our new Product. #newcourse #machinelearning #python"
re.split("#", text)

Out[132]:

['We are excited to launch our new Product. ',
 'newcourse ',
 'machinelearning ',
 'python']

Suppose you need the second word with respect to a delimiter, in this case #

In [133]:

text = "We are excited to launch our new Product. #newcourse #machinelearning #python"
re.split("#", text)[1]

Out[133]:

'newcourse '

Regex MetaCharacters¶

In [134]:

# Basic meta charecters used in regular expressions
#    \ -> helps take then following character literally
#    . -> matches a single character
#    
#    * -> matches 'zero' or more occourrence of the previous charecter
#    + -> matches 'one' or more occourrence of the previous charecter
#    
#    ^ -> matches any character start of a string
#    $ -> matches any character end of a string
#    
#    [] -> matches one of the set of characters within []
#    [a-z] -> matches on of the range of characters in lowercase alphabet
#    [^abc] -> matches a character that is not a, b, or c
#   ?     -> Matches zero or one occurrence

1) '.' the dot character, or wildcard — this matches and returns any character in the string, except a new line.¶

In [135]:

text = "Wow!... We're @club_199."
print(re.search(r'.', text)) # gives the first match location
print(re.findall(r'.', text)) # gives all the matches

<re.Match object; span=(0, 1), match='W'>
['W', 'o', 'w', '!', '.', '.', '.', ' ', 'W', 'e', "'", 'r', 'e', ' ', '@', 'c', 'l', 'u', 'b', '_', '1', '9', '9', '.']

When we added a '\' before '.' it changed the meaning, now it will look for literal '.'¶

In [136]:

text = "Wow!... We're @club_199."
print(re.search('\.', text))  # gives the first match location
print(re.findall('\.', text))  # gives all the matches

<re.Match object; span=(4, 5), match='.'>
['.', '.', '.', '.']

2) '*' (zero or more) — Matches if the previous pattern appears zero or many times.¶

In [137]:

# It is little tricky to inerpret it, we will help you

# in below case, when we are looking for the* , we are actually asking to look for zero or more 'e' just after 'th'

# Trick to remember : stars can we zero on a cloudy day, and on a clear sky day we can have n number of stars

In [138]:

text = 'There was further spike in negative sentiments with the outbreak.'
re.findall('the*', text, flags = re.I)

Out[138]:

['The', 'the', 'th', 'the']

in below case, when we are looking for b* , we are actually asking to look for zero or more 'b' between 'a' and 'c' so it will find : ac, abc, abbbc, dabc but will NOT find abdc, since 'd' is spoiling the pattern

In [139]:

text = 'ac abc abbbc dabc abdc'
pattern = 'ab*c'
re.findall(pattern, text)

Out[139]:

['ac', 'abc', 'abbbc', 'abc']

3) '+' (once or more) — Matches if the previous pattern appears one or more times¶

it is same as *, with a contrast of not finding 'zero' occurence¶

In [140]:

text = 'There was further spike in negative sentiments with the outbreak.'
re.findall('the+', text, flags = re.I)

Out[140]:

['The', 'the', 'the']

In [141]:

text = 'ac abc abbbc dabc abdc'
pattern = 'ab+c'
re.findall(pattern, text)

Out[141]:

['abc', 'abbbc', 'abc']

In [142]:

text = '7.6% in 2020 now 2022/23 budget'
re.findall(r'\d+', text)

Out[142]:

['7', '6', '2020', '2022', '23']

4) ? (zero or once) — Matches if the previous pattern appears zero or one time only¶

In [143]:

text = 'There was further spike in negative sentiments with thee outbreak.'
re.findall('the?', text, flags = re.IGNORECASE)

Out[143]:

['The', 'the', 'th', 'the']

In [144]:

# You might be thinking why in the above example, why it looked for 
# 'the' in 'thee', while we said it looks for ONLY 0 and 1 char instance.
# Below example will make it more clear

In [145]:

text = 'ac abc abbbc dabc abdc'
pattern = 'ab?c'
re.findall(pattern, text)

Out[145]:

['ac', 'abc', 'abc']

?: — matches but doesn’t capture the group with Include ?:¶

In [146]:

text = 'On 23th total 42%b and on 17th total 35%'
re.findall(r'(\d+)(?:%)', text)

Out[146]:

['42', '35']

In [147]:

re.findall(r'(\w+)(?:l)', text)

Out[147]:

['tota', 'tota']

5). ^ — matches only the start of a text, and therefore ^ is written as the first character in the pattern.¶

In below example, '^5' means we are looking if the string is starting with '5'

In [148]:

#Starts with two digits
text = '500,000 units' '90,000 hp'
pattern = '^5'
re.findall(pattern, text)

Out[148]:

['5']

6) $ — matches the end of the string and is therefore written at the end of a pattern.¶

In below example, '^5' means we are looking if the string is ending with 'hp'

In [149]:

#Starts with two digits
text = '500,000 units' '90,000 hp'
pattern = 'hp$'
re.findall(pattern, text)

Out[149]:

['hp']

7) [ ] — matches ANY of the characters inside the square brackets.¶

In [150]:

text = "We're @club_007"
re.findall('[aeiou]', text)

Out[150]:

['e', 'e', 'u']

8) [^] Having a hat ^ character right after the opening square bracket negates the character set.¶

^ bascially works as a NOT operator

In [151]:

text = "We're @club_007"
re.findall('[^aeiou]', text)

Out[151]:

['W', "'", 'r', ' ', '@', 'c', 'l', 'b', '_', '0', '0', '7']

Extended Regular Expressions:¶

In [152]:

#   \d -> Any digit, equivalent to [0-9]
#   \D -> Any non-digit, equivalent to [^0-9]
#   \w -> Any alphanumeric, equivalent to [a-zA-Z0-9_]
#   \W -> Non-alphanumeric, equivalent to [^a-zA-Z0-9_]
#   \s -> Any whitespace character
#   \S -> Any nonwhitespace character
#   
#   () -> Scoping for extraction
#   {} -> Frequency for extraction
#   ?  -> Make a pattern non-greedy

1) \d — any digit, 0 to 9.¶

In [153]:

text = "We're @club_199"
re.findall('\d', text)

Out[153]:

['1', '9', '9']

2) \D — Any non-digit. It negates \d.¶

In [154]:

text = "We're @club_199"
print(re.findall('\D', text))

['W', 'e', "'", 'r', 'e', ' ', '@', 'c', 'l', 'u', 'b', '_']

3) '\w' (lowercase w) — Any alphanumeric character (letter, digit, or underscore).¶

In [155]:

text = "We're @club_199"
print(re.findall('\w', text))

['W', 'e', 'r', 'e', 'c', 'l', 'u', 'b', '_', '1', '9', '9']

4) '\W' (uppercase w) — anything that is not \w such as spaces, and special characters.¶

In [156]:

text = "We're @club_199"
print(re.findall('\W', text))

["'", ' ', '@']

5) '\s' (lowercase s) — A white space.¶

In [157]:

text = "We're @club_007"
print(re.findall('\s', text))

[' ']

6) '\S' (uppercase s) — Negates \s. Returns anything that is not a white space.¶

In [158]:

text = "We're @club_007"
print(re.findall('\S', text))

['W', 'e', "'", 'r', 'e', '@', 'c', 'l', 'u', 'b', '_', '0', '0', '7']

7) () — When you write a regex pattern, you can define groups using parentheses.¶

In [159]:

# Below examples makes the role of '.' more clear. 
# when we are looking for .o. we want to find '0' with at least char before that and one after that. 
# That char can be space as well.

In [160]:

text = 'Once Yvonne worked for vOn on a project'
pattern = '(.o.)'
re.findall(pattern, text,flags = re.IGNORECASE)

Out[160]:

['von', 'wor', 'for', 'vOn', ' on', 'roj']

8) {n} — Defines the exact number of times to match the previous character or pattern. e.g ‘d{3}’ matches ‘ddd’.¶

In [161]:

text = '7.6% in 2020 now 2022/23 budget, 99999'
re.findall('\d{4}', text)

Out[161]:

['2020', '2022', '9999']

If we give {m,n} numbers example {1,4}, it checks all 1,2,3 and 4 instances

In [162]:

text = '7.6% in 2020 now 2022/23 budget, 99999'
re.findall('\d{1,4}', text)

Out[162]:

['7', '6', '2020', '2022', '23', '9999', '9']

Regular expressions are handy when searching and cleaning text-based columns in Pandas.¶

Pandas contains several functions that support pattern-matching with regex, similar to the "re" library¶

Following are the three major functions that support regex :¶

1) .str.replace(pattern, repl, flag)¶

2) .str.findall(pattern, flags)¶

3) .str.contains(pattern)¶

In [163]:

import pandas as pd

df = pd.DataFrame(
        data= [
    ['@mlplus', 'We are excited to launch our new Product. #newcourse #machinelearning #python','ml.advance@mlplus.tech', '$1100,000*'],
    ['@kaustubhgupta', "@gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail",'kaustubh@edu.in', '0 paid'],
    ['@rajveer', 'Excited to lauch our new product! #newproduct #startup ','a.gupta9@twitter.me', 'less than $1000'],
    ['@jimmy', 'When will this coronavirus end? #thoughts','jimmy.987654321@facebook.pl', 'Amount of $8795'],
    ['@abhishek', 'I want to become data scientist. Any suggestions? @MLOps @dlml','abhishek@orkut.tech', '$700.50 total'],
    ['@ayushi', 'Missing foundational Course! @colllege','ayushi@university.org', '$40000 string created' ]
               ],
        
    columns=['username', 'tweet', 'email', 'Amount'])

pd.set_option("max_colwidth", None)
df

Out[163]:

	username	tweet	email	Amount
0	@mlplus	We are excited to launch our new Product. #newcourse #machinelearning #python	ml.advance@mlplus.tech	$1100,000*
1	@kaustubhgupta	@gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail	kaustubh@edu.in	0 paid
2	@rajveer	Excited to lauch our new product! #newproduct #startup	a.gupta9@twitter.me	less than $1000
3	@jimmy	When will this coronavirus end? #thoughts	jimmy.987654321@facebook.pl	Amount of $8795
4	@abhishek	I want to become data scientist. Any suggestions? @MLOps @dlml	abhishek@orkut.tech	$700.50 total
5	@ayushi	Missing foundational Course! @colllege	ayushi@university.org	$40000 string created

Removing hashtags using regex replace¶

In below example: \w -> Any alphanumeric, equivalent to [a-zA-Z0-9_] + -> matches one or more occourrence of the previous charecter

In [164]:

# using replace function with regex pattern, 
#       regex=True and 
#       value as empty string
df["tweets_without_#"] = df.tweet.replace(to_replace='#\w+', regex=True, value='')
df[["tweet", "tweets_without_#"]]

Out[164]:

	tweet	tweets_without_#
0	We are excited to launch our new Product. #newcourse #machinelearning #python	We are excited to launch our new Product.
1	@gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail	@gmail Gmail is down for 30 minutes. What's the matter?
2	Excited to lauch our new product! #newproduct #startup	Excited to lauch our new product!
3	When will this coronavirus end? #thoughts	When will this coronavirus end?
4	I want to become data scientist. Any suggestions? @MLOps @dlml	I want to become data scientist. Any suggestions? @MLOps @dlml
5	Missing foundational Course! @colllege	Missing foundational Course! @colllege

Replacing all domain suffixes with .edu using regex¶

In [165]:

df['email_with_edu'] = df.email.replace(to_replace=r'\.\w+', value='.edu', regex=True)
df[['email', 'email_with_edu']]

Out[165]:

	email	email_with_edu
0	ml.advance@mlplus.tech	ml.edu@mlplus.edu
1	kaustubh@edu.in	kaustubh@edu.edu
2	a.gupta9@twitter.me	a.edu@twitter.edu
3	jimmy.987654321@facebook.pl	jimmy.edu@facebook.edu
4	abhishek@orkut.tech	abhishek@orkut.edu
5	ayushi@university.org	ayushi@university.edu

You should notice that in first email ID, and few other as well, the email IDs also got changed. So let's improve it. so we are now asking for @ followed by alphanumeric

In [166]:

df['email_with_edu'] = df.email.replace(to_replace= r'\.\w+$', value='.edu', regex=True)
df[['email', 'email_with_edu']]

Out[166]:

	email	email_with_edu
0	ml.advance@mlplus.tech	ml.advance@mlplus.edu
1	kaustubh@edu.in	kaustubh@edu.edu
2	a.gupta9@twitter.me	a.gupta9@twitter.edu
3	jimmy.987654321@facebook.pl	jimmy.987654321@facebook.edu
4	abhishek@orkut.tech	abhishek@orkut.edu
5	ayushi@university.org	ayushi@university.edu

Replace all the vowels in tweets with $¶

In [167]:

df['email_new'] = df.email.replace(to_replace= "[aeiouAEIOU]", regex=True, value='%')
df[['email', 'email_new']]

Out[167]:

	email	email_new
0	ml.advance@mlplus.tech	ml.%dv%nc%@mlpl%s.t%ch
1	kaustubh@edu.in	k%%st%bh@%d%.%n
2	a.gupta9@twitter.me	%.g%pt%9@tw%tt%r.m%
3	jimmy.987654321@facebook.pl	j%mmy.987654321@f%c%b%%k.pl
4	abhishek@orkut.tech	%bh%sh%k@%rk%t.t%ch
5	ayushi@university.org	%y%sh%@%n%v%rs%ty.%rg

Replace the numbers in tweets with text 00number00 using replace function and regex expressions¶

In [168]:

df['Amount_mask'] = df.Amount.replace(to_replace="\d+", regex=True, value='00number00')
df[['Amount', 'Amount_mask']]

Out[168]:

	Amount	Amount_mask
0	$1100,000*	$00number00,00number00*
1	0 paid	00number00 paid
2	less than $1000	less than $00number00
3	Amount of $8795	Amount of $00number00
4	$700.50 total	$00number00.00number00 total
5	$40000 string created	$00number00 string created

Using regex comparison to extraxct numbers only¶

In [169]:

df['Amount_extracted'] = df['Amount'].str.replace(r'\D+', '', regex=True).astype('int')
df[['Amount', 'Amount_extracted']]

Out[169]:

	Amount	Amount_extracted
0	$1100,000*	1100000
1	0 paid	0
2	less than $1000	1000
3	Amount of $8795	8795
4	$700.50 total	70050
5	$40000 string created	40000

Extract all #Tag(s) only with findall and regular expression¶

In [170]:

df['tweet_words'] = df.tweet.str.findall(r"#(\w+)")
df[['tweet', 'tweet_words']]

Out[170]:

	tweet	tweet_words
0	We are excited to launch our new Product. #newcourse #machinelearning #python	[newcourse, machinelearning, python]
1	@gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail	[gmaildown, google, gmail]
2	Excited to lauch our new product! #newproduct #startup	[newproduct, startup]
3	When will this coronavirus end? #thoughts	[thoughts]
4	I want to become data scientist. Any suggestions? @MLOps @dlml	[]
5	Missing foundational Course! @colllege	[]

Extract all character only with findall and regular expression¶

In [171]:

df['tweet_words'] = df.tweet.str.findall('[a-zA-Z]+')
df[['tweet', 'tweet_words']]

Out[171]:

	tweet	tweet_words
0	We are excited to launch our new Product. #newcourse #machinelearning #python	[We, are, excited, to, launch, our, new, Product, newcourse, machinelearning, python]
1	@gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail	[gmail, Gmail, is, down, for, minutes, What, s, the, matter, gmaildown, google, gmail]
2	Excited to lauch our new product! #newproduct #startup	[Excited, to, lauch, our, new, product, newproduct, startup]
3	When will this coronavirus end? #thoughts	[When, will, this, coronavirus, end, thoughts]
4	I want to become data scientist. Any suggestions? @MLOps @dlml	[I, want, to, become, data, scientist, Any, suggestions, MLOps, dlml]
5	Missing foundational Course! @colllege	[Missing, foundational, Course, colllege]

Extract all non-character only with findall and regular expression¶

In [172]:

df['tweet_words'] = df.tweet.str.findall('[^a-zA-Z]+')
df[['tweet', 'tweet_words']]

Out[172]:

	tweet	tweet_words
0	We are excited to launch our new Product. #newcourse #machinelearning #python	[ , , , , , , , . #, #, #]
1	@gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail	[@, , , , , 30 , . , ', , , ? #, #, #]
2	Excited to lauch our new product! #newproduct #startup	[ , , , , , ! #, #, ]
3	When will this coronavirus end? #thoughts	[ , , , , ? #]
4	I want to become data scientist. Any suggestions? @MLOps @dlml	[ , , , , , . , , ? @, @]
5	Missing foundational Course! @colllege	[ , , ! @]

Verify a set of strings with-in text¶

In [173]:

df['email_new'] = df['email'].str.contains(r'@?.tech')
df[['email', 'email_new']]

Out[173]:

	email	email_new
0	ml.advance@mlplus.tech	True
1	kaustubh@edu.in	False
2	a.gupta9@twitter.me	False
3	jimmy.987654321@facebook.pl	False
4	abhishek@orkut.tech	True
5	ayushi@university.org	False

A simple project using Regex¶

Email verification with Regex¶

In [174]:

data = {'Customer_id': ['cus_01', 'cus_02', 'cus_03', 'cus_04'], 
        'email': ['name.surname@gmail.com','anonymous123@yahoo.co.ukk','anonymous123@.com','a..@gmail.com']
        } 
df = pd.DataFrame(data)
df

Out[174]:

	Customer_id	email
0	cus_01	name.surname@gmail.com
1	cus_02	anonymous123@yahoo.co.ukk
2	cus_03	anonymous123@.com
3	cus_04	a..@gmail.com

In [175]:

regex_regular = re.compile(r'\w+[._-]*\w*@\w+\.[\w\.]{2,7}')

In [176]:

# How to read above regular expression? Here you go:
# (r' 
# \w+              -> Atleast one or more Alphanumeric values should be there
# [._-]*           -->  Zero ore more . - or _ should be there
# \w*              -->  above two can follow by one or zero more alphanumeric
# @                --> Now after above strings an '@' should be there
# \w+              --> Atleast one or more Alphanumeric values should be there
# \.               --> Now a period should be there
# [\w\.]{2,7}      --> then should be followed by a Alphnumeric string of 2 to 7 length
# ')

re.fullmatch(pattern, string, flags). This method returns a match object only if the whole string matches the pattern¶

In [177]:

def isValid_regular(x):
    if re.fullmatch(regex_regular, x):
        y = 'Valid email'
    else:
        y = 'Invalid email'
    return y

In [178]:

df['Is_ Valid_regular'] = df['email'].apply(isValid_regular)
df

Out[178]:

	Customer_id	email	Is_ Valid_regular
0	cus_01	name.surname@gmail.com	Valid email
1	cus_02	anonymous123@yahoo.co.ukk	Valid email
2	cus_03	anonymous123@.com	Invalid email
3	cus_04	a..@gmail.com	Valid email

In [179]:

regex_gmail = re.compile(r'\w+[._-]*\w*@gmail.com')

In [180]:

# How to read above regular expression? Here you go:

# (r'
# \w+              -> Atleast one or more Alphanumeric values should be there
# [._-]*           -->  Zero ore more . - or _ should be there
# \w*              -->  above two can follow by one or zero more alphanumeric
# @gmail\.com      --> above all stuff followed by a '@gmail.com' will make it a valid gmail id, right?
# ')

In [181]:

def isValid_gmail(x):
    if re.fullmatch(regex_gmail, x):
        y = 'Valid email'
    else:
        y = 'Invalid email'
    return y

In [182]:

df['Is_ Valid_gmail'] = df['email'].apply(isValid_gmail)
df

Out[182]:

	Customer_id	email	Is_ Valid_regular	Is_ Valid_gmail
0	cus_01	name.surname@gmail.com	Valid email	Valid email
1	cus_02	anonymous123@yahoo.co.ukk	Valid email	Invalid email
2	cus_03	anonymous123@.com	Invalid email	Invalid email
3	cus_04	a..@gmail.com	Valid email	Valid email

Now we can break the string as well, using regular expressions¶

In [183]:

pattern = '(\w+[._-]*\w*)@(\w+)\.([\w\.]{2,7})'

In [184]:

text = 'rajat.sas@gmail.com'
y = re.findall(pattern, text,flags = re.I)[0][0]
y

Out[184]:

'rajat.sas'

In [185]:

def email_part1(x):
    try:
        pattern = '(\w+[._-]*\w*)@(\w+)\.([\w\.]{2,7})'
        y = re.findall(pattern, x,flags = re.I)[0][0]
    except:
        y = 'Something went wrong'
    return y

def email_domain(x):
    try:
        pattern = '(\w+[._-]*\w*)@(\w+)\.([\w\.]{2,7})'
        y = re.findall(pattern, x,flags = re.I)[0][1]
    except:
        y = 'Something went wrong'
    return y

print(email_part1(x = 'rajat.sas@gmail.com'), "\n", email_domain(x = 'rajat.sas@gmail.com'))

rajat.sas 
 gmail

In [186]:

df['email_ID'] = df['email'].apply(email_part1)
df['Email_Domain'] = df['email'].apply(email_domain)
df

Out[186]:

	Customer_id	email	Is_ Valid_regular	Is_ Valid_gmail	email_ID	Email_Domain
0	cus_01	name.surname@gmail.com	Valid email	Valid email	name.surname	gmail
1	cus_02	anonymous123@yahoo.co.ukk	Valid email	Invalid email	anonymous123	yahoo
2	cus_03	anonymous123@.com	Invalid email	Invalid email	Something went wrong	Something went wrong
3	cus_04	a..@gmail.com	Valid email	Valid email	a..	gmail

In [187]:

def email_part1(x):
    pattern = '(\w+[._-]*\w*)@(\w+)\.([\w\.]{2,7})' 
    y = re.findall(pattern, x,flags = re.I)[0][0]
    return y

email_part1(x = 'rajat.sas@gmail.com')

Out[187]:

'rajat.sas'

Don't worry!!! We will explain "apply" and 'def' stuff later in a separate blog, when we will cover UDF - User Defined Functions¶

Module 8 : Handling the Date Time¶

Unix epoch (00:00:00 UTC on 1 January 1970)¶

It is similar to base date is SAS

https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

In [188]:

import pandas as pd

epoch_time = [0, 1, 
              60, 3600, 
              7200, 3600*24]

# Pandas pd.to_datetime() is able to convert any valid date string to datetime
pd.to_datetime(epoch_time, unit = 's')

Out[188]:

DatetimeIndex(['1970-01-01 00:00:00', '1970-01-01 00:00:01',
               '1970-01-01 00:01:00', '1970-01-01 01:00:00',
               '1970-01-01 02:00:00', '1970-01-02 00:00:00'],
              dtype='datetime64[ns]', freq=None)

The [ns] means the nano second-based time format that specifies the precision of the DateTime object.

In [189]:

# 1 second = 1000 milliseconds
epoch_time = 1001
pd.to_datetime(epoch_time, unit = 'ms')

Out[189]:

Timestamp('1970-01-01 00:00:01.001000')

In [190]:

import pandas as pd

epoch_time = [0, 1, 
              60, 3600, 
              7200, 3600*24]

# Pandas pd.to_datetime() is able to convert any valid date string to datetime
pd.to_datetime(epoch_time, unit = 's')

Out[190]:

DatetimeIndex(['1970-01-01 00:00:00', '1970-01-01 00:00:01',
               '1970-01-01 00:01:00', '1970-01-01 01:00:00',
               '1970-01-01 02:00:00', '1970-01-02 00:00:00'],
              dtype='datetime64[ns]', freq=None)

The [ns] means the nano second-based time format that specifies the precision of the DateTime object.

In [191]:

# 1 second = 1000 milliseconds
epoch_time = 1001
pd.to_datetime(epoch_time, unit = 'ms')

Out[191]:

Timestamp('1970-01-01 00:00:01.001000')

The unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or float number. Default is ‘ns’¶

D - Date s - second ms - milli-second us - ns - nano-second

In [192]:

dob = [-1 ,0, 1, 364, 10000, 19400]
pd.to_datetime(dob, unit='D')

Out[192]:

DatetimeIndex(['1969-12-31', '1970-01-01', '1970-01-02', '1970-12-31',
               '1997-05-19', '2023-02-12'],
              dtype='datetime64[ns]', freq=None)

In [193]:

# Create a list of dates with different formats
doj = ['2023-03-25', 'Feb 3 2023','2023.02.03', '2023/02/03', '20230203', '02/03/2023', '03/02/2023']

In [194]:

pd.to_datetime(doj,infer_datetime_format=True)

Out[194]:

DatetimeIndex(['2023-03-25', '2023-02-03', '2023-02-03', '2023-02-03',
               '2023-02-03', '2023-02-03', '2023-03-02'],
              dtype='datetime64[ns]', freq=None)

please note - infer_datetime_format=True takes first non-null values as reference to best guess rest of the values

Customize date format¶

In [195]:

# The format parameter will instruct Pandas how to interpret your strings when converting them to DateTime objects.
pd.to_datetime('2022~05~15', format = '%Y~%m~%d')

Out[195]:

Timestamp('2022-05-15 00:00:00')

In [196]:

pd.to_datetime('15/05/2022', format = '%d/%m/%Y')

Out[196]:

Timestamp('2022-05-15 00:00:00')

DateTime Format Codes¶

In [197]:

#Invalid dates
#pd.to_datetime(['2-13-2023', 'Feb 13', '2023', '123'])

In [198]:

pd.to_datetime(['2-13-2023', 'Feb 13', '2023', '123'], errors='ignore')

Out[198]:

Index(['2-13-2023', 'Feb 13', '2023', '123'], dtype='object')

In [199]:

pd.to_datetime(['2-13-2023', 'Feb 13', '2023', '123'], errors='coerce')

Out[199]:

DatetimeIndex(['2023-02-13', 'NaT', '2023-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)

In [200]:

# importing data
import pandas as pd
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df = df[["user_id", "purchase_value", "signup_time", "dob"]]
df.head(2)

Out[200]:

	user_id	purchase_value	signup_time	dob
0	22058	65278	2/24/2015 22:55	2/22/1976
1	333320	96399	6/7/2015 20:39	1/2/1962

In [201]:

# Check the data-types 
df.dtypes

Out[201]:

user_id            int64
purchase_value     int64
signup_time       object
dob               object
dtype: object

Pandas pd.to_datetime() is able to parse any valid date string to datetime¶

In [202]:

df["signup_time"] = pd.to_datetime(df['signup_time'])
df["dob"] = pd.to_datetime(df['dob'] )
df.dtypes

Out[202]:

user_id                    int64
purchase_value             int64
signup_time       datetime64[ns]
dob               datetime64[ns]
dtype: object

In [203]:

df.head(2)

Out[203]:

	user_id	purchase_value	signup_time	dob
0	22058	65278	2015-02-24 22:55:00	1976-02-22
1	333320	96399	2015-06-07 20:39:00	1962-01-02

How to change the format of the date¶

In [204]:

df['dob_new'] = df.dob.dt.strftime('%d/%m/%Y')
df.head(2)

Out[204]:

	user_id	purchase_value	signup_time	dob	dob_new
0	22058	65278	2015-02-24 22:55:00	1976-02-22	22/02/1976
1	333320	96399	2015-06-07 20:39:00	1962-01-02	02/01/1962

In [205]:

df.dtypes

Out[205]:

user_id                    int64
purchase_value             int64
signup_time       datetime64[ns]
dob               datetime64[ns]
dob_new                   object
dtype: object

start and end date of the time series data¶

In [206]:

print("Minimum of dob :" , df.dob.min(), "\n"
      "Maximum of dob :", df.dob.max(), "\n"
      "Difference of min and max :", df.dob.max() - df.dob.min())

Minimum of dob : 1946-06-03 00:00:00 
Maximum of dob : 1997-10-14 00:00:00 
Difference of min and max : 18761 days 00:00:00

Get year, month, and day¶

In [207]:

df['signup_year'] = df['signup_time'].dt.year
df['signup_month'] = df['signup_time'].dt.month
df['signup_date'] = df['signup_time'].dt.month
df.head()

Out[207]:

	user_id	purchase_value	signup_time	dob	dob_new	signup_year	signup_month	signup_date
0	22058	65278	2015-02-24 22:55:00	1976-02-22	22/02/1976	2015	2	2
1	333320	96399	2015-06-07 20:39:00	1962-01-02	02/01/1962	2015	6	6
2	1359	57296	2015-01-01 18:52:00	1962-03-05	05/03/1962	2015	1	1
3	150084	43650	2015-04-28 21:13:00	1974-07-03	03/07/1974	2015	4	4
4	221365	45016	2015-07-21 07:09:00	1970-08-25	25/08/1970	2015	7	7

Get the week of year, the day of week and leap year¶

In [208]:

df['signup_week_of_year'] = df['signup_time'].dt.isocalendar().week
df['signup_day_of_week'] = df['signup_time'].dt.dayofweek
df['signup_is_leap_year'] = df['signup_time'].dt.is_leap_year
df.head()

Out[208]:

	user_id	purchase_value	signup_time	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	signup_day_of_week	signup_is_leap_year
0	22058	65278	2015-02-24 22:55:00	1976-02-22	22/02/1976	2015	2	2	9	1	False
1	333320	96399	2015-06-07 20:39:00	1962-01-02	02/01/1962	2015	6	6	23	6	False
2	1359	57296	2015-01-01 18:52:00	1962-03-05	05/03/1962	2015	1	1	1	3	False
3	150084	43650	2015-04-28 21:13:00	1974-07-03	03/07/1974	2015	4	4	18	1	False
4	221365	45016	2015-07-21 07:09:00	1970-08-25	25/08/1970	2015	7	7	30	1	False

In [209]:

df_1 = df.signup_time.apply(lambda x: x.day_name())
df_1.head()

Out[209]:

0     Tuesday
1      Sunday
2    Thursday
3     Tuesday
4     Tuesday
Name: signup_time, dtype: object

In [210]:

df_2 = df.signup_time.apply(lambda x: x.month_name())
df_2.head()

Out[210]:

0    February
1        June
2     January
3       April
4        July
Name: signup_time, dtype: object

In [211]:

# Monday is denoted by 0 and ends on Sunday which is denoted by 6

Here, Pandas dt.dayofweek returns the day of the week.¶

To replace the week-number with full name, we can create a mapping and pass it to map()¶

In [212]:

wd_mapping={
'Monday', 
'Tuesday', 
'Wednesday', 
'Thursday', 
'Friday',
'Saturday', 
'Sunday'
}

Get the week of year, the day of week and leap year¶

In [213]:

df['signup_week_of_year'] = df['signup_time'].dt.isocalendar().week
df['signup_day_of_week'] = df['signup_time'].dt.dayofweek
df['signup_is_leap_year'] = df['signup_time'].dt.is_leap_year
df.head()

Out[213]:

	user_id	purchase_value	signup_time	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	signup_day_of_week	signup_is_leap_year
0	22058	65278	2015-02-24 22:55:00	1976-02-22	22/02/1976	2015	2	2	9	1	False
1	333320	96399	2015-06-07 20:39:00	1962-01-02	02/01/1962	2015	6	6	23	6	False
2	1359	57296	2015-01-01 18:52:00	1962-03-05	05/03/1962	2015	1	1	1	3	False
3	150084	43650	2015-04-28 21:13:00	1974-07-03	03/07/1974	2015	4	4	18	1	False
4	221365	45016	2015-07-21 07:09:00	1970-08-25	25/08/1970	2015	7	7	30	1	False

In [214]:

df_1 = df.signup_time.apply(lambda x: x.day_name())
df_1.head()

Out[214]:

0     Tuesday
1      Sunday
2    Thursday
3     Tuesday
4     Tuesday
Name: signup_time, dtype: object

In [215]:

df_2 = df.signup_time.apply(lambda x: x.month_name())
df_2.head()

Out[215]:

0    February
1        June
2     January
3       April
4        July
Name: signup_time, dtype: object

In [216]:

# Monday is denoted by 0 and ends on Sunday which is denoted by 6

Here, Pandas dt.dayofweek returns the day of the week.¶

To replace the week-number with full name, we can create a mapping and pass it to map()¶

Doesn't it look very much like Proc Format of SAS?

In [217]:

wd_mapping={
'Monday', 
'Tuesday', 
'Wednesday', 
'Thursday', 
'Friday',
'Saturday', 
'Sunday'
}

In [218]:

df['signup_day_of_week']=df['signup_time'].dt.weekday.map(wd_mapping)
df.head()

Out[218]:

	user_id	purchase_value	signup_time	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	signup_day_of_week	signup_is_leap_year
0	22058	65278	2015-02-24 22:55:00	1976-02-22	22/02/1976	2015	2	2	9	Tuesday	False
1	333320	96399	2015-06-07 20:39:00	1962-01-02	02/01/1962	2015	6	6	23	Sunday	False
2	1359	57296	2015-01-01 18:52:00	1962-03-05	05/03/1962	2015	1	1	1	Thursday	False
3	150084	43650	2015-04-28 21:13:00	1974-07-03	03/07/1974	2015	4	4	18	Tuesday	False
4	221365	45016	2015-07-21 07:09:00	1970-08-25	25/08/1970	2015	7	7	30	Tuesday	False

Get the age from the date of birth with the help of pd.to_datetime('today')¶

In [219]:

today = pd.to_datetime('today')
df['age'] = today.year - df['dob'].dt.year
df.head()

Out[219]:

	user_id	purchase_value	signup_time	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	signup_day_of_week	signup_is_leap_year	age
0	22058	65278	2015-02-24 22:55:00	1976-02-22	22/02/1976	2015	2	2	9	Tuesday	False	47.0
1	333320	96399	2015-06-07 20:39:00	1962-01-02	02/01/1962	2015	6	6	23	Sunday	False	61.0
2	1359	57296	2015-01-01 18:52:00	1962-03-05	05/03/1962	2015	1	1	1	Thursday	False	61.0
3	150084	43650	2015-04-28 21:13:00	1974-07-03	03/07/1974	2015	4	4	18	Tuesday	False	49.0
4	221365	45016	2015-07-21 07:09:00	1970-08-25	25/08/1970	2015	7	7	30	Tuesday	False	53.0

Difference between two dates¶

In [220]:

#Age at the time of signup
df['age_at_signup'] = df['signup_time'].dt.year - df['dob'].dt.year
df.head()

Out[220]:

	user_id	purchase_value	signup_time	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	signup_day_of_week	signup_is_leap_year	age	age_at_signup
0	22058	65278	2015-02-24 22:55:00	1976-02-22	22/02/1976	2015	2	2	9	Tuesday	False	47.0	39.0
1	333320	96399	2015-06-07 20:39:00	1962-01-02	02/01/1962	2015	6	6	23	Sunday	False	61.0	53.0
2	1359	57296	2015-01-01 18:52:00	1962-03-05	05/03/1962	2015	1	1	1	Thursday	False	61.0	53.0
3	150084	43650	2015-04-28 21:13:00	1974-07-03	03/07/1974	2015	4	4	18	Tuesday	False	49.0	41.0
4	221365	45016	2015-07-21 07:09:00	1970-08-25	25/08/1970	2015	7	7	30	Tuesday	False	53.0	45.0

NumPy allows the subtraction of two datetime values, an operation which produces a number with a time unit.¶

The arguments for timedelta64 are a number, to represent the number of units, and a date/time unit, such as (D)ay, (M)onth, (Y)ear, (h)ours, (m)inutes, or (s)econds.¶

In [221]:

import numpy as np
#create new columns that contains date differences
df['age_at_signup_days'] = (df['signup_time'] - df['dob']) / np.timedelta64(1, 'D')
df['age_at_signup_weeks'] = (df['signup_time'] - df['dob']) / np.timedelta64(1, 'W')
df['age_at_signup_months'] = (df['signup_time'] - df['dob']) / np.timedelta64(1, 'M')
df['age_at_signup_years'] = (df['signup_time'] - df['dob']) / np.timedelta64(1, 'Y')
df.head()

Out[221]:

	user_id	purchase_value	signup_time	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	signup_day_of_week	signup_is_leap_year	age	age_at_signup	age_at_signup_days	age_at_signup_weeks	age_at_signup_months	age_at_signup_years
0	22058	65278	2015-02-24 22:55:00	1976-02-22	22/02/1976	2015	2	2	9	Tuesday	False	47.0	39.0	14247.954861	2035.422123	468.114905	39.009575
1	333320	96399	2015-06-07 20:39:00	1962-01-02	02/01/1962	2015	6	6	23	Sunday	False	61.0	53.0	19514.860417	2787.837202	641.158477	53.429873
2	1359	57296	2015-01-01 18:52:00	1962-03-05	05/03/1962	2015	1	1	1	Thursday	False	61.0	53.0	19295.786111	2756.540873	633.960816	52.830068
3	150084	43650	2015-04-28 21:13:00	1974-07-03	03/07/1974	2015	4	4	18	Tuesday	False	49.0	41.0	14909.884028	2129.983433	489.862511	40.821876
4	221365	45016	2015-07-21 07:09:00	1970-08-25	25/08/1970	2015	7	7	30	Tuesday	False	53.0	45.0	16401.297917	2343.042560	538.862742	44.905228

Get the datepart and timepart from datetime¶

In [222]:

df['signup_datepart'] = df['signup_time'].dt.date
df['signup_timepart'] = df['signup_time'].dt.time

df[['signup_time', 'signup_datepart', 'signup_timepart']] .head()

Out[222]:

	signup_time	signup_datepart	signup_timepart
0	2015-02-24 22:55:00	2015-02-24	22:55:00
1	2015-06-07 20:39:00	2015-06-07	20:39:00
2	2015-01-01 18:52:00	2015-01-01	18:52:00
3	2015-04-28 21:13:00	2015-04-28	21:13:00
4	2015-07-21 07:09:00	2015-07-21	07:09:00

set the date column as the index¶

In [223]:

df = df.set_index(['signup_time'])
df.head()
#Reset index to default
#df = df.reset_index(level=0)

Out[223]:

	user_id	purchase_value	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	signup_day_of_week	signup_is_leap_year	age	age_at_signup	age_at_signup_days	age_at_signup_weeks	age_at_signup_months	age_at_signup_years	signup_datepart	signup_timepart
signup_time
2015-02-24 22:55:00	22058	65278	1976-02-22	22/02/1976	2015	2	2	9	Tuesday	False	47.0	39.0	14247.954861	2035.422123	468.114905	39.009575	2015-02-24	22:55:00
2015-06-07 20:39:00	333320	96399	1962-01-02	02/01/1962	2015	6	6	23	Sunday	False	61.0	53.0	19514.860417	2787.837202	641.158477	53.429873	2015-06-07	20:39:00
2015-01-01 18:52:00	1359	57296	1962-03-05	05/03/1962	2015	1	1	1	Thursday	False	61.0	53.0	19295.786111	2756.540873	633.960816	52.830068	2015-01-01	18:52:00
2015-04-28 21:13:00	150084	43650	1974-07-03	03/07/1974	2015	4	4	18	Tuesday	False	49.0	41.0	14909.884028	2129.983433	489.862511	40.821876	2015-04-28	21:13:00
2015-07-21 07:09:00	221365	45016	1970-08-25	25/08/1970	2015	7	7	30	Tuesday	False	53.0	45.0	16401.297917	2343.042560	538.862742	44.905228	2015-07-21	07:09:00

In [224]:

df = df.reset_index(level=0)
df.head()

Out[224]:

	signup_time	user_id	purchase_value	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	signup_day_of_week	signup_is_leap_year	age	age_at_signup	age_at_signup_days	age_at_signup_weeks	age_at_signup_months	age_at_signup_years	signup_datepart	signup_timepart
0	2015-02-24 22:55:00	22058	65278	1976-02-22	22/02/1976	2015	2	2	9	Tuesday	False	47.0	39.0	14247.954861	2035.422123	468.114905	39.009575	2015-02-24	22:55:00
1	2015-06-07 20:39:00	333320	96399	1962-01-02	02/01/1962	2015	6	6	23	Sunday	False	61.0	53.0	19514.860417	2787.837202	641.158477	53.429873	2015-06-07	20:39:00
2	2015-01-01 18:52:00	1359	57296	1962-03-05	05/03/1962	2015	1	1	1	Thursday	False	61.0	53.0	19295.786111	2756.540873	633.960816	52.830068	2015-01-01	18:52:00
3	2015-04-28 21:13:00	150084	43650	1974-07-03	03/07/1974	2015	4	4	18	Tuesday	False	49.0	41.0	14909.884028	2129.983433	489.862511	40.821876	2015-04-28	21:13:00
4	2015-07-21 07:09:00	221365	45016	1970-08-25	25/08/1970	2015	7	7	30	Tuesday	False	53.0	45.0	16401.297917	2343.042560	538.862742	44.905228	2015-07-21	07:09:00

Subset data between two dates¶

In [225]:

df["signup_time_new"]  = pd.to_datetime(df['signup_time'])
df.dtypes

Out[225]:

signup_time             datetime64[ns]
user_id                          int64
purchase_value                   int64
dob                     datetime64[ns]
dob_new                         object
signup_year                      int64
signup_month                     int64
signup_date                      int64
signup_week_of_year             UInt32
signup_day_of_week              object
signup_is_leap_year               bool
age                            float64
age_at_signup                  float64
age_at_signup_days             float64
age_at_signup_weeks            float64
age_at_signup_months           float64
age_at_signup_years            float64
signup_datepart                 object
signup_timepart                 object
signup_time_new         datetime64[ns]
dtype: object

In [226]:

subset = df[(df['signup_time_new'] >= '2015-02-15') & (df['signup_time_new']  <= '2015-02-28')]
subset.head()

Out[226]:

	signup_time	user_id	purchase_value	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	signup_day_of_week	signup_is_leap_year	age	age_at_signup	age_at_signup_days	age_at_signup_weeks	age_at_signup_months	age_at_signup_years	signup_datepart	signup_timepart	signup_time_new
0	2015-02-24 22:55:00	22058	65278	1976-02-22	22/02/1976	2015	2	2	9	Tuesday	False	47.0	39.0	14247.954861	2035.422123	468.114905	39.009575	2015-02-24	22:55:00	2015-02-24 22:55:00
14	2015-02-16 02:50:00	82931	71616	1991-10-21	21/10/1991	2015	2	2	8	Monday	False	32.0	24.0	8519.118056	1217.016865	279.894636	23.324553	2015-02-16	02:50:00	2015-02-16 02:50:00
43	2015-02-19 12:31:00	250432	59053	1975-04-06	06/04/1975	2015	2	2	8	Thursday	False	48.0	40.0	14564.521528	2080.645933	478.515667	39.876306	2015-02-19	12:31:00	2015-02-19 12:31:00
61	2015-02-20 18:29:00	221091	86357	NaT	NaN	2015	2	2	8	Friday	False	NaN	NaN	NaN	NaN	NaN	NaN	2015-02-20	18:29:00	2015-02-20 18:29:00
102	2015-02-16 03:40:00	259616	98184	1992-07-09	09/07/1992	2015	2	2	8	Monday	False	31.0	23.0	8257.152778	1179.593254	271.287797	22.607316	2015-02-16	03:40:00	2015-02-16 03:40:00

Aggregate over indexed datetime column¶

In [227]:

# Get the total of purchase_value in 2015
df[df['signup_year'] == 2015].purchase_value.sum()

Out[227]:

50941499

In [228]:

df.groupby('signup_day_of_week')['purchase_value'].sum()

Out[228]:

signup_day_of_week
Friday       8514166
Monday       7088526
Saturday     6432936
Sunday       8482116
Thursday     6621175
Tuesday      6403681
Wednesday    7398899
Name: purchase_value, dtype: int64

The date-time default format is “YYYY-MM-DD”¶

.dt.strftime() can change the date format in python.¶

In [229]:

df.dob = df.dob.dt.strftime('%d/%m/%Y')

In [230]:

df.dtypes

Out[230]:

signup_time             datetime64[ns]
user_id                          int64
purchase_value                   int64
dob                             object
dob_new                         object
signup_year                      int64
signup_month                     int64
signup_date                      int64
signup_week_of_year             UInt32
signup_day_of_week              object
signup_is_leap_year               bool
age                            float64
age_at_signup                  float64
age_at_signup_days             float64
age_at_signup_weeks            float64
age_at_signup_months           float64
age_at_signup_years            float64
signup_datepart                 object
signup_timepart                 object
signup_time_new         datetime64[ns]
dtype: object

In [231]:

df.head(2)

Out[231]:

	signup_time	user_id	purchase_value	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	signup_day_of_week	signup_is_leap_year	age	age_at_signup	age_at_signup_days	age_at_signup_weeks	age_at_signup_months	age_at_signup_years	signup_datepart	signup_timepart	signup_time_new
0	2015-02-24 22:55:00	22058	65278	22/02/1976	22/02/1976	2015	2	2	9	Tuesday	False	47.0	39.0	14247.954861	2035.422123	468.114905	39.009575	2015-02-24	22:55:00	2015-02-24 22:55:00
1	2015-06-07 20:39:00	333320	96399	02/01/1962	02/01/1962	2015	6	6	23	Sunday	False	61.0	53.0	19514.860417	2787.837202	641.158477	53.429873	2015-06-07	20:39:00	2015-06-07 20:39:00

In [232]:

#Reset index to default
df = df.reset_index(level=0)
df.head()

Out[232]:

	index	signup_time	user_id	purchase_value	dob	dob_new	signup_year	signup_month	signup_date	signup_week_of_year	...	signup_is_leap_year	age	age_at_signup	age_at_signup_days	age_at_signup_weeks	age_at_signup_months	age_at_signup_years	signup_datepart	signup_timepart	signup_time_new
0	0	2015-02-24 22:55:00	22058	65278	22/02/1976	22/02/1976	2015	2	2	9	...	False	47.0	39.0	14247.954861	2035.422123	468.114905	39.009575	2015-02-24	22:55:00	2015-02-24 22:55:00
1	1	2015-06-07 20:39:00	333320	96399	02/01/1962	02/01/1962	2015	6	6	23	...	False	61.0	53.0	19514.860417	2787.837202	641.158477	53.429873	2015-06-07	20:39:00	2015-06-07 20:39:00
2	2	2015-01-01 18:52:00	1359	57296	05/03/1962	05/03/1962	2015	1	1	1	...	False	61.0	53.0	19295.786111	2756.540873	633.960816	52.830068	2015-01-01	18:52:00	2015-01-01 18:52:00
3	3	2015-04-28 21:13:00	150084	43650	03/07/1974	03/07/1974	2015	4	4	18	...	False	49.0	41.0	14909.884028	2129.983433	489.862511	40.821876	2015-04-28	21:13:00	2015-04-28 21:13:00
4	4	2015-07-21 07:09:00	221365	45016	25/08/1970	25/08/1970	2015	7	7	30	...	False	53.0	45.0	16401.297917	2343.042560	538.862742	44.905228	2015-07-21	07:09:00	2015-07-21 07:09:00

5 rows × 21 columns

DataFrame.DateOffset()¶

In [233]:

# Adding or subtracting days
df['signup_time_end']=df['signup_time']+pd.DateOffset(days=365)
df[['signup_time' , 'signup_time_end']].head()

Out[233]:

	signup_time	signup_time_end
0	2015-02-24 22:55:00	2016-02-24 22:55:00
1	2015-06-07 20:39:00	2016-06-06 20:39:00
2	2015-01-01 18:52:00	2016-01-01 18:52:00
3	2015-04-28 21:13:00	2016-04-27 21:13:00
4	2015-07-21 07:09:00	2016-07-20 07:09:00

In [234]:

# Adding Year(s)
df['signup_time_end_2']=df['signup_time']+pd.DateOffset(years=2)
df[['signup_time' , 'signup_time_end_2']].head()

Out[234]:

	signup_time	signup_time_end_2
0	2015-02-24 22:55:00	2017-02-24 22:55:00
1	2015-06-07 20:39:00	2017-06-07 20:39:00
2	2015-01-01 18:52:00	2017-01-01 18:52:00
3	2015-04-28 21:13:00	2017-04-28 21:13:00
4	2015-07-21 07:09:00	2017-07-21 07:09:00

Start/End of previous quarter/month/year¶

Let's start learning few more basics, which are used quite often during automation

In [235]:

import calendar
from datetime import datetime, date
from dateutil.relativedelta import relativedelta

In [236]:

first_day, last_day = calendar.monthrange(2022,2)
print(first_day,last_day)

1 28

In [237]:

sample_date  = date.today()
sample_date

Out[237]:

datetime.date(2023, 4, 6)

In [238]:

# you can use today or a refernece date
CMB = sample_date + relativedelta(day = 1)
CMB

Out[238]:

datetime.date(2023, 4, 1)

In [239]:

CME = sample_date + relativedelta(day = 31)
CME

Out[239]:

datetime.date(2023, 4, 30)

In [240]:

# Please note: use'months' not 'month'
PMB = (sample_date + relativedelta(months = -1))+ relativedelta(day = 1)
PMB

Out[240]:

datetime.date(2023, 3, 1)

In [241]:

PME = (sample_date + relativedelta(months = -1))+ relativedelta(day = 31)
PME

Out[241]:

datetime.date(2023, 3, 31)

In [242]:

# Please note: use'years' not 'year'
PYB  = date((sample_date + relativedelta(years = -1)).year,1,1)
PYB

Out[242]:

datetime.date(2022, 1, 1)

In [243]:

PYE = date((sample_date + relativedelta(years = -1)).year,12,31)
PYE

Out[243]:

datetime.date(2022, 12, 31)

#Supoose you want to move the date by n quarters We haven't found a user friendly function to have that yet, so we developed for ourself

In [244]:

def PQB(ref_date,n):
    PREVIOUS_QTR_STG1 = (sample_date + relativedelta(months = 1*n*3))+ relativedelta(day = 1)
    PREVIOUS_QTR_STG1_Month = PREVIOUS_QTR_STG1.month
    PREVIOUS_QTR_STG1_year = PREVIOUS_QTR_STG1.year
    
    if PREVIOUS_QTR_STG1_Month in (1,2,3):
        x = 1
    elif PREVIOUS_QTR_STG1_Month in (4,5,6):
        x = 4
    elif PREVIOUS_QTR_STG1_Month in (7,8,9):
        x = 7
    elif PREVIOUS_QTR_STG1_Month in (10,11,12):
        x = 10    
        
    PREVIOUS_QTR_STG2 = date(PREVIOUS_QTR_STG1_year,x,1)
    return PREVIOUS_QTR_STG2

In [245]:

n = -2
sample_date = date(2023,3,10)
PQB(sample_date,n)

Out[245]:

datetime.date(2022, 7, 1)

In [246]:

def PQE(ref_date,n):
    PREVIOUS_QTR_STG1 = (sample_date + relativedelta(months = 1*n*3))+ relativedelta(day = 1)
    PREVIOUS_QTR_STG1_Month = PREVIOUS_QTR_STG1.month
    PREVIOUS_QTR_STG1_year = PREVIOUS_QTR_STG1.year
    
    if PREVIOUS_QTR_STG1_Month in (1,2,3):
        x = 3
    elif PREVIOUS_QTR_STG1_Month in (4,5,6):
        x = 6
    elif PREVIOUS_QTR_STG1_Month in (7,8,9):
        x = 9
    elif PREVIOUS_QTR_STG1_Month in (10,11,12):
        x = 12    
    PREVIOUS_QTR_STG2 = date(PREVIOUS_QTR_STG1_year,x,1)+ relativedelta(day = 31)
    return PREVIOUS_QTR_STG2

In [247]:

n = -2
sample_date = date(2023,3,10)
PQE(sample_date,n)

Out[247]:

datetime.date(2022, 9, 30)

Let's try to change the time-zones of the date fields¶

In [248]:

import pandas as pd
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv", usecols = ['user_id','signup_time'])
df["signup_time"] = pd.to_datetime(df['signup_time'])
df.head(5)

Out[248]:

	user_id	signup_time
0	22058	2015-02-24 22:55:00
1	333320	2015-06-07 20:39:00
2	1359	2015-01-01 18:52:00
3	150084	2015-04-28 21:13:00
4	221365	2015-07-21 07:09:00

In [249]:

# set the current time as of Asia
df['EST_timezone'] = df.signup_time.dt.tz_localize('EST')
df['MST_timezone'] = df.EST_timezone.dt.tz_convert('MST')
df['IST_timezone'] = df.EST_timezone.dt.tz_convert('Asia/Calcutta')
df.head(5)

Out[249]:

	user_id	signup_time	EST_timezone	MST_timezone	IST_timezone
0	22058	2015-02-24 22:55:00	2015-02-24 22:55:00-05:00	2015-02-24 20:55:00-07:00	2015-02-25 09:25:00+05:30
1	333320	2015-06-07 20:39:00	2015-06-07 20:39:00-05:00	2015-06-07 18:39:00-07:00	2015-06-08 07:09:00+05:30
2	1359	2015-01-01 18:52:00	2015-01-01 18:52:00-05:00	2015-01-01 16:52:00-07:00	2015-01-02 05:22:00+05:30
3	150084	2015-04-28 21:13:00	2015-04-28 21:13:00-05:00	2015-04-28 19:13:00-07:00	2015-04-29 07:43:00+05:30
4	221365	2015-07-21 07:09:00	2015-07-21 07:09:00-05:00	2015-07-21 05:09:00-07:00	2015-07-21 17:39:00+05:30

Module 9 : Arithmetic Operations on Pandas DataFrame¶

In [250]:

#importing a csv file with pd.read_csv
import pandas as pd
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head(2)

Out[250]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	732758368.8	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	350311387.9	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com

Create a New Columns with Arithmetic Operators : + - * /¶

In [251]:

discount = 0.21
df['discount'] = df['purchase_value'] * discount
df['actual_purchase_value'] = df['purchase_value'] - df['discount']
df[['purchase_value', 'discount', 'actual_purchase_value']].head()

Out[251]:

	purchase_value	discount	actual_purchase_value
0	65278	13708.38	51569.62
1	96399	20243.79	76155.21
2	57296	12032.16	45263.84
3	43650	9166.50	34483.50
4	45016	9453.36	35562.64

Create lags and leads of a column in a pandas DataFrame¶

In [252]:

#importing a simple csv file with pd.read_csv
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv",usecols = ['user_id','purchase_value','purchase_time'])
df.sort_values(['user_id'],ascending= [True],inplace=True,ignore_index=True)
df.head()

Out[252]:

	user_id	purchase_time	purchase_value
0	170	8/26/2015 21:25	39879
1	753	5/25/2015 21:28	67119
2	1359	1/1/2015 18:52	57296
3	1361	7/29/2015 0:53	73482
4	2048	4/12/2015 12:23	30576

Lag shifts a column down by a certain number with .shift()¶

In [253]:

df['lag_1_user_id'] = df['user_id'].shift(1)
df['lag_2_user_id'] = df['user_id'].shift(2)
df[['user_id', 'lag_1_user_id', 'lag_2_user_id']].head()

Out[253]:

	user_id	lag_1_user_id	lag_2_user_id
0	170	NaN	NaN
1	753	170.0	NaN
2	1359	753.0	170.0
3	1361	1359.0	753.0
4	2048	1361.0	1359.0

lead shifts a column up by a certain number with .shift()¶

In [254]:

df['lead_1_user_id'] = df['user_id'].shift(-1)
df['lead_2_user_id'] = df['user_id'].shift(-2)
df[['user_id', 'lead_1_user_id', 'lead_2_user_id']].head()

Out[254]:

	user_id	lead_1_user_id	lead_2_user_id
0	170	753.0	1359.0
1	753	1359.0	1361.0
2	1359	1361.0	2048.0
3	1361	2048.0	2462.0
4	2048	2462.0	2474.0

Ceil and floor values¶

Ceil - It rounds up the value to the nearest greater integer.¶

Floor - It rounds up the value to the nearest smaller integer.¶

with numpy library : .apply(np.ceil) AND .apply(np.floor)¶

In [255]:

import numpy as np
df['purchase_value_USD'] = df['purchase_value']/82
df['purchase_value_USD_Ceil'] = df['purchase_value'].apply(np.ceil)
df['purchase_value_USD_Floor'] = df['purchase_value'].apply(np.floor)
df[['purchase_value_USD', 'purchase_value_USD_Ceil', 'purchase_value_USD_Floor']].head()

Out[255]:

	purchase_value_USD	purchase_value_USD_Ceil	purchase_value_USD_Floor
0	486.329268	39879.0	39879.0
1	818.524390	67119.0	67119.0
2	698.731707	57296.0	57296.0
3	896.121951	73482.0	73482.0
4	372.878049	30576.0	30576.0

with math library - .apply(math.ceil) AND .apply(math.floor)¶

In [256]:

import math
df['purchase_value_USD'] = df['purchase_value']/82
df['purchase_value_USD_Ceil'] = df['purchase_value_USD'].apply(math.ceil)
df['purchase_value_USD_Floor'] = df['purchase_value_USD'].apply(math.floor)

df[["purchase_value_USD", "purchase_value_USD_Ceil", "purchase_value_USD_Floor"]].head()

Out[256]:

	purchase_value_USD	purchase_value_USD_Ceil	purchase_value_USD_Floor
0	486.329268	487	486
1	818.524390	819	818
2	698.731707	699	698
3	896.121951	897	896
4	372.878049	373	372

Rounds off the value according to their decimal digits¶

round() as a function¶

In [257]:

df['rounded_off'] = round(df['purchase_value_USD'])
df['rounded_off_single_decimal'] = round(df['purchase_value_USD'],1)
df['rounded_off_double_decimal'] = round(df['purchase_value_USD'],2)

df[['purchase_value_USD', 'rounded_off', 'rounded_off_single_decimal', 'rounded_off_double_decimal']].head()

Out[257]:

	purchase_value_USD	rounded_off	rounded_off_single_decimal	rounded_off_double_decimal
0	486.329268	486.0	486.3	486.33
1	818.524390	819.0	818.5	818.52
2	698.731707	699.0	698.7	698.73
3	896.121951	896.0	896.1	896.12
4	372.878049	373.0	372.9	372.88

.round(decimal=) as a method¶

In [258]:

df['rounded_off_single_decimal'] = df['purchase_value_USD'].round(decimals=1)
df['rounded_off_double_decimal'] = df['purchase_value_USD'].round(decimals=2)
df[['purchase_value_USD', 'rounded_off_double_decimal', 'rounded_off_double_decimal']].head()

Out[258]:

	purchase_value_USD	rounded_off_double_decimal	rounded_off_double_decimal
0	486.329268	486.33	486.33
1	818.524390	818.52	818.52
2	698.731707	698.73	698.73
3	896.121951	896.12	896.12
4	372.878049	372.88	372.88

.mod() function is used to calculate modulo of the series when divided by a given scalar value¶

In simpler terms, If we divide a column with certain number, let's suppose 100, then what's the remainder?¶

In [259]:

df["Remainder"] = df['purchase_value_USD'].mod(100)
df[["purchase_value_USD", "Remainder"]].head()

Out[259]:

	purchase_value_USD	Remainder
0	486.329268	86.329268
1	818.524390	18.524390
2	698.731707	98.731707
3	896.121951	96.121951
4	372.878049	72.878049

log() Functions to Calculate Logarithm¶

math.log - log base e¶

math.log2 - log base 2¶

math.log10 - log base 10¶

In [260]:

df['log_purchase_value'] = df['purchase_value'].apply(math.log)
df['log2_purchase_value'] = df['purchase_value'].apply(math.log2)
df['log10_purchase_value'] = df['purchase_value'].apply(math.log10)
df[['purchase_value', 'log_purchase_value', 'log2_purchase_value', 'log10_purchase_value']].head()

Out[260]:

	purchase_value	log_purchase_value	log2_purchase_value	log10_purchase_value
0	39879	10.593605	15.283342	4.600744
1	67119	11.114222	16.034434	4.826845
2	57296	10.955986	15.806147	4.758124
3	73482	11.204796	16.165103	4.866181
4	30576	10.327971	14.900112	4.485381

Module 10 : query() Method for Pandas DataFrame¶

It allows you to query the DataFrame¶

Takes a query expression as a string parameter, which has to evaluate to either True of False.¶

Syntax : dataframe.query(expr, inplace)¶

(inplace = True) : Make the changes in the original DataFrame (inplace = False) : It specifies if the query() method should leave the original DataFrame untouched. This is default

In [261]:

#importing a csv file with pd.read_csv
import pandas as pd
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head(3)

Out[261]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com

In [262]:

df.query("category == 'electronics'").head(2)

Out[262]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com
5	159135	5/21/2015 6:03	7/9/2015 8:05	37124	ALEYXFXINSXLZ	Ads	Chrome	M	2.809315e+09	0	electronics	4/8/1997	Lee Z Chyenz	lee#chyenz@gmail.com

Multiple conditions with and , or¶

In [263]:

df.query("category == 'electronics' and purchase_value >= 50000").head()

Out[263]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com
6	50116	8/1/2015 22:40	8/27/2015 3:37	75000	IWKVZHJOCLPUR	Ads	Chrome	F	3.987484e+09	0	electronics	5/1/1996	Dhanya J Skariachano	dhanya.skariachano@gmail.com
31	396746	8/5/2015 23:50	8/14/2015 12:08	83290	JUWCOHMHRBDCL	SEO	IE	M	2.937614e+09	0	electronics	8/16/1959	Steve U Slatery	steve.slatery@gmail.com
32	243817	2/13/2015 18:16	4/23/2015 17:44	64864	YIIEFBZUGEXNR	Direct	Chrome	F	2.521871e+09	0	electronics	2/8/1983	ita D Bosed	ita.bosed@yahoo.com
42	55002	6/9/2015 0:32	9/17/2015 13:45	85743	QLHWTRKJZWMAW	SEO	FireFox	M	3.184465e+09	0	electronics	8/1/1996	Jessica I Toonkelx	jessica.toonkelx@sutletgroup.com

In [264]:

df.query("category == 'electronics' or purchase_value >= 50000").head()

Out[264]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com
5	159135	5/21/2015 6:03	7/9/2015 8:05	37124	ALEYXFXINSXLZ	Ads	Chrome	M	2.809315e+09	0	electronics	4/8/1997	Lee Z Chyenz	lee#chyenz@gmail.com
6	50116	8/1/2015 22:40	8/27/2015 3:37	75000	IWKVZHJOCLPUR	Ads	Chrome	F	3.987484e+09	0	electronics	5/1/1996	Dhanya J Skariachano	dhanya.skariachano@gmail.com

In [265]:

df.query("category in ['electronics','apparels']").head()

Out[265]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com
5	159135	5/21/2015 6:03	7/9/2015 8:05	37124	ALEYXFXINSXLZ	Ads	Chrome	M	2.809315e+09	0	electronics	4/8/1997	Lee Z Chyenz	lee#chyenz@gmail.com
6	50116	8/1/2015 22:40	8/27/2015 3:37	75000	IWKVZHJOCLPUR	Ads	Chrome	F	3.987484e+09	0	electronics	5/1/1996	Dhanya J Skariachano	dhanya.skariachano@gmail.com
8	159045	4/21/2015 23:38	6/2/2015 14:01	66311	ILXYDOZIHOOHT	SEO	IE	F	3.719094e+09	0	apparels	1/22/1972	Andreas B Cremerf	andreas.cremerf@sutletgroup.com

In [266]:

# Query between a range of values
df.query("purchase_value >= 80000 and purchase_value <= 90000").head(2)

Out[266]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
13	299320	3/3/2015 19:17	4/5/2015 12:32	84732	RMKQNVEWGTWPC	Direct	Safari	M	3.178510e+09	0	cosmetics	10/15/1977	Rachelle K Younglaic	rachelle.younglaic@tntech.edu
15	31383	2/1/2015 1:06	3/24/2015 10:17	84796	UNUAVQXNWFBMO	SEO	Safari	F	9.957328e+08	0	home_essentials	8/26/1991	Nate F Raymondj	nate.raymondj@gmail.com

Substr method alternative¶

check the value of the second lowercased character of a string.¶

In [267]:

df.query('name.str[1] == "a"').head(2)
# df.query('name.str[1].str.lower() == "a"').head(2)

Out[267]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com

.str.slice() to check the portion with-in a text column¶

In [268]:

df.query('name.str.slice(1,4).str.lower() == "arr"').head(2)

Out[268]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com
62	222633	4/18/2015 14:33	6/3/2015 0:43	35386	CYULQPBDHLIBY	SEO	FireFox	M	1.438733e+09	0	home_essentials	8/4/1974	Warren B Strobels	warren.strobels@yahoo.com

List of Operators which can be used : '==' , '!=', '>' , '>=', '<' , '<=', isna(), in(), notin()¶

Module 11 : Running SQL Queries on Pandas DataFrame - PANDASQL¶

We can query pandas DataFrame directly using SQL queries¶

In [269]:

#!pip install pandasql

In [270]:

import pandas as pd
import numpy as np
import pandasql as ps

In [271]:

df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan, 5230],
                   [1234, 'Customer A', np.nan, '333 Street',8345],
                   [1233, 'Customer B', '444 Street', '333 Street', 3456],
                   [1233, 'Customer B', '444 Street', '666 Street', 4578]], 
        columns =  ['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Payment'])

use the 'sqldf' function of pandasql('ps') library - ps.sqldf¶

In [272]:

ps.sqldf('''
            select * from df limit 2
        ''')

Out[272]:

	ID	Customer	Billing Address	Shipping Address	Payment
0	1234	Customer A	123 Street	None	5230
1	1234	Customer A	None	333 Street	8345

In [273]:

ps.sqldf('''select distinct ID from df''')

Out[273]:

	ID
0	1234
1	1233

SQL GROUP BY¶

In [274]:

ps.sqldf('''
            select ID, sum(Payment)  
            from df 
            group by ID
            
            ''')

Out[274]:

	ID	sum(Payment)
0	1233	8034
1	1234	13575

SQL Order BY¶

In [275]:

ps.sqldf('''select ID, sum(Payment) as Total_Payment 
            from df 
            group by ID
            order by Total_Payment ''')

Out[275]:

	ID	Total_Payment
0	1233	8034
1	1234	13575

SQL having Clause¶

In [276]:

ps.sqldf('''
select ID, sum(Payment) as Total_Payment 
            from df 
            group by ID
            having Total_Payment > 10000
            
            ''')

Out[276]:

	ID	Total_Payment
0	1234	13575

Create a DataFrmae from sql query¶

In [277]:

df_id = ps.sqldf('''select distinct id, Customer from df''')
df_id

Out[277]:

	ID	Customer
0	1234	Customer A
1	1233	Customer B

Module 11 : Python to Oracle, SQL Server, MySQL, and PostgreSQL¶

1) Connection to Oracle and fetch the first 10 rows from the famous employees table of HR schema¶

Note: how Oracle uses ROWNUM, instead of TOP, or LIMIT

In [278]:

#install cx_Oracle library through pip
#! pip install cx_Oracle

In [279]:

# import cx_Oracle
# dsn = cx_Oracle.makedsn(
#     'localhost', 
#     '1521', 
#     service_name='orcl'
# )
# conn = cx_Oracle.connect(
#     user='username', 
#     password='password', 
#     dsn=dsn
# )
# c = conn.cursor()
# c.execute('SELECT * FROM employees WHERE ROWNUM <= 10')
# for row in c: print(row)
# conn.close()

In [280]:

# Note: The makedsn() function accepts the database hostname, the port number, and the service name.
#The cx_Oracle.makedsn() function can be used to construct a connect descriptor string for 
# the data source name parameter 'dsn' of cx_Oracle.connect()

connect(): Now Establish a connection between the Python program and Oracle database by using connect() function.¶

cursor(): To execute a SQL query and to provide results some special object is required that is nothing but cursor() object.¶

cursor.execute('sqlquery') - - - -> to execute a SQL query.¶

commit(): For DML(Data Manipulation Language) queries that comprise operations like update, insert, delete. We need to commit() then only the result reflects in the database.¶

close(): After all done it is mandatory to close all operations.¶

2) SQL Server Connection¶

In [281]:

#install cx_Oracle library through pip
#! pip install pyodbc

In [282]:

# import pyodbc
# conn = pyodbc.connect(
#     'Driver={SQL Server};'
#     'Server=DESKTOP-TLF7IMQ\SQLEXPRESS;'
#     'Database=retail;'
#     'Trusted_Connection=yes;'
# )
# cursor = conn.cursor()
# cursor.execute('SELECT TOP 5 * FROM databasename.tablename')
# for row in cursor: print(row)
# conn.close()

3) MySQL Connection¶

In [283]:

# Python installed through Anaconda already come with necessary libraries built-in, 
# so there’s no need for additional installations. 
# To connect from Python, use mysql library:

In [284]:

# import mysql.connector
# conn = mysql.connector.connect(
#     host='localhost',
#     user='root',
#     passwd='1234'
# )
# cursor = conn.cursor()
# cursor.execute('SELECT * FROM sakila.actor LIMIT 5')
# for row in cursor: print(row)
# conn.close()

In [285]:

#Note: By default, MySQL instance comes with sakila database

4) PostgreSQL Connection¶

To connect to PostgreSQL we will need to install psycopg2 library:

In [286]:

#!pip install psycopg2

In [287]:

# The database instance contains the earlier mentioned dvdrental database, 
# so let's connect to it, once again to the actor table:

In [288]:

# import psycopg2
# conn = psycopg2.connect(
#     user='postgres',
#     password='1234',
#     host='127.0.0.1',
#     port='5432',
#     database='dvdrental'
# )
# cursor = conn.cursor()
# cursor.execute('SELECT * FROM actor LIMIT 10')
# for row in cursor: print(row)
# conn.close()

Module 12 : User-Defined Functions (UDF) and Lambda Functions¶

Defining a function¶

it is very much like defining Macros in SAS¶

In [289]:

def multiply():
    a = 2
    b = 3
    x = a * b
    return x

Calling a function¶

In [290]:

multiply()

Out[290]:

UDF with Parameters/Arguments¶

Example1:¶

In [291]:

def multiply(a, b):
    return(a*b)

In [292]:

multiply(125, 7)

Out[292]:

Example2: A funtion to get (x+y)^2¶

In [293]:

def nsquare(x, y):
    a = (x+y)**2
    return a

In [294]:

nsquare(2,3)

Out[294]:

Example3: A function to calculate the area of a circle¶

In [295]:

def calculate_area(radius):
    pi = 22/7
    area = pi * radius * radius
    return area

In [296]:

calculate_area(9)

Out[296]:

254.57142857142856

Example 4: A function to check if a number is even¶

In [297]:

def is_even(number):
    if number%2 == 0:
        return True
    else:
        return False

In [298]:

is_even(34)

Out[298]:

True

Example 5: A function to calculate the sum of a list of numbers¶

In [299]:

def calculate_sum(numbers):
    total = 0
    for number in numbers:
        total = total + number
    return total

This function takes a list of numbers as input, iterates over each number in the list, adds it to a running total, and returns the final sum.

In [300]:

listing = [10,20,30,40,50,100]
calculate_sum(listing)

Out[300]:

Types of arguments¶

1). Positional Arguments – These arguments are position specific. Thus, while calling a a function always pass the arguments in the same order as it is defined while creating the function.

Example 1: Calculating Simple Interest¶

In [301]:

def simple_interest(principal,time,rate):
    interest=(principal * time * rate)/100
    amount = interest + principal
    return interest, amount

In [302]:

simple_interest(100000,3,12)

Out[302]:

(36000.0, 136000.0)

Example 2: Calcuating Compound Interest and Maturity Amount¶

In [303]:

def compound_interest(principal, rate, time):
    Amount = principal
    for i in range(time):
        Amount = Amount * (1 + rate/100)
    CI = Amount - principal
    return CI , Amount

In [304]:

compound_interest(100000,3,12)

Out[304]:

(42576.08868461792, 142576.08868461792)

Example 3: Create a function to calculate simple interest over a period of n year and put it in a pandas dataframe.¶

In [305]:

def calculate_simple_interest(principal, rate, n):
    # Calculate the interest rate per year
    interest_rate = rate / 100
    
    # Calculate the simple interest for each year
    interest_list = []
    for year in range(1, n+1):
        interest = principal * interest_rate * year
        interest_list.append(interest)
    
    # Create a pandas DataFrame with the results
    df = pd.DataFrame({
        'Year': range(1, n+1),
        'Interest': interest_list
    })
    
    return df

In [306]:

df = calculate_simple_interest(100000, 7.5, 4)
print(df)

   Year  Interest
0     1    7500.0
1     2   15000.0
2     3   22500.0
3     4   30000.0

2). Default Arguments – In some cases we have a default value that can be used only if the value for that parameter is not passed in the function calling statement. Default parameters must be present after the non-default parameters in the function header otherwise it will give an error.

In [307]:

def simple_interest(principal, time, rate=12):
    interest = (principal * time * rate)/100
    amount = interest + principal
    return interest, amount

In [308]:

simple_interest(100000, 3)

Out[308]:

(36000.0, 136000.0)

3). Keywords Arguments – These arguments are name(keyword) specific. Here, in the calling functions we can use the exact name of the parameters to call the function and it need not be in the same order as it is defined while creating the function.

In [309]:

def simple_interest(principal, time, rate):
    interest = (principal * time * rate)/100
    amount = interest + principal
    return interest, amount

In [310]:

simple_interest(time=3, rate=12, principal=100000)

Out[310]:

(36000.0, 136000.0)

4). Variable-length arguments – This type of argument allows us to pass any length of argument in the calling statement of the function according to our needs so there is no need to mention a fixed number of parameters in the function header.

In [311]:

def summation(*length):
    total=0
    for i in length:
        total+=i
    print("Total=",total)

In [312]:

summation(12, 13, 20)

Total= 45

In [313]:

summation(45,87,78,34,56,19,201)

Total= 520

Some more examples¶

In [314]:

#1. Write a function that inputs a number and prints the multiplication table of that number

In [315]:

def multiplication_table(number):
    for i in range(1, 11):
        print("{multiplier} * {multiplicand} = {multiplicantion}"
              .format(multiplier=number, multiplicand=i, multiplicantion=number * i))

In [316]:

multiplication_table(13)

13 * 1 = 13
13 * 2 = 26
13 * 3 = 39
13 * 4 = 52
13 * 5 = 65
13 * 6 = 78
13 * 7 = 91
13 * 8 = 104
13 * 9 = 117
13 * 10 = 130

'Applying' User defined function(UDF) in pandas dataframes¶

In [317]:

#importing a csv file with pd.read_csv
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head(2)

Out[317]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	732758368.8	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	350311387.9	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com

Example 1: create a new categorical(string) variable from a continous(numeric) variable¶

In [318]:

def fun(x):
    if x <= 25000:
        x = 'Below Normal'
    elif x <= 50000:
        x = 'Normal'
    elif x <= 75000:
        x = 'Above Normal'
    else:
        x = 'High'
    return x

In [319]:

df['purchase_value_category'] = df['purchase_value'].apply(fun)
df[["purchase_value", "purchase_value_category"]].head()

Out[319]:

	purchase_value	purchase_value_category
0	65278	Above Normal
1	96399	High
2	57296	Above Normal
3	43650	Normal
4	45016	Normal

In [320]:

df['purchase_value_category'].value_counts()

Out[320]:

Above Normal    274
High            257
Below Normal    240
Normal          228
Name: purchase_value_category, dtype: int64

Example 2: Import multiple csv files in pandas datafram using user defined function¶

Create a function that takes a list of file names as an argument and uses the pandas read_csv() function to read each CSV file and concatenate them into a single DataFrame.

In [321]:

def concatenate_csv_files(file_list):
    # Initialize an empty list to store the DataFrames
    df_list = []
    
    # Loop through each file in the list and read it into a DataFrame
    for file in file_list:
        df = pd.read_csv(file)
        df_list.append(df)
    
    # Concatenate the DataFrames into a single DataFrame
    concatenated_df = pd.concat(df_list, ignore_index=True)
    
    return concatenated_df

In this function, we loop over the list of file names and use pd.read_csv() to read each CSV file and create a DataFrame. We then append the DataFrame to a list. Finally, we use pd.concat() to concatenate all the DataFrames in the list into a single DataFrame.

Note: The "axis" parameter of "pd.concat()" is set to "0", which means that the DataFrames are concatenated vertically (i.e., row-wise). If you want to concatenate them horizontally (i.e., column-wise), you can set axis to "1".

We can also modify this function to include additional parameters that are passed to "pd.read_csv()", such as delimiter, header, etc., depending on your specific requirements.

In [322]:

file_list = ['files_3_Concat/file1.csv', 'files_3_Concat/file2.csv', 'files_3_Concat/file3.csv']
df = concatenate_csv_files(file_list)
df.head()

Out[322]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob
0	22058	24-02-2015 22:55	18-04-2015 02:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	22-02-1976
1	333320	07-06-2015 20:39	08-06-2015 01:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0	apparels	02-01-1962
2	1359	01-01-2015 18:52	01-01-2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	05-03-1962
3	150084	28-04-2015 21:13	04-05-2015 13:54	43650	ATGTXKYKUDUQN	SEO	Safari	M	3.840542e+09	0	health_care	03-07-1974
4	221365	21-07-2015 07:09	09-09-2015 18:40	45016	NAUITBZFJKHWW	Ads	Safari	M	4.155831e+08	0	home_essentials	25-08-1970

In [323]:

def get_csv_file_lengths(file_list):
    # Initialize an empty list to store the (file_name, num_rows) tuples
    lengths_list = []
    
    # Loop through each file in the list and get the number of rows
    for file_name in file_list:
        df = pd.read_csv(file_name)
        num_rows = len(df)
        lengths_list.append((file_name, num_rows))
    
    return lengths_list

In [324]:

get_csv_file_lengths(file_list)

Out[324]:

[('files_3_Concat/file1.csv', 471),
 ('files_3_Concat/file2.csv', 580),
 ('files_3_Concat/file3.csv', 439)]

In [325]:

df.head(2)

Out[325]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob
0	22058	24-02-2015 22:55	18-04-2015 02:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	22-02-1976
1	333320	07-06-2015 20:39	08-06-2015 01:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0	apparels	02-01-1962

Lambda function in pandas¶

A lambda function in pandas is a small anonymous function that can be used to perform simple operations on pandas DataFrames or Series
The lambda function is defined using the lambda keyword, followed by the input parameters and the operation to be performed.
Lambda functions are typically used for one-off operations that do not need to be defined as a separate function.

Unlike user defined function which can we used time and again, it is for one time usage, like OTP (one time password)

Example 1: use of Lamda function to calculates the fee charged on each purchase value in df, and adds the fee values as a new column named "Fee" to the DataFrame¶

In [326]:

df['Fee'] = df['purchase_value']*(1-0.91)
df[['user_id', 'purchase_value', 'Fee']].head()

Out[326]:

	user_id	purchase_value	Fee
0	22058	65278	5875.02
1	333320	96399	8675.91
2	1359	57296	5156.64
3	150084	43650	3928.50
4	221365	45016	4051.44

In [327]:

df['Fee'] = df['purchase_value'].map(lambda x: x - (x*0.91))
df[['user_id', 'purchase_value', 'Fee']].head()

Out[327]:

	user_id	purchase_value	Fee
0	22058	65278	5875.02
1	333320	96399	8675.91
2	1359	57296	5156.64
3	150084	43650	3928.50
4	221365	45016	4051.44

.map(lambda x: x - (x*0.91)): This applies a function to each value in the "purchase_value" column. The function subtracts 91% of the value from the value itself, which gives the fee charged on each purchase. The result is a pandas Series containing the fee values.

Example 2: lambda function to create a new column called 'Bonus', which is 10% of each employee's salary.¶

In [328]:

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 70000, 90000]})
df

Out[328]:

	Name	Age	Salary
0	Alice	25	50000
1	Bob	30	70000
2	Charlie	35	90000

In [329]:

df['Bonus'] = df['Salary']*0.1
df

Out[329]:

	Name	Age	Salary	Bonus
0	Alice	25	50000	5000.0
1	Bob	30	70000	7000.0
2	Charlie	35	90000	9000.0

In [330]:

df['Bonus'] = df['Salary'].apply(lambda x: x * 0.1)
df

Out[330]:

	Name	Age	Salary	Bonus
0	Alice	25	50000	5000.0
1	Bob	30	70000	7000.0
2	Charlie	35	90000	9000.0

Here, we used the apply method to apply a lambda function to each row of the 'Salary' column. The lambda function takes the input parameter x, which represents the value of each element in the 'Salary' column. The apply method returns a new Series object, which we assign to the 'Bonus' column of the original DataFrame.

Example 3: lambda function with filter in pandas to filter rows based on a condition.¶

In [331]:

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 70000, 90000]})

# Use lambda function with filter to filter rows where Salary > 60000
filtered_df = df.loc[lambda df: df['Salary'] > 60000]
filtered_df

Out[331]:

	Name	Age	Salary
1	Bob	30	70000
2	Charlie	35	90000

Here, we used the .loc with a lambda function to apply the filter to the DataFrame.

Example 4: use groupby and then apply a lambda function to filter the grouped data.¶

In [332]:

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
                   'Department': ['Sales', 'Marketing', 'Sales', 'Marketing', 'Sales'],
                   'Salary': [50000, 70000, 90000, 60000, 80000],
                   'Bonus': [7000, 9000, 4000, 3000, 1000]})
df

# Use groupby with lambda to get average salary by department
avg_salary_by_dept = df.groupby('Department')[['Salary','Bonus']].apply(lambda x: x.mean())
avg_salary_by_dept

Out[332]:

	Salary	Bonus
Department
Marketing	65000.000000	6000.0
Sales	73333.333333	4000.0

Example 5: use multiple columns to groupby the data and apply a lambda function.¶

In [333]:

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
                   'Department': ['Sales', 'Marketing', 'Sales', 'Marketing', 'Sales'],
                   'Gender': ['F', 'M', 'M', 'M', 'F'],
                   'Salary': [50000, 70000, 90000, 60000, 80000]})

# Use groupby with lambda to get average salary by department and gender
avg_salary_by_dept_gender = df.groupby(['Department', 'Gender'])['Salary'].apply(lambda x: x.mean())
avg_salary_by_dept_gender

Out[333]:

Department  Gender
Marketing   M         65000.0
Sales       F         65000.0
            M         90000.0
Name: Salary, dtype: float64

Important - We can apply any function inside lambda, even user-defined function, and that makes it quite useful¶

In [334]:

def first_3(x):
    y = x[0:3]
    return y

df['Dept_short'] = df['Department'].apply(lambda x: first_3(x).upper())
df.head()

Out[334]:

	Name	Department	Gender	Salary	Dept_short
0	Alice	Sales	F	50000	SAL
1	Bob	Marketing	M	70000	MAR
2	Charlie	Sales	M	90000	SAL
3	David	Marketing	M	60000	MAR
4	Emily	Sales	F	80000	SAL

Example 6: use multiple if conditions in a lambda function in pandas by using the ; 'and' and 'or' logical operators.¶

In [335]:

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
                   'Department': ['Sales', 'Marketing', 'Sales', 'Marketing', 'Sales'],
                   'Salary': [50000, 70000, 90000, 60000, 80000]})

# Use lambda function with multiple if conditions to create a new column 'Bonus'
df['Bonus'] = df['Salary'].apply(lambda x: 500 if x > 70000 
                                              else 
                                            200 if x >= 60000 and x <= 70000 
                                            else 0)
df

Out[335]:

	Name	Department	Salary	Bonus
0	Alice	Sales	50000	0
1	Bob	Marketing	70000	200
2	Charlie	Sales	90000	500
3	David	Marketing	60000	200
4	Emily	Sales	80000	500

In Pandas map(), apply(), and applymap() are methods used to transform data¶

map() is used to transform values in a Series or a single column of a DataFrame. It takes a dictionary or a function as an argument and applies the function or mapping to the Series. The result is a new Series with the transformed values.

apply() is used to apply a function to each row or column of a DataFrame. It can be used to perform more complex transformations on the data. It can also be used to apply functions that take multiple arguments, by passing additional arguments as keyword arguments.

applymap() is used to apply a function element-wise to every element in a DataFrame. It is similar to map, but it applies the function to every element in the DataFrame instead of just a single column.

Example 1: Using map() to transform values in a Series¶

In [336]:

names = pd.Series(['Alice', 'Bob', 'Charlie'])

# using map to transform the values
first_three_letters = names.map(lambda x: x[0:3])
first_three_letters

Out[336]:

0    Ali
1    Bob
2    Cha
dtype: object

Example 2: Using apply() to apply a function to each row or column of a DataFrame¶

In [337]:

data = {
    'store1': [10, 20, 30],
    'store2': [5, 10, 15],
    'store3': [20, 30, 40],
    'store4': [40, 50, None]
}
df = pd.DataFrame(data, index=['product1', 'product2', 'product3'])
df

Out[337]:

	store1	store2	store3	store4
product1	10	5	20	40.0
product2	20	10	30	50.0
product3	30	15	40	NaN

In [338]:

# using apply to apply a function to each row, try with axis = 0 as well
df_applied= df.apply(lambda row: row.sum(), axis=1)
df_applied

Out[338]:

product1     75.0
product2    110.0
product3     85.0
dtype: float64

Example 3: Using applymap() to apply a function element-wise to every element in a DataFrame¶

In [339]:

# using applymap to apply a function to every element
df.applymap(lambda x: x ** 2)

Out[339]:

	store1	store2	store3	store4
product1	100	25	400	1600.0
product2	400	100	900	2500.0
product3	900	225	1600	NaN

The main differences between these methods are:

map() is used to transform values in a Series or a single column of a DataFrame.
apply() is used to apply a function to each OR 1 OR more than 1 row or column of a DataFrame.
applymap() is used to apply a function element-wise to every element in a DataFrame.

Module 13 : Data management using Python-Pandas¶

1. Merging Data¶

Merging data involves combining two or more data sets into a single data set based on common columns or indexes.¶

In [340]:

import pandas as pd

# create two data frames
df1 = pd.DataFrame({'id': ['1001', '1002', '1003'], 'name':   ['abc', 'mno', 'opq']})
df2 = pd.DataFrame({'id': ['1001', '1002', '1004'], 'salary': [1100.25, 2200.50, 3300.75]})

In [341]:

# merge the two data frames based 'on' the key column
merged_df = pd.merge(df1, df2, on='id')

# display the merged data frame
merged_df

Out[341]:

	id	name	salary
0	1001	abc	1100.25
1	1002	mno	2200.50

Note: The default is inner join only

Merging Data Frames with Different Merge Methods¶

Inner join or Inner Merge returns only the rows that have matching values in both data frames.¶

In [342]:

# merge the two data frames based on the id column using inner merge
merged_df = pd.merge(df1, df2, on='id', how='inner')
merged_df

Out[342]:

	id	name	salary
0	1001	abc	1100.25
1	1002	mno	2200.50

In the "pd.merge()" function, "on" and "how" are keywords used to specify how the merge should be performed.

"on" keyword is used to specify the column or index on which the merge should be performed

"how" keyword is used to specify the type of merge to perform.

Left Join returns all the rows from the left data frame and the matching rows from the right data frame, and fills in missing values with NaN.¶

In [343]:

# merge the data frames using a left join
merged_df = pd.merge(df1, df2, on='id', how='left')
merged_df

Out[343]:

	id	name	salary
0	1001	abc	1100.25
1	1002	mno	2200.50
2	1003	opq	NaN

Right Join is similar to a left join, but returns all the rows from the right data frame instead.¶

In [344]:

# merge the data frames using a right join
merged_df = pd.merge(df1, df2, on='id', how='right')
merged_df

Out[344]:

	id	name	salary
0	1001	abc	1100.25
1	1002	mno	2200.50
2	1004	NaN	3300.75

Outer Join returns all the rows from both data frames, and fills in missing values with NaN where there are no matches.¶

In [345]:

# merge the data frames using a outer join
merged_df = pd.merge(df1,df2, on='id', how='outer')
merged_df

Out[345]:

	id	name	salary
0	1001	abc	1100.25
1	1002	mno	2200.50
2	1003	opq	NaN
3	1004	NaN	3300.75

The "left_on" and "right_on" parameters are used to specify the names of the columns to be used for the merge operation from the left and right DataFrames respectively.¶¶

In [346]:

# create two data frames
df1 = pd.DataFrame({'id': ['1001', '1002', '1003'], 'name':   ['abc', 'mno', 'opq']})
df2 = pd.DataFrame({'customer_id': ['1001', '1002', '1004'], 'salary': [1100.25, 2200.50, 3300.75]})

In [347]:

merged_data = pd.merge(df1, df2, left_on='id', right_on='customer_id', how='outer')
merged_data

Out[347]:

	id	name	customer_id	salary
0	1001	abc	1001	1100.25
1	1002	mno	1002	2200.50
2	1003	opq	NaN	NaN
3	NaN	NaN	1004	3300.75

I hope you have noticed that forth ID in the last example has come black, so to avoid this issue, better we rename the ID variables and then merge¶

One of the most important thing in merge/join is - how it behaves with many to many condition. let's see:

In [348]:

import pandas as pd

# create two data frames
df1 = pd.DataFrame({'id': ['1001', '1001', '1003'], 'name':   ['abc', 'mno', 'opq']})
df2 = pd.DataFrame({'id': ['1001', '1001', '1004'], 'salary': [1100.25, 2200.50, 3300.75]})
# merge the two data frames based 'on' the key column
merged_df = pd.merge(df1, df2, on='id',how ='outer')

# display the merged data frame
merged_df

Out[348]:

	id	name	salary
0	1001	abc	1100.25
1	1001	abc	2200.50
2	1001	mno	1100.25
3	1001	mno	2200.50
4	1003	opq	NaN
5	1004	NaN	3300.75

So, basically it creates a CARETSIAN product, like SQL¶

join() function is used to join two DataFrames on their index.¶

The join() function is similar to the merge() function, but it is more convenient to use when the join key is the index of the DataFrames.¶

In [349]:

# Create two DataFrames
df1 = pd.DataFrame({'var1': [1, 2, 3], 'var2': [4, 5, 6]})
df2 = pd.DataFrame({'var3': [7, 8, 9], 'var4': [10, 11, 12]})

In [350]:

# Join the DataFrames on their index
result = df1.join(df2)
result

Out[350]:

	var1	var2	var3	var4
0	1	4	7	10
1	2	5	8	11
2	3	6	9	12

The join() function by default performs a left join, which means that it will keep all the rows of the left DataFrame and fill in missing values with NaN for any missing values in the right DataFrame.

In [351]:

# Create two DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [2, 3],    'D': [7, 8]})

# Join the DataFrames on their index
result = df1.join(df2, how='outer')
result

Out[351]:

	A	B	C	D
0	1	4	2.0	7.0
1	2	5	3.0	8.0
2	3	6	NaN	NaN

2. Append Data¶

In Pandas, there are several ways to append data to a DataFrame. Here are some of the most common ways:¶

Using the append() method:¶

In [352]:

import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['1', '2', '3', '4']})
df2 = pd.DataFrame({'col1': [5, 6, 7], 'col3': [1005, 1006, 1007]})

In [353]:

df = df1.append(df2, ignore_index=True)
df

C:\Users\ragarwal\AppData\Local\Temp\ipykernel_14508\733254066.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df = df1.append(df2, ignore_index=True)

Out[353]:

	col1	col2	col3
0	1	1	NaN
1	2	2	NaN
2	3	3	NaN
3	4	4	NaN
4	5	NaN	1005.0
5	6	NaN	1006.0
6	7	NaN	1007.0

In [354]:

df.dtypes

Out[354]:

col1      int64
col2     object
col3    float64
dtype: object

In [355]:

#SQL Union

Note: The ignore_index=True parameter is used to reset the index of the appended DataFrame to avoid duplicate index values.

Drop all additional columns from the appended DataFrame by specifying the 'drop()' method with the 'columns' parameter set to the list of additional columns.¶

In [356]:

# Create two DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'C': [10, 11, 12]})

In [357]:

# Append df2 to df1 and drop additional columns
result = df1.append(df2, ignore_index=True).drop(columns=df2.columns.difference(df1.columns))
result

C:\Users\ragarwal\AppData\Local\Temp\ipykernel_14508\397557649.py:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  result = df1.append(df2, ignore_index=True).drop(columns=df2.columns.difference(df1.columns))

Out[357]:

	A	B
0	1	4.0
1	2	5.0
2	3	6.0
3	7	NaN
4	8	NaN
5	9	NaN

What happens if same column is Char in one and numeric in another ?¶

In [358]:

import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['1', '2', '3', '4']})
df2 = pd.DataFrame({'col1': ['5', '6', '7'], 'col3': [1005, 1006, 1007]})

df = df1.append(df2, ignore_index=True)
df

C:\Users\ragarwal\AppData\Local\Temp\ipykernel_14508\986494925.py:5: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df = df1.append(df2, ignore_index=True)

Out[358]:

	col1	col2	col3
0	1	1	NaN
1	2	2	NaN
2	3	3	NaN
3	4	4	NaN
4	5	NaN	1005.0
5	6	NaN	1006.0
6	7	NaN	1007.0

In [359]:

df.dtypes

Out[359]:

col1     object
col2     object
col3    float64
dtype: object

Using the concat() function: This function concatenates two or more DataFrames along a specified axis.¶¶

In [360]:

result = pd.concat([df1, df2], ignore_index=True)
result

Out[360]:

	col1	col2	col3
0	1	1	NaN
1	2	2	NaN
2	3	3	NaN
3	4	4	NaN
4	5	NaN	1005.0
5	6	NaN	1006.0
6	7	NaN	1007.0

Using the insert() method: This method can be used to insert a new column to a DataFrame at a specified position.¶

In [361]:

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

df.insert(1, 'new_col', [5, 6])

df

Out[361]:

	col1	new_col	col2
0	1	5	3
1	2	6	4

Note: The resulting DataFrame has three columns, with the new column inserted at position 1.

Module 14 : Complex data transformation techniques : transform, .pivot() .melt() .transpose()¶

In [362]:

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'year': [2019, 2019, 2019, 2020, 2020, 2020],
    'sales': [100, 200, 150, 120, 180, 220]
})
df

Out[362]:

	name	year	sales
0	Alice	2019	100
1	Bob	2019	200
2	Charlie	2019	150
3	Alice	2020	120
4	Bob	2020	180
5	Charlie	2020	220

In [363]:

df['mean_sales'] = df.groupby(['year'])['sales'].transform('mean')
df

Out[363]:

	name	year	sales	mean_sales
0	Alice	2019	100	150.000000
1	Bob	2019	200	150.000000
2	Charlie	2019	150	150.000000
3	Alice	2020	120	173.333333
4	Bob	2020	180	173.333333
5	Charlie	2020	220	173.333333

.pivot() method in pandas¶

Utility is quite similiar to Pivot table of MS Excel or Proc Transpose of SAS¶

Basic syntax: df.pivot(index=None, columns=None, values=None)

index: This is the column or list of columns that will be used to index the rows of the pivoted dataframe.
columns: This is the column that contains the values to be used as column headers in the pivoted dataframe.
values: This is the column that contains the values to be used as the values of the pivoted dataframe.

Example 1:¶

In [364]:

pivoted_df = df.pivot(index='name', columns='year', values='sales')
pivoted_df

Out[364]:

year	2019	2020
name
Alice	100	120
Bob	200	180
Charlie	150	220

Example 2: Pivot with aggregation function¶

In [365]:

pivoted_df = df.pivot_table(index='year', values='sales', aggfunc='sum')
pivoted_df

Out[365]:

	sales
year
2019	450
2020	520

Example 3: Pivot with multiple index and column values¶

In [366]:

df = pd.DataFrame({
    'name': ['Alice', 'Alice', 'Bob', 'Bob', 'Charlie', 'Charlie'],
    'year': [2019, 2020, 2019, 2020, 2019, 2020],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
    'sales': [100, 120, 200, 180, 150, 220] })
df

Out[366]:

	name	year	quarter	sales
0	Alice	2019	Q1	100
1	Alice	2020	Q2	120
2	Bob	2019	Q1	200
3	Bob	2020	Q2	180
4	Charlie	2019	Q1	150
5	Charlie	2020	Q2	220

In [367]:

pivoted_df = df.pivot(index=['name', 'quarter'], columns='year', values='sales')
pivoted_df

Out[367]:

	year	2019	2020
name	quarter
Alice	Q1	100.0	NaN
Alice	Q2	NaN	120.0
Bob	Q1	200.0	NaN
Bob	Q2	NaN	180.0
Charlie	Q1	150.0	NaN
Charlie	Q2	NaN	220.0

.melt() function : Unpivot a DataFrame from wide to long format.¶

Syntax: pd.melt(DF, id_vars=, value_vars=, var_name=, value_name='value')

Parameters :

id_var: This is index columns. Columns that should remain as-is in the output.
value_var: Columns that we want to convert to rows.
var_name: The name of the new column. How do you want to name your old columns under one column
value_name: The name of the column that will contain the stacked values.

In [368]:

data = {'Customer_id': ['cus_01', 'cus_02', 'cus_03', 'cus_04'], 
        'Days': ['Saturday','Saturday','Sunday','Sunday'], 
        'MinDuration':[2,4,4,2], 
        'MaxDuration':[4,6,6,4]} 
df = pd.DataFrame(data)
df

Out[368]:

	Customer_id	Days	MinDuration	MaxDuration
0	cus_01	Saturday	2	4
1	cus_02	Saturday	4	6
2	cus_03	Sunday	4	6
3	cus_04	Sunday	2	4

In [369]:

df_melt=pd.melt(df, id_vars=['Customer_id','Days'],
                    value_vars=['MinDuration','MaxDuration'],
                    var_name='DurationType',
                    value_name='Duration')
df_melt

Out[369]:

	Customer_id	Days	DurationType	Duration
0	cus_01	Saturday	MinDuration	2
1	cus_02	Saturday	MinDuration	4
2	cus_03	Sunday	MinDuration	4
3	cus_04	Sunday	MinDuration	2
4	cus_01	Saturday	MaxDuration	4
5	cus_02	Saturday	MaxDuration	6
6	cus_03	Sunday	MaxDuration	6
7	cus_04	Sunday	MaxDuration	4

".T" attribute or the ".transpose()" method to transpose a dataframe, which means to swap its rows and columns.¶

This can be useful when you want to change the orientation of your data or when you want to perform certain operations that require a different shape of data.¶

Example 1 : The shortcut for the transpose() function is .T¶

In [370]:

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [20, 30, 35],
    'city': ['Boston', 'Oakland', 'San Francisco']
})
df

Out[370]:

	name	age	city
0	Alice	20	Boston
1	Bob	30	Oakland
2	Charlie	35	San Francisco

In [371]:

transposed_df = df.T
transposed_df

Out[371]:

	0	1	2
name	Alice	Bob	Charlie
age	20	30	35
city	Boston	Oakland	San Francisco

The resulting dataframe has an index made up of the column names from the original dataframe, and columns made up of the original row values.

Example 2 : Transpose a DataFrame with a DateTimeIndex¶

In [372]:

idx = pd.date_range('2021-01-01', periods=3, freq='D')
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=idx)
df

Out[372]:

	A	B	C
2021-01-01	1	4	7
2021-01-02	2	5	8
2021-01-03	3	6	9

In [373]:

new_df = df.T
new_df

Out[373]:

	2021-01-01	2021-01-02	2021-01-03
A	1	2	3
B	4	5	6
C	7	8	9

Module 15 : Miscellaneous¶

Of course, we have must have missed a lot of things, intentionally or unintentionally Please let us know if you want us to cover something, and we will add that here

In [374]:

# how to sort a data

In [375]:

#importing a csv file with pd.read_csv
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")

In [376]:

df.head()

Out[376]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email
0	22058	2/24/2015 22:55	4/18/2015 2:47	65278	QVPSPJUOCKZAR	SEO	Chrome	M	7.327584e+08	0	home_essentials	2/22/1976	Aaron P Maashoh	aaron.maashoh@yahoo.com
1	333320	6/7/2015 20:39	6/8/2015 1:38	96399	EOGFQPIZPYXFZ	Ads	Chrome	F	3.503114e+08	0	apparels	1/2/1962	Rick D Rothackerj	rick_rothackerj@yahoo.com
2	1359	1/1/2015 18:52	1/1/2015 18:52	57296	YSSKYOSJHPPLJ	SEO	Opera	M	2.621474e+09	1	electronics	3/5/1962	Harriet M McLeodd	harriet.mcleodd@gmail.com
3	150084	4/28/2015 21:13	5/4/2015 13:54	43650	ATGTXKYKUDUQN	SEO	Safari	M	3.840542e+09	0	health_care	7/3/1974	Sinead M Carews	sinead.carews@gmail.com
4	221365	7/21/2015 7:09	9/9/2015 18:40	45016	NAUITBZFJKHWW	Ads	Safari	M	4.155831e+08	0	home_essentials	8/25/1970	Laurence H Frosty	laurence.frosty@tntech.edu

In [377]:

df.sort_values(['source','browser'],ascending= [True,False],inplace=True,ignore_index=True)
df.head()

Out[377]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	category	dob	name	email
0	221365	7/21/2015 7:09	9/9/2015 18:40	45016	NAUITBZFJKHWW	Ads	Safari	M	4.155831e+08	home_essentials	8/25/1970	Laurence H Frosty	laurence.frosty@tntech.edu
1	199700	7/11/2015 18:26	10/28/2015 21:59	7603	TEPSJVVXGNTYR	Ads	Safari	F	1.819009e+09	health_care	6/17/1980	Matt U Falloonz	matt.falloonz@brandermill.com
2	119824	3/20/2015 0:31	4/5/2015 7:31	51106	WFIIFCPIOGMHT	Ads	Safari	M	1.314238e+05	cosmetics	1/16/1977	Herbert P Lashb	herbert.lashb@gmail.com
3	192767	1/8/2015 12:17	4/28/2015 2:09	85643	ZTORJKTDMYTJH	Ads	Safari	M	2.568640e+09	electronics	2/21/1983	Costa P Alisterq	costa.alisterq@flossresearch.com
4	310147	7/6/2015 5:45	7/22/2015 4:19	70400	GAPQDLEFDZYCO	Ads	Safari	F	2.626781e+09	apparels	8/20/1995	Aileen E Wangy	aileen.wangy@brandermill.com

In [378]:

# How to delete all the data from a Dataframe

In [379]:

df.drop(df.index , inplace=True)
df.head()

Out[379]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	ip_address	class	category	dob	name	email

Pages

Python application part 1