Data Wrangling with Pandas¶
Data wrangling is the practice of converting data from a "raw" form into a user-ready form for descriptive analytics and provide feed for other horizons of analytics such as predictive analytics.
*Pandas is a data-centric package(library) of Python eco-system for importing, manipulating, managing and analyzing data. This library was originally built on NumPy, the fundamental library for scientific computation in Python¶
*A DataFrame is a two-dimensional data structure. We can perform many operations on these datasets such as arithmetic operation, columns/rows selection, columns/rows addition etc.¶
Module 0: Before jumping to Pandas, let's cover few basics of Python¶
# There are various type of variables in Python
a = 1 # integer
b = 1.08 # float
c = "Ask Analytics" # string
d = True #boolean
print("a is of type: ", type(a) , ", b is of type: ",type(b), ", c is of type: ",type(c), ", d is of type: ",type(d))
a is of type: <class 'int'> , b is of type: <class 'float'> , c is of type: <class 'str'> , d is of type: <class 'bool'>
# We can convert the type of variables:
a = 1
b = float(a)
b
1.0
a = 1
b = str(a)
b
'1'
#Python name rules:
#Names are case sensitive, can not start with number or any special character except underscore ('_').
#Names can not be key words such as : False, True, None, lambda, etc.
#Like any language, there are basic operators in Python as well.
#Most of the operators are same as those in SQL or any other languages.
#-------------- Arithmetic Operators ------------------------------------------
#+ Addition - Subtraction * Multiplication / Division
#% Modulus ** Power // Floor Division
#------------ Comparison Operators ------------------------------------------
#== Equal != Not equal
#> more than < less than
#>= more than or equal to <= less than or equal to
#------------ Logical Operators ------------------------------------------
#and (at places we also use '&' for and)
#or (we can also use '|' for or )
#not (we can also use '~' for not, at places)
# Arithmetic Operators
print('5 + 4 = ' , 5 + 4)
print('5 - 4 = ' , 5 - 4)
print('5 * 4 = ' , 5 * 4)
print('5 / 4 = ' , 5 / 4)
print('7 % 4 = ' , 7 % 4)
print('7 // 4 = ', 7 // 4)
print('7 ** 4 = ', 7 ** 4)
5 + 4 = 9 5 - 4 = 1 5 * 4 = 20 5 / 4 = 1.25 7 % 4 = 3 7 // 4 = 1 7 ** 4 = 2401
# Comparison Operators
print('5 == 4', 5 == 4)
print('5 != 4', 5 != 4)
print('5 < 4', 5 < 4)
print('5 > 4', 5 > 4)
print('5 <= 4', 5 <= 4)
print('5 >= 4', 5 >= 4)
5 == 4 False 5 != 4 True 5 < 4 False 5 > 4 True 5 <= 4 False 5 >= 4 True
# Logical Operators
print('5 > 4 or 3 < 2 will resolve to:', 5 > 4 or 3 < 2)
print('5 > 4 and 3 < 2 will resolve to:', 5 > 4 and 3 < 2)
print('not(5 > 4 and 3 < 2) will resolve to:', not(5 > 4 and 3 < 2))
5 > 4 or 3 < 2 will resolve to: True 5 > 4 and 3 < 2 will resolve to: False not(5 > 4 and 3 < 2) will resolve to: True
Loops in Python¶
for number in range(4):
print(number)
0 1 2 3
for number in range(1, 10, 2):
print(number, end= ' ') # default end is 'end of line'm but you can customize it
1 3 5 7 9
for character in "Ask Analytics":
print(character, end= '#')
A#s#k# #A#n#a#l#y#t#i#c#s#
for outer_loop_number in range(0, 2):
# above one is an outer loop, and below would be a inner loop
for inner_loop_character in "XYZ":
print(outer_loop_number, inner_loop_character)
0 X 0 Y 0 Z 1 X 1 Y 1 Z
#While Loop, it keep on running, until the condition is true
x = 0
while (x <= 5):
print(x)
x = x + 2
0 2 4
We can use three methods to jump in the loops : break, continue and pass¶
#Break statement -
x = 0
for x in range(10):
if x == 7:
break
print(x)
# Basiscally, it kept on running until, it encountered 7
# the moment it met the condition for "break", it came out of the loop
0 1 2 3 4 5 6
#Continue statement -
x = 0
for x in range(10):
if x == 7:
continue
print(x)
# It kept on running till end, but when it met the condition of "continue", it skipped that particular iteration only
# so in simpler english, it is "skip"
0 1 2 3 4 5 6 8 9
#pass statement -
x = 0
for x in range(10):
if x == 7:
pass
print(x)
# With pass, it actually didn't do anything,
# but the pass statement became a "placeholder", where in future you can put some code with action
0 1 2 3 4 5 6 7 8 9
Basic data strucures in Python : List, tuples, dictionary and sets¶
# Here is an example of List
list_1 = [1,3,5,7,9]
print(list_1)
[1, 3, 5, 7, 9]
# In python, index of elements start with 0 from left: 0--> 1-->2-->3.
# From right side index is like: -4 <-- -3 <-- -2 <-- -1
# That's why in this tutorial, index of modules starts with zero!
list_1[0] # first element
1
list_1[-1] # last element
9
list_1[1:4] # element with index 1, 2 and 3. so in the range, last one is not included
[3, 5, 7]
list_1[0:5:2] # element with index 0,2,4
[1, 5, 9]
list_1[::-1] # reversed list
[9, 7, 5, 3, 1]
list_1.append(11) # append can add only one element (individual element or one list) at a time
list_1
[1, 3, 5, 7, 9, 11]
list_1.append([13,15])
list_1
[1, 3, 5, 7, 9, 11, [13, 15]]
list_1.extend([13,15]) # extend can add more than one element at a time
list_1
[1, 3, 5, 7, 9, 11, [13, 15], 13, 15]
list_1.insert(2,'x') # insert 'x' at 2nd index position
list_1
[1, 3, 'x', 5, 7, 9, 11, [13, 15], 13, 15]
Module 1: Importing data in as pandas DataFrame - with read_csv and read_excel and read_json¶
Let's first read CSV¶
import pandas as pd
#importing a simple csv file with pd.read_csv
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
3 | 150084 | 4/28/2015 21:13 | 5/4/2015 13:54 | 43650 | ATGTXKYKUDUQN | SEO | Safari | M | 3.840542e+09 | 0 | health_care | 7/3/1974 | Sinead M Carews | sinead.carews@gmail.com |
4 | 221365 | 7/21/2015 7:09 | 9/9/2015 18:40 | 45016 | NAUITBZFJKHWW | Ads | Safari | M | 4.155831e+08 | 0 | home_essentials | 8/25/1970 | Laurence H Frosty | laurence.frosty@tntech.edu |
#importing a csv file where data starts at line#5, first the incorrect way
df0 = pd.read_csv("input_datsets/Sample_Fraud_Detection _with_comments.csv")
df0.head()
This is a sample data | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | it has 14 columns | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | it has 1000 records | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | |
4 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 732758368.8 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
#importing a csv file where data starts at line#5, now the correct way
df1 = pd.read_csv("input_datsets/Sample_Fraud_Detection _with_comments.csv", skiprows= 4)
df1.columns = df.columns
df1.head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
3 | 150084 | 4/28/2015 21:13 | 5/4/2015 13:54 | 43650 | ATGTXKYKUDUQN | SEO | Safari | M | 3.840542e+09 | 0 | health_care | 7/3/1974 | Sinead M Carews | sinead.carews@gmail.com |
4 | 221365 | 7/21/2015 7:09 | 9/9/2015 18:40 | 45016 | NAUITBZFJKHWW | Ads | Safari | M | 4.155831e+08 | 0 | home_essentials | 8/25/1970 | Laurence H Frosty | laurence.frosty@tntech.edu |
# Skiprow method can also be used along with nrows option to read specific rows of the data, especially when data is huge
df1 = pd.read_csv("input_datsets/Sample_Fraud_Detection _with_comments.csv", skiprows= 20, nrows = 5)
df1.columns = df.columns
df1.head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 78986 | 5/15/2015 3:52 | 8/11/2015 2:29 | 66143 | TGHVAWBMZRDHH | SEO | FireFox | M | 3.503883e+09 | 0 | apparels | 4/12/1992 | Simon W Jessope | simon.jessope@gmail.com |
1 | 119824 | 3/20/2015 0:31 | 4/5/2015 7:31 | 51106 | WFIIFCPIOGMHT | Ads | Safari | M | 1.314238e+05 | 0 | cosmetics | 1/16/1977 | Herbert P Lashb | herbert.lashb@gmail.com |
2 | 357386 | 2/3/2015 0:48 | 3/24/2015 18:27 | 63268 | NWSVDOHYIOBDR | Ads | FireFox | M | 3.037372e+09 | 0 | apparels | 9/20/1991 | Tim Z Hepherp | tim.hepherp@hotmail.com |
3 | 289172 | 7/17/2015 5:48 | 11/12/2015 22:08 | 46018 | KFZGQIWDIRLZB | Direct | FireFox | F | 1.044590e+09 | 0 | cosmetics | 8/19/1962 | Isabel U Reynoldsb | isabel_reynoldsb@rigginglifting.com |
4 | 154699 | 7/8/2015 15:35 | 7/20/2015 0:11 | 23319 | TYZPZLEWNJJRI | Direct | FireFox | F | 3.847612e+09 | 0 | apparels | 10/6/1992 | Christopher Z Doeringv | christopher_doeringv@gmail.com |
#importing selective columns
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv", usecols= ['user_id','purchase_value','purchase_time'])
df.head()
user_id | purchase_time | purchase_value | |
---|---|---|---|
0 | 22058 | 4/18/2015 2:47 | 65278 |
1 | 333320 | 6/8/2015 1:38 | 96399 |
2 | 1359 | 1/1/2015 18:52 | 57296 |
3 | 150084 | 5/4/2015 13:54 | 43650 |
4 | 221365 | 9/9/2015 18:40 | 45016 |
Let's now read Excel file¶
#importing an excel file with pd.read_excel
df_read_excel = pd.read_excel("input_datsets/Sample_Fraud_Detection.xlsx") # will import the first tab by default
excel_data_df = pd.read_excel('input_datsets/Sample_Fraud_Detection.xlsx', sheet_name='Purchase') # will import the specific tab
excel_data_df.head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 1976-02-22 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 | apparels | 1962-01-02 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 1962-03-05 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 43650 | ATGTXKYKUDUQN | SEO | Safari | M | 3.840542e+09 | 0 | health_care | 1974-07-03 | Sinead M Carews | sinead.carews@gmail.com |
4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 45016 | NAUITBZFJKHWW | Ads | Safari | M | 4.155831e+08 | 0 | home_essentials | 1970-08-25 | Laurence H Frosty | laurence.frosty@tntech.edu |
Read Json (JavaScript Object Notation) files¶
# Recently, Json has become a popular format of data. Let's learn how to read and normalize a Json data file
# read the json file
data1 = pd.read_json('input_datsets/sample1.json')
data1
fruit | size | color | |
---|---|---|---|
0 | Apple | Large | Red |
data2 = pd.read_json('input_datsets/sample2.json')
data2
firstName | lastName | gender | age | address | phoneNumbers | |
---|---|---|---|---|---|---|
0 | Joe | Jackson | male | 28 | {'streetAddress': '101', 'city': 'San Diego', ... | [{'type': 'home', 'number': '7349282382'}] |
#Above data is difficult to digest, let's make it more edible
data2_part_1 = data2[['firstName','lastName','gender','age']]
# Address can be normalized following way, since it is a dictionary
Address = pd.json_normalize(data2.address, max_level=1)
Address
streetAddress | city | state | |
---|---|---|---|
0 | 101 | San Diego | CA |
# phoneNumbers can be normalized following way, since it is a dictionary within a List
phoneNumbers = pd.json_normalize(data2.phoneNumbers[0], max_level=1)
phoneNumbers
type | number | |
---|---|---|
0 | home | 7349282382 |
FINAL_data2 = pd.concat([data2_part_1,Address,phoneNumbers],axis = 1)
FINAL_data2
firstName | lastName | gender | age | streetAddress | city | state | type | number | |
---|---|---|---|---|---|---|---|---|---|
0 | Joe | Jackson | male | 28 | 101 | San Diego | CA | home | 7349282382 |
data3 = pd.read_json('input_datsets/sample3.json')
data3
student_roll_no | details | |
---|---|---|
0 | 101 | {'name': 'Andew', 'age': 12, 'grade': 'A'} |
1 | 102 | {'name': 'Bhuvan', 'age': 18, 'grade': 'B'} |
2 | 103 | {'name': 'Clinton', 'age': 11, 'grade': 'A'} |
3 | 104 | {'name': 'Drake', 'age': 12, 'grade': 'C'} |
4 | 105 | {'name': 'Eisha', 'age': 13, 'grade': 'B'} |
5 | 106 | {'name': 'Farhan', 'age': 22, 'grade': 'C'} |
6 | 107 | {'name': 'Garima', 'age': 11, 'grade': 'A'} |
7 | 108 | {'name': 'Himanshu', 'age': 19, 'grade': 'A'} |
8 | 109 | {'name': 'Ishaan', 'age': 10, 'grade': 'D'} |
9 | 110 | {'name': 'Jason', 'age': 9, 'grade': 'B'} |
student_roll_no = data3.student_roll_no
details = pd.json_normalize(data3.details)
FINAL_data3 = pd.concat([student_roll_no,details],axis = 1)
FINAL_data3
student_roll_no | name | age | grade | |
---|---|---|---|---|
0 | 101 | Andew | 12 | A |
1 | 102 | Bhuvan | 18 | B |
2 | 103 | Clinton | 11 | A |
3 | 104 | Drake | 12 | C |
4 | 105 | Eisha | 13 | B |
5 | 106 | Farhan | 22 | C |
6 | 107 | Garima | 11 | A |
7 | 108 | Himanshu | 19 | A |
8 | 109 | Ishaan | 10 | D |
9 | 110 | Jason | 9 | B |
data4 = pd.read_json('input_datsets/sample4.json')
data4
people | |
---|---|
0 | {'firstName': 'Joe', 'lastName': 'Jackson', 'g... |
1 | {'firstName': 'James', 'lastName': 'Smith', 'g... |
2 | {'firstName': 'Emily', 'lastName': 'Jones', 'g... |
# phoneNumbers can be normalized following way, since it is a dictionary within a List
data4_1 = pd.json_normalize(data4.people[0], max_level=1)
data4_1
firstName | lastName | gender | age | number | |
---|---|---|---|---|---|
0 | Joe | Jackson | male | 28 | 7349282382 |
Module 2: Let's do some basic check of data¶
# check top 5 rows
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
3 | 150084 | 4/28/2015 21:13 | 5/4/2015 13:54 | 43650 | ATGTXKYKUDUQN | SEO | Safari | M | 3.840542e+09 | 0 | health_care | 7/3/1974 | Sinead M Carews | sinead.carews@gmail.com |
4 | 221365 | 7/21/2015 7:09 | 9/9/2015 18:40 | 45016 | NAUITBZFJKHWW | Ads | Safari | M | 4.155831e+08 | 0 | home_essentials | 8/25/1970 | Laurence H Frosty | laurence.frosty@tntech.edu |
# check bottom 4 rows
df.tail(4)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
995 | 328579 | 4/10/2015 1:02 | 6/25/2015 2:05 | 54775 | UQCGEFBSOAVIJ | SEO | Chrome | M | 1.038384e+09 | 0 | electronics | 4/6/1994 | Praveen L Menonw | praveen.menonw@gmail.com |
996 | 263174 | 7/14/2015 2:13 | 7/24/2015 1:25 | 84302 | HVSGFVZSDBKEH | Ads | Safari | M | 4.169080e+09 | 0 | electronics | 9/1/1979 | Ra T Alpern | ra.alpern@flossresearch.com |
997 | 117191 | 2/28/2015 17:14 | 4/14/2015 15:39 | 20193 | GDHCTKIKPHENW | Direct | IE | M | 1.411318e+09 | 0 | apparels | 11/17/1979 | Julia D Edwardsj | julia.edwardsj@sutletgroup.com |
998 | 65732 | 1/11/2015 11:10 | 1/22/2015 4:16 | 26225 | VSMNAOFPSEQOL | Ads | IE | M | 3.765208e+09 | 0 | apparels | 3/23/1988 | Conor V Humphriesk | conor.humphriesk@gmail.com |
# Basic summarization of a variable "below \n has been used to mark end of the line"
print(
df['purchase_value'].mean(),"\n",
df['purchase_value'].max(),"\n",
df['purchase_value'].min(),"\n",
df['purchase_value'].sum()
)
50992.49149149149 99743 1176 50941499
# mean of all the numeric variables
df.mean(numeric_only=True)
user_id 1.963437e+05 purchase_value 5.099249e+04 ip_address 2.099747e+09 class 1.011011e-01 dtype: float64
df[['class','purchase_value']].sum()
class 101 purchase_value 50941499 dtype: int64
#Default is the descriptive statistics of only the numeric columns
df.describe()
user_id | purchase_value | ip_address | class | |
---|---|---|---|---|
count | 999.000000 | 999.000000 | 9.990000e+02 | 999.000000 |
mean | 196343.719720 | 50992.491491 | 2.099747e+09 | 0.101101 |
std | 116166.717945 | 28681.076653 | 1.262366e+09 | 0.301614 |
min | 170.000000 | 1176.000000 | 1.314238e+05 | 0.000000 |
25% | 95017.500000 | 26038.500000 | 9.856234e+08 | 0.000000 |
50% | 194882.000000 | 52507.000000 | 2.076032e+09 | 0.000000 |
75% | 297765.000000 | 75653.500000 | 3.196621e+09 | 0.000000 |
max | 399975.000000 | 99743.000000 | 4.294263e+09 | 1.000000 |
#Descriptive statistics of all categorical column
df.describe(include=[object])
signup_time | purchase_time | device_id | source | browser | sex | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|
count | 999 | 999 | 999 | 999 | 999 | 974 | 999 | 970 | 999 | 999 |
unique | 995 | 998 | 994 | 3 | 5 | 2 | 5 | 917 | 999 | 998 |
top | 1/7/2015 16:57 | 1/11/2015 16:59 | BWSMVSLCJXMCM | Ads | Chrome | M | cosmetics | 2/1/1993 | Aaron P Maashoh | john.odonnell@gmail.com |
freq | 2 | 2 | 2 | 408 | 397 | 560 | 210 | 3 | 1 | 2 |
#Descriptive statistics of any column
df["device_id"].describe()
count 999 unique 994 top BWSMVSLCJXMCM freq 2 Name: device_id, dtype: object
Use following codes to identify how many NULL values are there in each field¶
df.isna().sum()
user_id 0 signup_time 0 purchase_time 0 purchase_value 0 device_id 0 source 0 browser 0 sex 25 ip_address 0 class 0 category 0 dob 29 name 0 email 0 dtype: int64
df.isnull().sum() # sames as above
user_id 0 signup_time 0 purchase_time 0 purchase_value 0 device_id 0 source 0 browser 0 sex 25 ip_address 0 class 0 category 0 dob 29 name 0 email 0 dtype: int64
df.nunique() # gives count distinct values in each field
user_id 999 signup_time 995 purchase_time 998 purchase_value 994 device_id 994 source 3 browser 5 sex 2 ip_address 995 class 2 category 5 dob 917 name 999 email 998 dtype: int64
#Unique values of a single column
df.browser.nunique()
5
df.browser.value_counts()
Chrome 397 IE 260 Safari 166 FireFox 155 Opera 21 Name: browser, dtype: int64
x = df[['browser','source']].value_counts() # Saving the results in a DataFrame
print(x)
browser source Chrome SEO 165 Ads 151 IE Ads 109 SEO 103 Chrome Direct 81 Safari Ads 75 FireFox Ads 64 SEO 64 Safari SEO 58 IE Direct 48 Safari Direct 33 FireFox Direct 27 Opera Ads 9 SEO 9 Direct 3 dtype: int64
# Another way to do frequency analysis is - crosstab.
# Crosstab is frequency analysis with two variables
pd.crosstab(df['browser'], df['sex'])
sex | F | M |
---|---|---|
browser | ||
Chrome | 167 | 218 |
FireFox | 61 | 89 |
IE | 110 | 144 |
Opera | 8 | 11 |
Safari | 68 | 98 |
#More on such summarization will be covered in Module #14
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 999 entries, 0 to 998 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 user_id 999 non-null int64 1 signup_time 999 non-null object 2 purchase_time 999 non-null object 3 purchase_value 999 non-null int64 4 device_id 999 non-null object 5 source 999 non-null object 6 browser 999 non-null object 7 sex 974 non-null object 8 ip_address 999 non-null float64 9 class 999 non-null int64 10 category 999 non-null object 11 dob 970 non-null object 12 name 999 non-null object 13 email 999 non-null object dtypes: float64(1), int64(3), object(10) memory usage: 109.4+ KB
df.columns
Index(['user_id', 'signup_time', 'purchase_time', 'purchase_value', 'device_id', 'source', 'browser', 'sex', 'ip_address', 'class', 'category', 'dob', 'name', 'email'], dtype='object')
len(df) # gives numbers of records
999
df.shape # gives numbers of records and number of fields in data
(999, 14)
Module 4 : Subsetting data Vertically and Horizontally (selecting subset of a DataFrame)¶
Selecting only few columns from a DataFrame¶
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
sample = df.name
sample.head(3)
0 Aaron P Maashoh 1 Rick D Rothackerj 2 Harriet M McLeodd Name: name, dtype: object
sample = df[["name","dob","sex"]]
sample.head(3)
name | dob | sex | |
---|---|---|---|
0 | Aaron P Maashoh | 2/22/1976 | M |
1 | Rick D Rothackerj | 1/2/1962 | F |
2 | Harriet M McLeodd | 3/5/1962 | M |
#### Selecting few rows from DataFrame
s10 = df[10:13] # this will give us data from 10th to 12th record
s10
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 199700 | 7/11/2015 18:26 | 10/28/2015 21:59 | 7603 | TEPSJVVXGNTYR | Ads | Safari | F | 1.819009e+09 | 0 | health_care | 6/17/1980 | Matt U Falloonz | matt.falloonz@brandermill.com |
11 | 73884 | 5/29/2015 16:22 | 6/16/2015 5:45 | 69508 | ZTZZJUCRDOCJZ | Direct | Chrome | M | 4.038285e+09 | 0 | cosmetics | 6/7/1983 | Emily H Flitterl | emily.flitterl@tntech.edu |
12 | 79203 | 6/16/2015 21:19 | 6/21/2015 3:29 | 31999 | IBPNKSMCKUZWD | SEO | Safari | M | 4.161541e+09 | 0 | apparels | 6/19/1982 | Olivia X Oranr | olivia_oranr@sutletgroup.com |
Location (index) and label based subsetting of data¶
# Dictionary with key:value pair
technologies = {'Courses':["Spark","PySpark","Hadoop","Python","pandas"],
'Fee' :[20000,25000,26000,22000,24000],
'Duration':['30day','40days','35days','40days','60days'],
'Discount':[1000,2300,1200,2500,2000]
}
simple_data = pd.DataFrame(technologies)
simple_data
Courses | Fee | Duration | Discount | |
---|---|---|---|---|
0 | Spark | 20000 | 30day | 1000 |
1 | PySpark | 25000 | 40days | 2300 |
2 | Hadoop | 26000 | 35days | 1200 |
3 | Python | 22000 | 40days | 2500 |
4 | pandas | 24000 | 60days | 2000 |
We can make a column as Index as well¶
new_data = simple_data.set_index('Courses')
new_data
Fee | Duration | Discount | |
---|---|---|---|
Courses | |||
Spark | 20000 | 30day | 1000 |
PySpark | 25000 | 40days | 2300 |
Hadoop | 26000 | 35days | 1200 |
Python | 22000 | 40days | 2500 |
pandas | 24000 | 60days | 2000 |
or can make a custom index¶
index_labels=['row1','row2','row3','row4','row5']
data_with_named_index = pd.DataFrame(technologies,index=index_labels)
data_with_named_index
Courses | Fee | Duration | Discount | |
---|---|---|---|---|
row1 | Spark | 20000 | 30day | 1000 |
row2 | PySpark | 25000 | 40days | 2300 |
row3 | Hadoop | 26000 | 35days | 1200 |
row4 | Python | 22000 | 40days | 2500 |
row5 | pandas | 24000 | 60days | 2000 |
# Select Single Row by Index 'Label': Selecting a single column using the ".loc" attribute
simple_data.loc[1]
Courses PySpark Fee 25000 Duration 40days Discount 2300 Name: 1, dtype: object
# Now let's try to use loc on the dataframe with named Index -- This should not work, You can cry without #
#data_with_named_index.loc[1]
# Select Single Row by Index Label: Selecting a single column using the ".loc" attribute
print(data_with_named_index.loc['row2'])
Courses PySpark Fee 25000 Duration 40days Discount 2300 Name: row2, dtype: object
# Select Single Row by Index
print(data_with_named_index.iloc[1])
Courses PySpark Fee 25000 Duration 40days Discount 2300 Name: row2, dtype: object
#this will give error
# print(data_with_named_index.iloc['row2'])
# this will give error
# df.loc[:,1:5]
# Select a set of Columns between two Labels(column-names)
slice0 = df.loc[:,'device_id':'class']
slice0.head(3)
device_id | source | browser | sex | ip_address | class | |
---|---|---|---|---|---|---|
0 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 |
1 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 |
2 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 |
# Select Columns between two Indexes:
# Includes Index 1 & Excludes 4
slice00 = df.iloc[:,1:4]
slice00.head(3)
signup_time | purchase_time | purchase_value | |
---|---|---|---|
0 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 |
1 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 |
2 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 |
#Access elements with index label 40 to 100 by 10, we make use of .loc[]
slice1 = df.loc[40:70:10]
slice1
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
40 | 81113 | 8/16/2015 12:56 | 8/24/2015 5:00 | 31921 | BKQVBSSFGETUQ | Direct | Chrome | F | 3.682312e+09 | 0 | home_essentials | 11/14/1971 | Michael L Ermanl | michael.ermanl@brandermill.com |
50 | 254562 | 5/12/2015 18:58 | 7/31/2015 11:36 | 56245 | VYSNKKTTQIXQC | Ads | FireFox | F | 2.977122e+09 | 0 | health_care | 7/5/1970 | Sinead P Carewi | sinead.carewi@yahoo.com |
60 | 324683 | 1/18/2015 2:51 | 4/19/2015 22:24 | 34657 | JFVSIORUJBMXS | Ads | IE | M | 4.228028e+09 | 1 | home_essentials | NaN | Dhanya E Skariachand | dhanya.skariachand@gmail.com |
70 | 99299 | 7/22/2015 17:08 | 8/10/2015 20:44 | 65902 | JTHVCCWTCLEWQ | Ads | IE | F | 3.660879e+09 | 0 | home_essentials | 10/26/1991 | Laura W Noonanp | laura.noonanp@yahoo.com |
# selecting 0th, 2th, 4th, and 7th index rows
slice2 = df.loc[[0, 2, 4, 7]]
slice2
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
4 | 221365 | 7/21/2015 7:09 | 9/9/2015 18:40 | 45016 | NAUITBZFJKHWW | Ads | Safari | M | 4.155831e+08 | 0 | home_essentials | 8/25/1970 | Laurence H Frosty | laurence.frosty@tntech.edu |
7 | 360585 | 4/6/2015 7:35 | 5/25/2015 17:21 | 74769 | HPUCUYLMJBYFW | Ads | Opera | M | 1.692459e+09 | 0 | health_care | 11/24/1981 | Lisa E Baertleinu | lisa.baertleinu@hotmail.com |
#selecting all rows and only the columns from 4 to 7
slice3 = df.iloc[:,4:7]
print(slice3.head(),"\n",slice3.shape)
device_id source browser 0 QVPSPJUOCKZAR SEO Chrome 1 EOGFQPIZPYXFZ Ads Chrome 2 YSSKYOSJHPPLJ SEO Opera 3 ATGTXKYKUDUQN SEO Safari 4 NAUITBZFJKHWW Ads Safari (999, 3)
# selecting rows from 40 to 60 by 10 and columns from 4 to 7
slice4 = df.iloc[40:70:10,4:7]
slice4
device_id | source | browser | |
---|---|---|---|
40 | BKQVBSSFGETUQ | Direct | Chrome |
50 | VYSNKKTTQIXQC | Ads | FireFox |
60 | JFVSIORUJBMXS | Ads | IE |
How to drop variables?¶
# option axis= is optional, but default is 0.
# 0 : 'index'
# 1 : 'columns'
# Remove columns by column-names
s1 = df.drop(['dob','name','email'], axis=1)
s1.head(2)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 732758368.8 | 0 | home_essentials |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 350311387.9 | 0 | apparels |
# Remove all columns between column index 1 to 4
s2 = df.drop(df.iloc[:, 1:5], axis=1)
s2.head(2)
user_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | SEO | Chrome | M | 732758368.8 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | Ads | Chrome | F | 350311387.9 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
How to rename variables?¶
r1 = df.rename(columns = {'dob':'date_of_birth', 'sex':'gender'})
r1.head(2)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | gender | ip_address | class | category | date_of_birth | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 732758368.8 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 350311387.9 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
# Use option (inplace = True) if willing to make changes to the existing DataFrame
r1.rename(columns = {'date_of_birth':'Birth_date'}, inplace = True)
r1.head(2)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | gender | ip_address | class | category | Birth_date | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 732758368.8 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 350311387.9 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
Module 5: Conditional filter and LOGICAL operators similar to where clause in SAS/SQL¶
Extract rows with a single condition using operators¶
above = df[df.purchase_value > 90000]
above.purchase_value.min()
90023
Extract rows with multiple conditions using operators " | ". It filters data similar to OR operater in where caluse of SAS/SQL.¶
chrm_ie = df[(df.browser == "Chrome") | (df.browser == "IE")]
chrm_ie.browser.value_counts()
Chrome 397 IE 260 Name: browser, dtype: int64
.isin() : multiple "OR" conditions over a single column¶
chrm_ie = df[df.browser.isin(["Chrome", "IE","FireFox"])]
chrm_ie.browser.value_counts()
Chrome 397 IE 260 FireFox 155 Name: browser, dtype: int64
use "&" symbol in pandas as an “AND” operator of SQL. df [(condition1) & (condition2)]¶
above_chrm = df[(df.purchase_value > 90000) & (df.browser != "Chrome")]
above_chrm.purchase_value.min()
90133
Using a NOT operator, to inverse above condition¶
above_chrm = df[~((df.purchase_value > 90000) & (df.browser != "Chrome"))]
above_chrm.purchase_value.max()
99743
use "between" to filter a numerical column between a range of values¶
btw = df[df['purchase_value'].between(80000, 90000)]
btw.purchase_value.min(), btw.purchase_value.max()
(80046, 89915)
How to use contains condition in where clause¶
data_subset = df[df['source'].str.contains("SE")]
data_subset.source.value_counts()
SEO 399 Name: source, dtype: int64
Unlike SAS and SQL, there is no direct LIKE % functionality in Python or Pandas, however we can use it indirectly or using regular expressions. We have covered it later in this tutorial.¶
Module 6 : Derive a new column(s)¶
SELECT *, purchase_value/82.52 as convt FROM tips;
df["Value_in_USD"] = df.purchase_value/82.52
df[["purchase_value","Value_in_USD"]].head()
purchase_value | Value_in_USD | |
---|---|---|
0 | 65278 | 791.056714 |
1 | 96399 | 1168.189530 |
2 | 57296 | 694.328648 |
3 | 43650 | 528.962676 |
4 | 45016 | 545.516238 |
map() function for series [one column / field] is used to substitute each value in a series with another value¶
# "f" : This formatter is specifically used for formatting float values
# {,.2f} : Adds a comma and keeps to two decimal points
# {.2f} : keeps to two decimal points
# "$" : used to add a dollar currency sign
# You can mix and match from above option, like the below example,
# we are using a mix of dollar, comma and rounding to two decimal points [acutal value doesn't change]
df["Value_in_USD"] = (df.purchase_value/82.52).map('${:}'.format)
df[["purchase_value","Value_in_USD"]].head(2)
purchase_value | Value_in_USD | |
---|---|---|
0 | 65278 | $791.0567135239942 |
1 | 96399 | $1168.189529810955 |
df["Value_in_USD"] = (df.purchase_value/82.52).map('${: ,.2f}'.format)
df[["purchase_value","Value_in_USD"]].head(2)
purchase_value | Value_in_USD | |
---|---|---|
0 | 65278 | $ 791.06 |
1 | 96399 | $ 1,168.19 |
Add counter : column with incremental values from 1 onwards with range(start, stop, step)¶
start = 1
df['increment_by_1'] = range(start, start + df.shape[0])
df[['user_id', 'increment_by_1']].head()
user_id | increment_by_1 | |
---|---|---|
0 | 22058 | 1 |
1 | 333320 | 2 |
2 | 1359 | 3 |
3 | 150084 | 4 |
4 | 221365 | 5 |
print('Count of row :', df.shape[0], "\n",
'Count of columns :', df.shape[1])
Count of row : 999 Count of columns : 16
# with step size 5
start = 0
df['increment_by_5'] = range(start, start + (5 * df.shape[0]), 5)
df[['user_id', 'increment_by_1', 'increment_by_5']].head()
user_id | increment_by_1 | increment_by_5 | |
---|---|---|---|
0 | 22058 | 1 | 0 |
1 | 333320 | 2 | 5 |
2 | 1359 | 3 | 10 |
3 | 150084 | 4 | 15 |
4 | 221365 | 5 | 20 |
df["email_length"] = df.email.str.len()
df[["email","email_length"]].head()
email_length | ||
---|---|---|
0 | aaron.maashoh@yahoo.com | 23 |
1 | rick_rothackerj@yahoo.com | 25 |
2 | harriet.mcleodd@gmail.com | 25 |
3 | sinead.carews@gmail.com | 23 |
4 | laurence.frosty@tntech.edu | 26 |
.str.find() - finds the first occurenece position of a character in a column of strings, this is CASE Sensitive function¶
df["Position_of_@"] = df.email.str.find('@')
df["Position_of_a"] = df.email.str.find('a')
df[["email", "Position_of_@",'Position_of_a']].head()
Position_of_@ | Position_of_a | ||
---|---|---|---|
0 | aaron.maashoh@yahoo.com | 13 | 0 |
1 | rick_rothackerj@yahoo.com | 15 | 9 |
2 | harriet.mcleodd@gmail.com | 15 | 1 |
3 | sinead.carews@gmail.com | 13 | 4 |
4 | laurence.frosty@tntech.edu | 15 | 1 |
.str.findall() - It is similar to the above mentioned find() function, however, it will return ALL the positions of the character instead¶
df["all_a"] = df.email.str.findall("a")
df[["email","all_a"]].head()
all_a | ||
---|---|---|
0 | aaron.maashoh@yahoo.com | [a, a, a, a, a] |
1 | rick_rothackerj@yahoo.com | [a, a] |
2 | harriet.mcleodd@gmail.com | [a, a] |
3 | sinead.carews@gmail.com | [a, a, a] |
4 | laurence.frosty@tntech.edu | [a] |
df["all_a_length"] = df.all_a.str.len()
df[["email","all_a", "all_a_length"]].head()
all_a | all_a_length | ||
---|---|---|---|
0 | aaron.maashoh@yahoo.com | [a, a, a, a, a] | 5 |
1 | rick_rothackerj@yahoo.com | [a, a] | 2 |
2 | harriet.mcleodd@gmail.com | [a, a] | 2 |
3 | sinead.carews@gmail.com | [a, a, a] | 3 |
4 | laurence.frosty@tntech.edu | [a] | 1 |
.str.split( )[ ] - extracts a word in a column based on a delimiter(splitter) and then reference the word by index, it is similar to the SCAN function in SAS¶
This is simple version of code, however later all the functions will be explained using Regular Expressions, which are more powerful, versatile and recommended¶
#expand - Expand the splitted strings into separate columns.
#If True, return DataFrame/MultiIndex expanding dimensionality.
#If False, return Series/Index, containing lists of strings.
df["email_id"] = df.email.str.split("@", expand = True)[0]
df["email_provider"] = df.email.str.split("@", expand = True)[1]
df[["email","email_id","email_provider"]].head()
email_id | email_provider | ||
---|---|---|---|
0 | aaron.maashoh@yahoo.com | aaron.maashoh | yahoo.com |
1 | rick_rothackerj@yahoo.com | rick_rothackerj | yahoo.com |
2 | harriet.mcleodd@gmail.com | harriet.mcleodd | gmail.com |
3 | sinead.carews@gmail.com | sinead.carews | gmail.com |
4 | laurence.frosty@tntech.edu | laurence.frosty | tntech.edu |
df.email.str.split("@",expand=False)
0 [aaron.maashoh, yahoo.com] 1 [rick_rothackerj, yahoo.com] 2 [harriet.mcleodd, gmail.com] 3 [sinead.carews, gmail.com] 4 [laurence.frosty, tntech.edu] ... 994 [edward_krudyl, hotmail.com] 995 [praveen.menonw, gmail.com] 996 [ra.alpern, flossresearch.com] 997 [julia.edwardsj, sutletgroup.com] 998 [conor.humphriesk, gmail.com] Name: email, Length: 999, dtype: object
df['fname'] = df.name.str.split(expand=True)[0]
df['mname'] = df.name.str.split(expand=True)[1]
df['lname'] = df.name.str.split(expand=True)[2]
df[['name','fname','mname','lname']].head()
# please note, default delimiter in pandas is SPACE
name | fname | mname | lname | |
---|---|---|---|---|
0 | Aaron P Maashoh | Aaron | P | Maashoh |
1 | Rick D Rothackerj | Rick | D | Rothackerj |
2 | Harriet M McLeodd | Harriet | M | McLeodd |
3 | Sinead M Carews | Sinead | M | Carews |
4 | Laurence H Frosty | Laurence | H | Frosty |
df['positional_part_of_string'] = df.device_id.str[2:6]
df[['device_id','positional_part_of_string']].head()
device_id | positional_part_of_string | |
---|---|---|
0 | QVPSPJUOCKZAR | PSPJ |
1 | EOGFQPIZPYXFZ | GFQP |
2 | YSSKYOSJHPPLJ | SKYO |
3 | ATGTXKYKUDUQN | GTXK |
4 | NAUITBZFJKHWW | UITB |
.str.upper() .str.lower() .str.title()¶
df['uppercase'] = df.name.str.upper()
df['lowercase'] = df.name.str.lower()
df['propercase'] = df.name.str.title()
df[['uppercase', 'lowercase', 'propercase']].head()
uppercase | lowercase | propercase | |
---|---|---|---|
0 | AARON P MAASHOH | aaron p maashoh | Aaron P Maashoh |
1 | RICK D ROTHACKERJ | rick d rothackerj | Rick D Rothackerj |
2 | HARRIET M MCLEODD | harriet m mcleodd | Harriet M Mcleodd |
3 | SINEAD M CAREWS | sinead m carews | Sinead M Carews |
4 | LAURENCE H FROSTY | laurence h frosty | Laurence H Frosty |
.str.contains() - checks if the string contains a particular substring¶
df['cont'] = df.category.str.contains("_")
df[['category','cont']].head()
category | cont | |
---|---|---|
0 | home_essentials | True |
1 | apparels | False |
2 | electronics | False |
3 | health_care | True |
4 | home_essentials | True |
How do we apply above formula to a data seubseting¶
new_data = df[df.category.str.contains("_")]
new_data.category
0 home_essentials 3 health_care 4 home_essentials 7 health_care 10 health_care ... 982 health_care 986 home_essentials 988 health_care 991 home_essentials 993 health_care Name: category, Length: 381, dtype: object
str.replace() - to replace a part of the string with another one¶
df['email_new'] = df.email.str.replace('.com','.org',regex=False)
df['email_new'] = df.email_new.str.replace('.edu','.org',regex=False)
df[["email", "email_new"]].head()
email_new | ||
---|---|---|
0 | aaron.maashoh@yahoo.com | aaron.maashoh@yahoo.org |
1 | rick_rothackerj@yahoo.com | rick_rothackerj@yahoo.org |
2 | harriet.mcleodd@gmail.com | harriet.mcleodd@gmail.org |
3 | sinead.carews@gmail.com | sinead.carews@gmail.org |
4 | laurence.frosty@tntech.edu | laurence.frosty@tntech.org |
.replace() : It is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values. It requires a dictionary to define the mapping {from : to}.¶
df["sex_long"] = df["sex"].replace({"M": "Male", "F": "Female"})
df[["sex","sex_long"]].head()
sex | sex_long | |
---|---|---|
0 | M | Male |
1 | F | Female |
2 | M | Male |
3 | M | Male |
4 | M | Male |
.str.cat() - Concatenating string columns in small datasets¶
df['fname'] = df.name.str.split(expand=True)[0]
df['mname'] = df.name.str.split(expand=True)[1]
df['lname'] = df.name.str.split(expand=True)[2]
df[['name','fname','mname','lname']].head()
df["Full_Name"] = df.fname.str.cat(df.lname)
df[["name","fname", "lname", "Full_Name"]].head()
df["Full_Name_NEW"] = df.fname.str.cat(df.lname, sep=' ')
df[['name',"fname", "lname", "Full_Name", "Full_Name_NEW"]].head()
df["Full_Name_FULL"] = df.fname.str.cat(df.lname, sep=' ').str.cat(df.mname,sep=' ')
df[['name',"fname", "lname", "Full_Name", "Full_Name_NEW","Full_Name_FULL"]].head()
name | fname | lname | Full_Name | Full_Name_NEW | Full_Name_FULL | |
---|---|---|---|---|---|---|
0 | Aaron P Maashoh | Aaron | Maashoh | AaronMaashoh | Aaron Maashoh | Aaron Maashoh P |
1 | Rick D Rothackerj | Rick | Rothackerj | RickRothackerj | Rick Rothackerj | Rick Rothackerj D |
2 | Harriet M McLeodd | Harriet | McLeodd | HarrietMcLeodd | Harriet McLeodd | Harriet McLeodd M |
3 | Sinead M Carews | Sinead | Carews | SineadCarews | Sinead Carews | Sinead Carews M |
4 | Laurence H Frosty | Laurence | Frosty | LaurenceFrosty | Laurence Frosty | Laurence Frosty H |
"+" - Concatenating string columns in larger datasets¶
df["FULL_NAME"] = df.fname + " " + df.mname + " " + df.lname
df[["fname", "mname", "lname", "FULL_NAME"]].head()
fname | mname | lname | FULL_NAME | |
---|---|---|---|---|
0 | Aaron | P | Maashoh | Aaron P Maashoh |
1 | Rick | D | Rothackerj | Rick D Rothackerj |
2 | Harriet | M | McLeodd | Harriet M McLeodd |
3 | Sinead | M | Carews | Sinead M Carews |
4 | Laurence | H | Frosty | Laurence H Frosty |
details = {'Name' : [' Christophor', 'Robin ', ' Samantha ', ' Julia '],
'Age' : [23, 21, 22, 21],
'University' : ['BHU', 'JNU', 'DU', 'BHU'],
}
sample = pd.DataFrame(details)
sample
Name | Age | University | |
---|---|---|---|
0 | Christophor | 23 | BHU |
1 | Robin | 21 | JNU |
2 | Samantha | 22 | DU |
3 | Julia | 21 | BHU |
sample["leading_blanks"] = sample.Name.str.lstrip()
sample["trailing_blanks"] = sample.Name.str.rstrip()
sample["both_blanks"] = sample.Name.str.strip()
print(sample)
Name Age University leading_blanks trailing_blanks both_blanks 0 Christophor 23 BHU Christophor Christophor Christophor 1 Robin 21 JNU Robin Robin Robin 2 Samantha 22 DU Samantha Samantha Samantha 3 Julia 21 BHU Julia Julia Julia
.str.[start:stop:step] - This expression extracts all the characters from start to stop and by stip as − 1 is to reverse the order¶
sample['reverse'] = sample.University.str[::-1]
sample[["University", "reverse"]]
University | reverse | |
---|---|---|
0 | BHU | UHB |
1 | JNU | UNJ |
2 | DU | UD |
3 | BHU | UHB |
Module 7 : Regular Expressions (popular by name "regex")¶
To tackle textual-data-related tasks such as web scraping, sentiment analysis, and string manipulation, regex is a crucial tool to learn¶
Regular Expression is a sequence of characters that make up a pattern to find, replace, and extract textual data.¶
The built-in library in Python for handling regular expression is "re"¶
import re
1) re.findall(pattern, text) — Returns all the matched strings in a list.¶
text = 'There was further spike in negative sentiments with the outbreak.'
re.findall("the", text)
['the', 'the']
*Note : Above code searches for all 'the' in the text and returns a list of all matched strings.
Use the parameter flags=re.I or flags=re.IGNORECASE for case-insensitive¶
text = 'There was further spike in negative sentiments with the outbreak.'
re.findall("the", text, flags=re.IGNORECASE)
['The', 'the', 'the']
2) re.search(pattern, text) — Returns the first occurrence of the match as a match object.¶
text = 'There was further spike in negative sentiments with the outbreak.'
re.search("the", text)
<re.Match object; span=(13, 16), match='the'>
print(re.search("the", text).group(),"\n",
re.search("the", text).span(), "\n",
re.search("the", text).start(), "\n",
re.search("the", text).end() )
the (13, 16) 13 16
3) re.finditer(pattern, text) — Returns an iterator of match objects that we then wrap with a list.¶
text = 'There was further spike in negative sentiments with the outbreak.'
match = re.finditer('the', text, flags=re.I)
x = list(match)
x
[<re.Match object; span=(0, 3), match='The'>, <re.Match object; span=(13, 16), match='the'>, <re.Match object; span=(52, 55), match='the'>]
x[-1].span()
(52, 55)
text = 'my pincode is 201007'
re.sub('10', '**', text)
'my pincode is 20**07'
text = "We are excited to launch our new Product. #newcourse #machinelearning #python"
re.split("#", text)
['We are excited to launch our new Product. ', 'newcourse ', 'machinelearning ', 'python']
text = "We are excited to launch our new Product. #newcourse #machinelearning #python"
re.split("#", text)[1]
'newcourse '
Regex MetaCharacters¶
# Basic meta charecters used in regular expressions
# \ -> helps take then following character literally
# . -> matches a single character
#
# * -> matches 'zero' or more occourrence of the previous charecter
# + -> matches 'one' or more occourrence of the previous charecter
#
# ^ -> matches any character start of a string
# $ -> matches any character end of a string
#
# [] -> matches one of the set of characters within []
# [a-z] -> matches on of the range of characters in lowercase alphabet
# [^abc] -> matches a character that is not a, b, or c
# ? -> Matches zero or one occurrence
1) '.' the dot character, or wildcard — this matches and returns any character in the string, except a new line.¶
text = "Wow!... We're @club_199."
print(re.search(r'.', text)) # gives the first match location
print(re.findall(r'.', text)) # gives all the matches
<re.Match object; span=(0, 1), match='W'> ['W', 'o', 'w', '!', '.', '.', '.', ' ', 'W', 'e', "'", 'r', 'e', ' ', '@', 'c', 'l', 'u', 'b', '_', '1', '9', '9', '.']
When we added a '\' before '.' it changed the meaning, now it will look for literal '.'¶
text = "Wow!... We're @club_199."
print(re.search('\.', text)) # gives the first match location
print(re.findall('\.', text)) # gives all the matches
<re.Match object; span=(4, 5), match='.'> ['.', '.', '.', '.']
2) '*' (zero or more) — Matches if the previous pattern appears zero or many times.¶
# It is little tricky to inerpret it, we will help you
# in below case, when we are looking for the* , we are actually asking to look for zero or more 'e' just after 'th'
# Trick to remember : stars can we zero on a cloudy day, and on a clear sky day we can have n number of stars
text = 'There was further spike in negative sentiments with the outbreak.'
re.findall('the*', text, flags = re.I)
['The', 'the', 'th', 'the']
text = 'ac abc abbbc dabc abdc'
pattern = 'ab*c'
re.findall(pattern, text)
['ac', 'abc', 'abbbc', 'abc']
text = 'There was further spike in negative sentiments with the outbreak.'
re.findall('the+', text, flags = re.I)
['The', 'the', 'the']
text = 'ac abc abbbc dabc abdc'
pattern = 'ab+c'
re.findall(pattern, text)
['abc', 'abbbc', 'abc']
text = '7.6% in 2020 now 2022/23 budget'
re.findall(r'\d+', text)
['7', '6', '2020', '2022', '23']
4) ? (zero or once) — Matches if the previous pattern appears zero or one time only¶
text = 'There was further spike in negative sentiments with thee outbreak.'
re.findall('the?', text, flags = re.IGNORECASE)
['The', 'the', 'th', 'the']
# You might be thinking why in the above example, why it looked for
# 'the' in 'thee', while we said it looks for ONLY 0 and 1 char instance.
# Below example will make it more clear
text = 'ac abc abbbc dabc abdc'
pattern = 'ab?c'
re.findall(pattern, text)
['ac', 'abc', 'abc']
?: — matches but doesn’t capture the group with Include ?:¶
text = 'On 23th total 42%b and on 17th total 35%'
re.findall(r'(\d+)(?:%)', text)
['42', '35']
re.findall(r'(\w+)(?:l)', text)
['tota', 'tota']
5). ^ — matches only the start of a text, and therefore ^ is written as the first character in the pattern.¶
#Starts with two digits
text = '500,000 units' '90,000 hp'
pattern = '^5'
re.findall(pattern, text)
['5']
6) $ — matches the end of the string and is therefore written at the end of a pattern.¶
#Starts with two digits
text = '500,000 units' '90,000 hp'
pattern = 'hp$'
re.findall(pattern, text)
['hp']
7) [ ] — matches ANY of the characters inside the square brackets.¶
text = "We're @club_007"
re.findall('[aeiou]', text)
['e', 'e', 'u']
8) [^] Having a hat ^ character right after the opening square bracket negates the character set.¶
text = "We're @club_007"
re.findall('[^aeiou]', text)
['W', "'", 'r', ' ', '@', 'c', 'l', 'b', '_', '0', '0', '7']
Extended Regular Expressions:¶
# \d -> Any digit, equivalent to [0-9]
# \D -> Any non-digit, equivalent to [^0-9]
# \w -> Any alphanumeric, equivalent to [a-zA-Z0-9_]
# \W -> Non-alphanumeric, equivalent to [^a-zA-Z0-9_]
# \s -> Any whitespace character
# \S -> Any nonwhitespace character
#
# () -> Scoping for extraction
# {} -> Frequency for extraction
# ? -> Make a pattern non-greedy
1) \d — any digit, 0 to 9.¶
text = "We're @club_199"
re.findall('\d', text)
['1', '9', '9']
2) \D — Any non-digit. It negates \d.¶
text = "We're @club_199"
print(re.findall('\D', text))
['W', 'e', "'", 'r', 'e', ' ', '@', 'c', 'l', 'u', 'b', '_']
3) '\w' (lowercase w) — Any alphanumeric character (letter, digit, or underscore).¶
text = "We're @club_199"
print(re.findall('\w', text))
['W', 'e', 'r', 'e', 'c', 'l', 'u', 'b', '_', '1', '9', '9']
4) '\W' (uppercase w) — anything that is not \w such as spaces, and special characters.¶
text = "We're @club_199"
print(re.findall('\W', text))
["'", ' ', '@']
5) '\s' (lowercase s) — A white space.¶
text = "We're @club_007"
print(re.findall('\s', text))
[' ']
6) '\S' (uppercase s) — Negates \s. Returns anything that is not a white space.¶
text = "We're @club_007"
print(re.findall('\S', text))
['W', 'e', "'", 'r', 'e', '@', 'c', 'l', 'u', 'b', '_', '0', '0', '7']
7) () — When you write a regex pattern, you can define groups using parentheses.¶
# Below examples makes the role of '.' more clear.
# when we are looking for .o. we want to find '0' with at least char before that and one after that.
# That char can be space as well.
text = 'Once Yvonne worked for vOn on a project'
pattern = '(.o.)'
re.findall(pattern, text,flags = re.IGNORECASE)
['von', 'wor', 'for', 'vOn', ' on', 'roj']
8) {n} — Defines the exact number of times to match the previous character or pattern. e.g ‘d{3}’ matches ‘ddd’.¶
text = '7.6% in 2020 now 2022/23 budget, 99999'
re.findall('\d{4}', text)
['2020', '2022', '9999']
text = '7.6% in 2020 now 2022/23 budget, 99999'
re.findall('\d{1,4}', text)
['7', '6', '2020', '2022', '23', '9999', '9']
import pandas as pd
df = pd.DataFrame(
data= [
['@mlplus', 'We are excited to launch our new Product. #newcourse #machinelearning #python','ml.advance@mlplus.tech', '$1100,000*'],
['@kaustubhgupta', "@gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail",'kaustubh@edu.in', '0 paid'],
['@rajveer', 'Excited to lauch our new product! #newproduct #startup ','a.gupta9@twitter.me', 'less than $1000'],
['@jimmy', 'When will this coronavirus end? #thoughts','jimmy.987654321@facebook.pl', 'Amount of $8795'],
['@abhishek', 'I want to become data scientist. Any suggestions? @MLOps @dlml','abhishek@orkut.tech', '$700.50 total'],
['@ayushi', 'Missing foundational Course! @colllege','ayushi@university.org', '$40000 string created' ]
],
columns=['username', 'tweet', 'email', 'Amount'])
pd.set_option("max_colwidth", None)
df
username | tweet | Amount | ||
---|---|---|---|---|
0 | @mlplus | We are excited to launch our new Product. #newcourse #machinelearning #python | ml.advance@mlplus.tech | $1100,000* |
1 | @kaustubhgupta | @gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail | kaustubh@edu.in | 0 paid |
2 | @rajveer | Excited to lauch our new product! #newproduct #startup | a.gupta9@twitter.me | less than $1000 |
3 | @jimmy | When will this coronavirus end? #thoughts | jimmy.987654321@facebook.pl | Amount of $8795 |
4 | @abhishek | I want to become data scientist. Any suggestions? @MLOps @dlml | abhishek@orkut.tech | $700.50 total |
5 | @ayushi | Missing foundational Course! @colllege | ayushi@university.org | $40000 string created |
Removing hashtags using regex replace¶
# using replace function with regex pattern,
# regex=True and
# value as empty string
df["tweets_without_#"] = df.tweet.replace(to_replace='#\w+', regex=True, value='')
df[["tweet", "tweets_without_#"]]
tweet | tweets_without_# | |
---|---|---|
0 | We are excited to launch our new Product. #newcourse #machinelearning #python | We are excited to launch our new Product. |
1 | @gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail | @gmail Gmail is down for 30 minutes. What's the matter? |
2 | Excited to lauch our new product! #newproduct #startup | Excited to lauch our new product! |
3 | When will this coronavirus end? #thoughts | When will this coronavirus end? |
4 | I want to become data scientist. Any suggestions? @MLOps @dlml | I want to become data scientist. Any suggestions? @MLOps @dlml |
5 | Missing foundational Course! @colllege | Missing foundational Course! @colllege |
Replacing all domain suffixes with .edu using regex¶
df['email_with_edu'] = df.email.replace(to_replace=r'\.\w+', value='.edu', regex=True)
df[['email', 'email_with_edu']]
email_with_edu | ||
---|---|---|
0 | ml.advance@mlplus.tech | ml.edu@mlplus.edu |
1 | kaustubh@edu.in | kaustubh@edu.edu |
2 | a.gupta9@twitter.me | a.edu@twitter.edu |
3 | jimmy.987654321@facebook.pl | jimmy.edu@facebook.edu |
4 | abhishek@orkut.tech | abhishek@orkut.edu |
5 | ayushi@university.org | ayushi@university.edu |
df['email_with_edu'] = df.email.replace(to_replace= r'\.\w+$', value='.edu', regex=True)
df[['email', 'email_with_edu']]
email_with_edu | ||
---|---|---|
0 | ml.advance@mlplus.tech | ml.advance@mlplus.edu |
1 | kaustubh@edu.in | kaustubh@edu.edu |
2 | a.gupta9@twitter.me | a.gupta9@twitter.edu |
3 | jimmy.987654321@facebook.pl | jimmy.987654321@facebook.edu |
4 | abhishek@orkut.tech | abhishek@orkut.edu |
5 | ayushi@university.org | ayushi@university.edu |
Replace all the vowels in tweets with $¶
df['email_new'] = df.email.replace(to_replace= "[aeiouAEIOU]", regex=True, value='%')
df[['email', 'email_new']]
email_new | ||
---|---|---|
0 | ml.advance@mlplus.tech | ml.%dv%nc%@mlpl%s.t%ch |
1 | kaustubh@edu.in | k%%st%bh@%d%.%n |
2 | a.gupta9@twitter.me | %.g%pt%9@tw%tt%r.m% |
3 | jimmy.987654321@facebook.pl | j%mmy.987654321@f%c%b%%k.pl |
4 | abhishek@orkut.tech | %bh%sh%k@%rk%t.t%ch |
5 | ayushi@university.org | %y%sh%@%n%v%rs%ty.%rg |
Replace the numbers in tweets with text 00number00 using replace function and regex expressions¶
df['Amount_mask'] = df.Amount.replace(to_replace="\d+", regex=True, value='00number00')
df[['Amount', 'Amount_mask']]
Amount | Amount_mask | |
---|---|---|
0 | $1100,000* | $00number00,00number00* |
1 | 0 paid | 00number00 paid |
2 | less than $1000 | less than $00number00 |
3 | Amount of $8795 | Amount of $00number00 |
4 | $700.50 total | $00number00.00number00 total |
5 | $40000 string created | $00number00 string created |
Using regex comparison to extraxct numbers only¶
df['Amount_extracted'] = df['Amount'].str.replace(r'\D+', '', regex=True).astype('int')
df[['Amount', 'Amount_extracted']]
Amount | Amount_extracted | |
---|---|---|
0 | $1100,000* | 1100000 |
1 | 0 paid | 0 |
2 | less than $1000 | 1000 |
3 | Amount of $8795 | 8795 |
4 | $700.50 total | 70050 |
5 | $40000 string created | 40000 |
Extract all #Tag(s) only with findall and regular expression¶
df['tweet_words'] = df.tweet.str.findall(r"#(\w+)")
df[['tweet', 'tweet_words']]
tweet | tweet_words | |
---|---|---|
0 | We are excited to launch our new Product. #newcourse #machinelearning #python | [newcourse, machinelearning, python] |
1 | @gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail | [gmaildown, google, gmail] |
2 | Excited to lauch our new product! #newproduct #startup | [newproduct, startup] |
3 | When will this coronavirus end? #thoughts | [thoughts] |
4 | I want to become data scientist. Any suggestions? @MLOps @dlml | [] |
5 | Missing foundational Course! @colllege | [] |
Extract all character only with findall and regular expression¶
df['tweet_words'] = df.tweet.str.findall('[a-zA-Z]+')
df[['tweet', 'tweet_words']]
tweet | tweet_words | |
---|---|---|
0 | We are excited to launch our new Product. #newcourse #machinelearning #python | [We, are, excited, to, launch, our, new, Product, newcourse, machinelearning, python] |
1 | @gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail | [gmail, Gmail, is, down, for, minutes, What, s, the, matter, gmaildown, google, gmail] |
2 | Excited to lauch our new product! #newproduct #startup | [Excited, to, lauch, our, new, product, newproduct, startup] |
3 | When will this coronavirus end? #thoughts | [When, will, this, coronavirus, end, thoughts] |
4 | I want to become data scientist. Any suggestions? @MLOps @dlml | [I, want, to, become, data, scientist, Any, suggestions, MLOps, dlml] |
5 | Missing foundational Course! @colllege | [Missing, foundational, Course, colllege] |
Extract all non-character only with findall and regular expression¶
df['tweet_words'] = df.tweet.str.findall('[^a-zA-Z]+')
df[['tweet', 'tweet_words']]
tweet | tweet_words | |
---|---|---|
0 | We are excited to launch our new Product. #newcourse #machinelearning #python | [ , , , , , , , . #, #, #] |
1 | @gmail Gmail is down for 30 minutes. What's the matter? #gmaildown #google #gmail | [@, , , , , 30 , . , ', , , ? #, #, #] |
2 | Excited to lauch our new product! #newproduct #startup | [ , , , , , ! #, #, ] |
3 | When will this coronavirus end? #thoughts | [ , , , , ? #] |
4 | I want to become data scientist. Any suggestions? @MLOps @dlml | [ , , , , , . , , ? @, @] |
5 | Missing foundational Course! @colllege | [ , , ! @] |
Verify a set of strings with-in text¶
df['email_new'] = df['email'].str.contains(r'@?.tech')
df[['email', 'email_new']]
email_new | ||
---|---|---|
0 | ml.advance@mlplus.tech | True |
1 | kaustubh@edu.in | False |
2 | a.gupta9@twitter.me | False |
3 | jimmy.987654321@facebook.pl | False |
4 | abhishek@orkut.tech | True |
5 | ayushi@university.org | False |
A simple project using Regex¶
Email verification with Regex¶
data = {'Customer_id': ['cus_01', 'cus_02', 'cus_03', 'cus_04'],
'email': ['name.surname@gmail.com','anonymous123@yahoo.co.ukk','anonymous123@.com','a..@gmail.com']
}
df = pd.DataFrame(data)
df
Customer_id | ||
---|---|---|
0 | cus_01 | name.surname@gmail.com |
1 | cus_02 | anonymous123@yahoo.co.ukk |
2 | cus_03 | anonymous123@.com |
3 | cus_04 | a..@gmail.com |
regex_regular = re.compile(r'\w+[._-]*\w*@\w+\.[\w\.]{2,7}')
# How to read above regular expression? Here you go:
# (r'
# \w+ -> Atleast one or more Alphanumeric values should be there
# [._-]* --> Zero ore more . - or _ should be there
# \w* --> above two can follow by one or zero more alphanumeric
# @ --> Now after above strings an '@' should be there
# \w+ --> Atleast one or more Alphanumeric values should be there
# \. --> Now a period should be there
# [\w\.]{2,7} --> then should be followed by a Alphnumeric string of 2 to 7 length
# ')
re.fullmatch(pattern, string, flags). This method returns a match object only if the whole string matches the pattern¶
def isValid_regular(x):
if re.fullmatch(regex_regular, x):
y = 'Valid email'
else:
y = 'Invalid email'
return y
df['Is_ Valid_regular'] = df['email'].apply(isValid_regular)
df
Customer_id | Is_ Valid_regular | ||
---|---|---|---|
0 | cus_01 | name.surname@gmail.com | Valid email |
1 | cus_02 | anonymous123@yahoo.co.ukk | Valid email |
2 | cus_03 | anonymous123@.com | Invalid email |
3 | cus_04 | a..@gmail.com | Valid email |
regex_gmail = re.compile(r'\w+[._-]*\w*@gmail.com')
# How to read above regular expression? Here you go:
# (r'
# \w+ -> Atleast one or more Alphanumeric values should be there
# [._-]* --> Zero ore more . - or _ should be there
# \w* --> above two can follow by one or zero more alphanumeric
# @gmail\.com --> above all stuff followed by a '@gmail.com' will make it a valid gmail id, right?
# ')
def isValid_gmail(x):
if re.fullmatch(regex_gmail, x):
y = 'Valid email'
else:
y = 'Invalid email'
return y
df['Is_ Valid_gmail'] = df['email'].apply(isValid_gmail)
df
Customer_id | Is_ Valid_regular | Is_ Valid_gmail | ||
---|---|---|---|---|
0 | cus_01 | name.surname@gmail.com | Valid email | Valid email |
1 | cus_02 | anonymous123@yahoo.co.ukk | Valid email | Invalid email |
2 | cus_03 | anonymous123@.com | Invalid email | Invalid email |
3 | cus_04 | a..@gmail.com | Valid email | Valid email |
Now we can break the string as well, using regular expressions¶
pattern = '(\w+[._-]*\w*)@(\w+)\.([\w\.]{2,7})'
text = 'rajat.sas@gmail.com'
y = re.findall(pattern, text,flags = re.I)[0][0]
y
'rajat.sas'
def email_part1(x):
try:
pattern = '(\w+[._-]*\w*)@(\w+)\.([\w\.]{2,7})'
y = re.findall(pattern, x,flags = re.I)[0][0]
except:
y = 'Something went wrong'
return y
def email_domain(x):
try:
pattern = '(\w+[._-]*\w*)@(\w+)\.([\w\.]{2,7})'
y = re.findall(pattern, x,flags = re.I)[0][1]
except:
y = 'Something went wrong'
return y
print(email_part1(x = 'rajat.sas@gmail.com'), "\n", email_domain(x = 'rajat.sas@gmail.com'))
rajat.sas gmail
df['email_ID'] = df['email'].apply(email_part1)
df['Email_Domain'] = df['email'].apply(email_domain)
df
Customer_id | Is_ Valid_regular | Is_ Valid_gmail | email_ID | Email_Domain | ||
---|---|---|---|---|---|---|
0 | cus_01 | name.surname@gmail.com | Valid email | Valid email | name.surname | gmail |
1 | cus_02 | anonymous123@yahoo.co.ukk | Valid email | Invalid email | anonymous123 | yahoo |
2 | cus_03 | anonymous123@.com | Invalid email | Invalid email | Something went wrong | Something went wrong |
3 | cus_04 | a..@gmail.com | Valid email | Valid email | a.. | gmail |
def email_part1(x):
pattern = '(\w+[._-]*\w*)@(\w+)\.([\w\.]{2,7})'
y = re.findall(pattern, x,flags = re.I)[0][0]
return y
email_part1(x = 'rajat.sas@gmail.com')
'rajat.sas'
Don't worry!!! We will explain "apply" and 'def' stuff later in a separate blog, when we will cover UDF - User Defined Functions¶
Module 8 : Handling the Date Time¶
Unix epoch (00:00:00 UTC on 1 January 1970)¶
It is similar to base date is SAS
import pandas as pd
epoch_time = [0, 1,
60, 3600,
7200, 3600*24]
# Pandas pd.to_datetime() is able to convert any valid date string to datetime
pd.to_datetime(epoch_time, unit = 's')
DatetimeIndex(['1970-01-01 00:00:00', '1970-01-01 00:00:01', '1970-01-01 00:01:00', '1970-01-01 01:00:00', '1970-01-01 02:00:00', '1970-01-02 00:00:00'], dtype='datetime64[ns]', freq=None)
# 1 second = 1000 milliseconds
epoch_time = 1001
pd.to_datetime(epoch_time, unit = 'ms')
Timestamp('1970-01-01 00:00:01.001000')
import pandas as pd
epoch_time = [0, 1,
60, 3600,
7200, 3600*24]
# Pandas pd.to_datetime() is able to convert any valid date string to datetime
pd.to_datetime(epoch_time, unit = 's')
DatetimeIndex(['1970-01-01 00:00:00', '1970-01-01 00:00:01', '1970-01-01 00:01:00', '1970-01-01 01:00:00', '1970-01-01 02:00:00', '1970-01-02 00:00:00'], dtype='datetime64[ns]', freq=None)
# 1 second = 1000 milliseconds
epoch_time = 1001
pd.to_datetime(epoch_time, unit = 'ms')
Timestamp('1970-01-01 00:00:01.001000')
The unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or float number. Default is ‘ns’¶
dob = [-1 ,0, 1, 364, 10000, 19400]
pd.to_datetime(dob, unit='D')
DatetimeIndex(['1969-12-31', '1970-01-01', '1970-01-02', '1970-12-31', '1997-05-19', '2023-02-12'], dtype='datetime64[ns]', freq=None)
# Create a list of dates with different formats
doj = ['2023-03-25', 'Feb 3 2023','2023.02.03', '2023/02/03', '20230203', '02/03/2023', '03/02/2023']
pd.to_datetime(doj,infer_datetime_format=True)
DatetimeIndex(['2023-03-25', '2023-02-03', '2023-02-03', '2023-02-03', '2023-02-03', '2023-02-03', '2023-03-02'], dtype='datetime64[ns]', freq=None)
Customize date format¶
# The format parameter will instruct Pandas how to interpret your strings when converting them to DateTime objects.
pd.to_datetime('2022~05~15', format = '%Y~%m~%d')
Timestamp('2022-05-15 00:00:00')
pd.to_datetime('15/05/2022', format = '%d/%m/%Y')
Timestamp('2022-05-15 00:00:00')
DateTime Format Codes¶
#Invalid dates
#pd.to_datetime(['2-13-2023', 'Feb 13', '2023', '123'])
pd.to_datetime(['2-13-2023', 'Feb 13', '2023', '123'], errors='ignore')
Index(['2-13-2023', 'Feb 13', '2023', '123'], dtype='object')
pd.to_datetime(['2-13-2023', 'Feb 13', '2023', '123'], errors='coerce')
DatetimeIndex(['2023-02-13', 'NaT', '2023-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)
# importing data
import pandas as pd
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df = df[["user_id", "purchase_value", "signup_time", "dob"]]
df.head(2)
user_id | purchase_value | signup_time | dob | |
---|---|---|---|---|
0 | 22058 | 65278 | 2/24/2015 22:55 | 2/22/1976 |
1 | 333320 | 96399 | 6/7/2015 20:39 | 1/2/1962 |
# Check the data-types
df.dtypes
user_id int64 purchase_value int64 signup_time object dob object dtype: object
Pandas pd.to_datetime() is able to parse any valid date string to datetime¶
df["signup_time"] = pd.to_datetime(df['signup_time'])
df["dob"] = pd.to_datetime(df['dob'] )
df.dtypes
user_id int64 purchase_value int64 signup_time datetime64[ns] dob datetime64[ns] dtype: object
df.head(2)
user_id | purchase_value | signup_time | dob | |
---|---|---|---|---|
0 | 22058 | 65278 | 2015-02-24 22:55:00 | 1976-02-22 |
1 | 333320 | 96399 | 2015-06-07 20:39:00 | 1962-01-02 |
How to change the format of the date¶
df['dob_new'] = df.dob.dt.strftime('%d/%m/%Y')
df.head(2)
user_id | purchase_value | signup_time | dob | dob_new | |
---|---|---|---|---|---|
0 | 22058 | 65278 | 2015-02-24 22:55:00 | 1976-02-22 | 22/02/1976 |
1 | 333320 | 96399 | 2015-06-07 20:39:00 | 1962-01-02 | 02/01/1962 |
df.dtypes
user_id int64 purchase_value int64 signup_time datetime64[ns] dob datetime64[ns] dob_new object dtype: object
start and end date of the time series data¶
print("Minimum of dob :" , df.dob.min(), "\n"
"Maximum of dob :", df.dob.max(), "\n"
"Difference of min and max :", df.dob.max() - df.dob.min())
Minimum of dob : 1946-06-03 00:00:00 Maximum of dob : 1997-10-14 00:00:00 Difference of min and max : 18761 days 00:00:00
Get year, month, and day¶
df['signup_year'] = df['signup_time'].dt.year
df['signup_month'] = df['signup_time'].dt.month
df['signup_date'] = df['signup_time'].dt.month
df.head()
user_id | purchase_value | signup_time | dob | dob_new | signup_year | signup_month | signup_date | |
---|---|---|---|---|---|---|---|---|
0 | 22058 | 65278 | 2015-02-24 22:55:00 | 1976-02-22 | 22/02/1976 | 2015 | 2 | 2 |
1 | 333320 | 96399 | 2015-06-07 20:39:00 | 1962-01-02 | 02/01/1962 | 2015 | 6 | 6 |
2 | 1359 | 57296 | 2015-01-01 18:52:00 | 1962-03-05 | 05/03/1962 | 2015 | 1 | 1 |
3 | 150084 | 43650 | 2015-04-28 21:13:00 | 1974-07-03 | 03/07/1974 | 2015 | 4 | 4 |
4 | 221365 | 45016 | 2015-07-21 07:09:00 | 1970-08-25 | 25/08/1970 | 2015 | 7 | 7 |
Get the week of year, the day of week and leap year¶
df['signup_week_of_year'] = df['signup_time'].dt.isocalendar().week
df['signup_day_of_week'] = df['signup_time'].dt.dayofweek
df['signup_is_leap_year'] = df['signup_time'].dt.is_leap_year
df.head()
user_id | purchase_value | signup_time | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | signup_day_of_week | signup_is_leap_year | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 65278 | 2015-02-24 22:55:00 | 1976-02-22 | 22/02/1976 | 2015 | 2 | 2 | 9 | 1 | False |
1 | 333320 | 96399 | 2015-06-07 20:39:00 | 1962-01-02 | 02/01/1962 | 2015 | 6 | 6 | 23 | 6 | False |
2 | 1359 | 57296 | 2015-01-01 18:52:00 | 1962-03-05 | 05/03/1962 | 2015 | 1 | 1 | 1 | 3 | False |
3 | 150084 | 43650 | 2015-04-28 21:13:00 | 1974-07-03 | 03/07/1974 | 2015 | 4 | 4 | 18 | 1 | False |
4 | 221365 | 45016 | 2015-07-21 07:09:00 | 1970-08-25 | 25/08/1970 | 2015 | 7 | 7 | 30 | 1 | False |
df_1 = df.signup_time.apply(lambda x: x.day_name())
df_1.head()
0 Tuesday 1 Sunday 2 Thursday 3 Tuesday 4 Tuesday Name: signup_time, dtype: object
df_2 = df.signup_time.apply(lambda x: x.month_name())
df_2.head()
0 February 1 June 2 January 3 April 4 July Name: signup_time, dtype: object
# Monday is denoted by 0 and ends on Sunday which is denoted by 6
wd_mapping={
0: 'Monday',
1: 'Tuesday',
2: 'Wednesday',
3: 'Thursday',
4: 'Friday',
5: 'Saturday',
6: 'Sunday'
}
Get the week of year, the day of week and leap year¶
df['signup_week_of_year'] = df['signup_time'].dt.isocalendar().week
df['signup_day_of_week'] = df['signup_time'].dt.dayofweek
df['signup_is_leap_year'] = df['signup_time'].dt.is_leap_year
df.head()
user_id | purchase_value | signup_time | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | signup_day_of_week | signup_is_leap_year | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 65278 | 2015-02-24 22:55:00 | 1976-02-22 | 22/02/1976 | 2015 | 2 | 2 | 9 | 1 | False |
1 | 333320 | 96399 | 2015-06-07 20:39:00 | 1962-01-02 | 02/01/1962 | 2015 | 6 | 6 | 23 | 6 | False |
2 | 1359 | 57296 | 2015-01-01 18:52:00 | 1962-03-05 | 05/03/1962 | 2015 | 1 | 1 | 1 | 3 | False |
3 | 150084 | 43650 | 2015-04-28 21:13:00 | 1974-07-03 | 03/07/1974 | 2015 | 4 | 4 | 18 | 1 | False |
4 | 221365 | 45016 | 2015-07-21 07:09:00 | 1970-08-25 | 25/08/1970 | 2015 | 7 | 7 | 30 | 1 | False |
df_1 = df.signup_time.apply(lambda x: x.day_name())
df_1.head()
0 Tuesday 1 Sunday 2 Thursday 3 Tuesday 4 Tuesday Name: signup_time, dtype: object
df_2 = df.signup_time.apply(lambda x: x.month_name())
df_2.head()
0 February 1 June 2 January 3 April 4 July Name: signup_time, dtype: object
# Monday is denoted by 0 and ends on Sunday which is denoted by 6
wd_mapping={
0: 'Monday',
1: 'Tuesday',
2: 'Wednesday',
3: 'Thursday',
4: 'Friday',
5: 'Saturday',
6: 'Sunday'
}
df['signup_day_of_week']=df['signup_time'].dt.weekday.map(wd_mapping)
df.head()
user_id | purchase_value | signup_time | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | signup_day_of_week | signup_is_leap_year | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 65278 | 2015-02-24 22:55:00 | 1976-02-22 | 22/02/1976 | 2015 | 2 | 2 | 9 | Tuesday | False |
1 | 333320 | 96399 | 2015-06-07 20:39:00 | 1962-01-02 | 02/01/1962 | 2015 | 6 | 6 | 23 | Sunday | False |
2 | 1359 | 57296 | 2015-01-01 18:52:00 | 1962-03-05 | 05/03/1962 | 2015 | 1 | 1 | 1 | Thursday | False |
3 | 150084 | 43650 | 2015-04-28 21:13:00 | 1974-07-03 | 03/07/1974 | 2015 | 4 | 4 | 18 | Tuesday | False |
4 | 221365 | 45016 | 2015-07-21 07:09:00 | 1970-08-25 | 25/08/1970 | 2015 | 7 | 7 | 30 | Tuesday | False |
Get the age from the date of birth with the help of pd.to_datetime('today')¶
today = pd.to_datetime('today')
df['age'] = today.year - df['dob'].dt.year
df.head()
user_id | purchase_value | signup_time | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | signup_day_of_week | signup_is_leap_year | age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 65278 | 2015-02-24 22:55:00 | 1976-02-22 | 22/02/1976 | 2015 | 2 | 2 | 9 | Tuesday | False | 47.0 |
1 | 333320 | 96399 | 2015-06-07 20:39:00 | 1962-01-02 | 02/01/1962 | 2015 | 6 | 6 | 23 | Sunday | False | 61.0 |
2 | 1359 | 57296 | 2015-01-01 18:52:00 | 1962-03-05 | 05/03/1962 | 2015 | 1 | 1 | 1 | Thursday | False | 61.0 |
3 | 150084 | 43650 | 2015-04-28 21:13:00 | 1974-07-03 | 03/07/1974 | 2015 | 4 | 4 | 18 | Tuesday | False | 49.0 |
4 | 221365 | 45016 | 2015-07-21 07:09:00 | 1970-08-25 | 25/08/1970 | 2015 | 7 | 7 | 30 | Tuesday | False | 53.0 |
Difference between two dates¶
#Age at the time of signup
df['age_at_signup'] = df['signup_time'].dt.year - df['dob'].dt.year
df.head()
user_id | purchase_value | signup_time | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | signup_day_of_week | signup_is_leap_year | age | age_at_signup | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 65278 | 2015-02-24 22:55:00 | 1976-02-22 | 22/02/1976 | 2015 | 2 | 2 | 9 | Tuesday | False | 47.0 | 39.0 |
1 | 333320 | 96399 | 2015-06-07 20:39:00 | 1962-01-02 | 02/01/1962 | 2015 | 6 | 6 | 23 | Sunday | False | 61.0 | 53.0 |
2 | 1359 | 57296 | 2015-01-01 18:52:00 | 1962-03-05 | 05/03/1962 | 2015 | 1 | 1 | 1 | Thursday | False | 61.0 | 53.0 |
3 | 150084 | 43650 | 2015-04-28 21:13:00 | 1974-07-03 | 03/07/1974 | 2015 | 4 | 4 | 18 | Tuesday | False | 49.0 | 41.0 |
4 | 221365 | 45016 | 2015-07-21 07:09:00 | 1970-08-25 | 25/08/1970 | 2015 | 7 | 7 | 30 | Tuesday | False | 53.0 | 45.0 |
import numpy as np
#create new columns that contains date differences
df['age_at_signup_days'] = (df['signup_time'] - df['dob']) / np.timedelta64(1, 'D')
df['age_at_signup_weeks'] = (df['signup_time'] - df['dob']) / np.timedelta64(1, 'W')
df['age_at_signup_months'] = (df['signup_time'] - df['dob']) / np.timedelta64(1, 'M')
df['age_at_signup_years'] = (df['signup_time'] - df['dob']) / np.timedelta64(1, 'Y')
df.head()
user_id | purchase_value | signup_time | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | signup_day_of_week | signup_is_leap_year | age | age_at_signup | age_at_signup_days | age_at_signup_weeks | age_at_signup_months | age_at_signup_years | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 65278 | 2015-02-24 22:55:00 | 1976-02-22 | 22/02/1976 | 2015 | 2 | 2 | 9 | Tuesday | False | 47.0 | 39.0 | 14247.954861 | 2035.422123 | 468.114905 | 39.009575 |
1 | 333320 | 96399 | 2015-06-07 20:39:00 | 1962-01-02 | 02/01/1962 | 2015 | 6 | 6 | 23 | Sunday | False | 61.0 | 53.0 | 19514.860417 | 2787.837202 | 641.158477 | 53.429873 |
2 | 1359 | 57296 | 2015-01-01 18:52:00 | 1962-03-05 | 05/03/1962 | 2015 | 1 | 1 | 1 | Thursday | False | 61.0 | 53.0 | 19295.786111 | 2756.540873 | 633.960816 | 52.830068 |
3 | 150084 | 43650 | 2015-04-28 21:13:00 | 1974-07-03 | 03/07/1974 | 2015 | 4 | 4 | 18 | Tuesday | False | 49.0 | 41.0 | 14909.884028 | 2129.983433 | 489.862511 | 40.821876 |
4 | 221365 | 45016 | 2015-07-21 07:09:00 | 1970-08-25 | 25/08/1970 | 2015 | 7 | 7 | 30 | Tuesday | False | 53.0 | 45.0 | 16401.297917 | 2343.042560 | 538.862742 | 44.905228 |
Get the datepart and timepart from datetime¶
df['signup_datepart'] = df['signup_time'].dt.date
df['signup_timepart'] = df['signup_time'].dt.time
df[['signup_time', 'signup_datepart', 'signup_timepart']] .head()
signup_time | signup_datepart | signup_timepart | |
---|---|---|---|
0 | 2015-02-24 22:55:00 | 2015-02-24 | 22:55:00 |
1 | 2015-06-07 20:39:00 | 2015-06-07 | 20:39:00 |
2 | 2015-01-01 18:52:00 | 2015-01-01 | 18:52:00 |
3 | 2015-04-28 21:13:00 | 2015-04-28 | 21:13:00 |
4 | 2015-07-21 07:09:00 | 2015-07-21 | 07:09:00 |
set the date column as the index¶
df = df.set_index(['signup_time'])
df.head()
#Reset index to default
#df = df.reset_index(level=0)
user_id | purchase_value | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | signup_day_of_week | signup_is_leap_year | age | age_at_signup | age_at_signup_days | age_at_signup_weeks | age_at_signup_months | age_at_signup_years | signup_datepart | signup_timepart | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
signup_time | ||||||||||||||||||
2015-02-24 22:55:00 | 22058 | 65278 | 1976-02-22 | 22/02/1976 | 2015 | 2 | 2 | 9 | Tuesday | False | 47.0 | 39.0 | 14247.954861 | 2035.422123 | 468.114905 | 39.009575 | 2015-02-24 | 22:55:00 |
2015-06-07 20:39:00 | 333320 | 96399 | 1962-01-02 | 02/01/1962 | 2015 | 6 | 6 | 23 | Sunday | False | 61.0 | 53.0 | 19514.860417 | 2787.837202 | 641.158477 | 53.429873 | 2015-06-07 | 20:39:00 |
2015-01-01 18:52:00 | 1359 | 57296 | 1962-03-05 | 05/03/1962 | 2015 | 1 | 1 | 1 | Thursday | False | 61.0 | 53.0 | 19295.786111 | 2756.540873 | 633.960816 | 52.830068 | 2015-01-01 | 18:52:00 |
2015-04-28 21:13:00 | 150084 | 43650 | 1974-07-03 | 03/07/1974 | 2015 | 4 | 4 | 18 | Tuesday | False | 49.0 | 41.0 | 14909.884028 | 2129.983433 | 489.862511 | 40.821876 | 2015-04-28 | 21:13:00 |
2015-07-21 07:09:00 | 221365 | 45016 | 1970-08-25 | 25/08/1970 | 2015 | 7 | 7 | 30 | Tuesday | False | 53.0 | 45.0 | 16401.297917 | 2343.042560 | 538.862742 | 44.905228 | 2015-07-21 | 07:09:00 |
df = df.reset_index(level=0)
df.head()
signup_time | user_id | purchase_value | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | signup_day_of_week | signup_is_leap_year | age | age_at_signup | age_at_signup_days | age_at_signup_weeks | age_at_signup_months | age_at_signup_years | signup_datepart | signup_timepart | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-02-24 22:55:00 | 22058 | 65278 | 1976-02-22 | 22/02/1976 | 2015 | 2 | 2 | 9 | Tuesday | False | 47.0 | 39.0 | 14247.954861 | 2035.422123 | 468.114905 | 39.009575 | 2015-02-24 | 22:55:00 |
1 | 2015-06-07 20:39:00 | 333320 | 96399 | 1962-01-02 | 02/01/1962 | 2015 | 6 | 6 | 23 | Sunday | False | 61.0 | 53.0 | 19514.860417 | 2787.837202 | 641.158477 | 53.429873 | 2015-06-07 | 20:39:00 |
2 | 2015-01-01 18:52:00 | 1359 | 57296 | 1962-03-05 | 05/03/1962 | 2015 | 1 | 1 | 1 | Thursday | False | 61.0 | 53.0 | 19295.786111 | 2756.540873 | 633.960816 | 52.830068 | 2015-01-01 | 18:52:00 |
3 | 2015-04-28 21:13:00 | 150084 | 43650 | 1974-07-03 | 03/07/1974 | 2015 | 4 | 4 | 18 | Tuesday | False | 49.0 | 41.0 | 14909.884028 | 2129.983433 | 489.862511 | 40.821876 | 2015-04-28 | 21:13:00 |
4 | 2015-07-21 07:09:00 | 221365 | 45016 | 1970-08-25 | 25/08/1970 | 2015 | 7 | 7 | 30 | Tuesday | False | 53.0 | 45.0 | 16401.297917 | 2343.042560 | 538.862742 | 44.905228 | 2015-07-21 | 07:09:00 |
Subset data between two dates¶
df["signup_time_new"] = pd.to_datetime(df['signup_time'])
df.dtypes
signup_time datetime64[ns] user_id int64 purchase_value int64 dob datetime64[ns] dob_new object signup_year int64 signup_month int64 signup_date int64 signup_week_of_year UInt32 signup_day_of_week object signup_is_leap_year bool age float64 age_at_signup float64 age_at_signup_days float64 age_at_signup_weeks float64 age_at_signup_months float64 age_at_signup_years float64 signup_datepart object signup_timepart object signup_time_new datetime64[ns] dtype: object
subset = df[(df['signup_time_new'] >= '2015-02-15') & (df['signup_time_new'] <= '2015-02-28')]
subset.head()
signup_time | user_id | purchase_value | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | signup_day_of_week | signup_is_leap_year | age | age_at_signup | age_at_signup_days | age_at_signup_weeks | age_at_signup_months | age_at_signup_years | signup_datepart | signup_timepart | signup_time_new | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-02-24 22:55:00 | 22058 | 65278 | 1976-02-22 | 22/02/1976 | 2015 | 2 | 2 | 9 | Tuesday | False | 47.0 | 39.0 | 14247.954861 | 2035.422123 | 468.114905 | 39.009575 | 2015-02-24 | 22:55:00 | 2015-02-24 22:55:00 |
14 | 2015-02-16 02:50:00 | 82931 | 71616 | 1991-10-21 | 21/10/1991 | 2015 | 2 | 2 | 8 | Monday | False | 32.0 | 24.0 | 8519.118056 | 1217.016865 | 279.894636 | 23.324553 | 2015-02-16 | 02:50:00 | 2015-02-16 02:50:00 |
43 | 2015-02-19 12:31:00 | 250432 | 59053 | 1975-04-06 | 06/04/1975 | 2015 | 2 | 2 | 8 | Thursday | False | 48.0 | 40.0 | 14564.521528 | 2080.645933 | 478.515667 | 39.876306 | 2015-02-19 | 12:31:00 | 2015-02-19 12:31:00 |
61 | 2015-02-20 18:29:00 | 221091 | 86357 | NaT | NaN | 2015 | 2 | 2 | 8 | Friday | False | NaN | NaN | NaN | NaN | NaN | NaN | 2015-02-20 | 18:29:00 | 2015-02-20 18:29:00 |
102 | 2015-02-16 03:40:00 | 259616 | 98184 | 1992-07-09 | 09/07/1992 | 2015 | 2 | 2 | 8 | Monday | False | 31.0 | 23.0 | 8257.152778 | 1179.593254 | 271.287797 | 22.607316 | 2015-02-16 | 03:40:00 | 2015-02-16 03:40:00 |
Aggregate over indexed datetime column¶
# Get the total of purchase_value in 2015
df[df['signup_year'] == 2015].purchase_value.sum()
50941499
df.groupby('signup_day_of_week')['purchase_value'].sum()
signup_day_of_week Friday 8514166 Monday 7088526 Saturday 6432936 Sunday 8482116 Thursday 6621175 Tuesday 6403681 Wednesday 7398899 Name: purchase_value, dtype: int64
df.dob = df.dob.dt.strftime('%d/%m/%Y')
df.dtypes
signup_time datetime64[ns] user_id int64 purchase_value int64 dob object dob_new object signup_year int64 signup_month int64 signup_date int64 signup_week_of_year UInt32 signup_day_of_week object signup_is_leap_year bool age float64 age_at_signup float64 age_at_signup_days float64 age_at_signup_weeks float64 age_at_signup_months float64 age_at_signup_years float64 signup_datepart object signup_timepart object signup_time_new datetime64[ns] dtype: object
df.head(2)
signup_time | user_id | purchase_value | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | signup_day_of_week | signup_is_leap_year | age | age_at_signup | age_at_signup_days | age_at_signup_weeks | age_at_signup_months | age_at_signup_years | signup_datepart | signup_timepart | signup_time_new | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-02-24 22:55:00 | 22058 | 65278 | 22/02/1976 | 22/02/1976 | 2015 | 2 | 2 | 9 | Tuesday | False | 47.0 | 39.0 | 14247.954861 | 2035.422123 | 468.114905 | 39.009575 | 2015-02-24 | 22:55:00 | 2015-02-24 22:55:00 |
1 | 2015-06-07 20:39:00 | 333320 | 96399 | 02/01/1962 | 02/01/1962 | 2015 | 6 | 6 | 23 | Sunday | False | 61.0 | 53.0 | 19514.860417 | 2787.837202 | 641.158477 | 53.429873 | 2015-06-07 | 20:39:00 | 2015-06-07 20:39:00 |
#Reset index to default
df = df.reset_index(level=0)
df.head()
index | signup_time | user_id | purchase_value | dob | dob_new | signup_year | signup_month | signup_date | signup_week_of_year | ... | signup_is_leap_year | age | age_at_signup | age_at_signup_days | age_at_signup_weeks | age_at_signup_months | age_at_signup_years | signup_datepart | signup_timepart | signup_time_new | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2015-02-24 22:55:00 | 22058 | 65278 | 22/02/1976 | 22/02/1976 | 2015 | 2 | 2 | 9 | ... | False | 47.0 | 39.0 | 14247.954861 | 2035.422123 | 468.114905 | 39.009575 | 2015-02-24 | 22:55:00 | 2015-02-24 22:55:00 |
1 | 1 | 2015-06-07 20:39:00 | 333320 | 96399 | 02/01/1962 | 02/01/1962 | 2015 | 6 | 6 | 23 | ... | False | 61.0 | 53.0 | 19514.860417 | 2787.837202 | 641.158477 | 53.429873 | 2015-06-07 | 20:39:00 | 2015-06-07 20:39:00 |
2 | 2 | 2015-01-01 18:52:00 | 1359 | 57296 | 05/03/1962 | 05/03/1962 | 2015 | 1 | 1 | 1 | ... | False | 61.0 | 53.0 | 19295.786111 | 2756.540873 | 633.960816 | 52.830068 | 2015-01-01 | 18:52:00 | 2015-01-01 18:52:00 |
3 | 3 | 2015-04-28 21:13:00 | 150084 | 43650 | 03/07/1974 | 03/07/1974 | 2015 | 4 | 4 | 18 | ... | False | 49.0 | 41.0 | 14909.884028 | 2129.983433 | 489.862511 | 40.821876 | 2015-04-28 | 21:13:00 | 2015-04-28 21:13:00 |
4 | 4 | 2015-07-21 07:09:00 | 221365 | 45016 | 25/08/1970 | 25/08/1970 | 2015 | 7 | 7 | 30 | ... | False | 53.0 | 45.0 | 16401.297917 | 2343.042560 | 538.862742 | 44.905228 | 2015-07-21 | 07:09:00 | 2015-07-21 07:09:00 |
5 rows × 21 columns
DataFrame.DateOffset()¶
# Adding or subtracting days
df['signup_time_end']=df['signup_time']+pd.DateOffset(days=365)
df[['signup_time' , 'signup_time_end']].head()
signup_time | signup_time_end | |
---|---|---|
0 | 2015-02-24 22:55:00 | 2016-02-24 22:55:00 |
1 | 2015-06-07 20:39:00 | 2016-06-06 20:39:00 |
2 | 2015-01-01 18:52:00 | 2016-01-01 18:52:00 |
3 | 2015-04-28 21:13:00 | 2016-04-27 21:13:00 |
4 | 2015-07-21 07:09:00 | 2016-07-20 07:09:00 |
# Adding Year(s)
df['signup_time_end_2']=df['signup_time']+pd.DateOffset(years=2)
df[['signup_time' , 'signup_time_end_2']].head()
signup_time | signup_time_end_2 | |
---|---|---|
0 | 2015-02-24 22:55:00 | 2017-02-24 22:55:00 |
1 | 2015-06-07 20:39:00 | 2017-06-07 20:39:00 |
2 | 2015-01-01 18:52:00 | 2017-01-01 18:52:00 |
3 | 2015-04-28 21:13:00 | 2017-04-28 21:13:00 |
4 | 2015-07-21 07:09:00 | 2017-07-21 07:09:00 |
Start/End of previous quarter/month/year¶
import calendar
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
first_day, last_day = calendar.monthrange(2022,2)
print(first_day,last_day)
1 28
sample_date = date.today()
sample_date
datetime.date(2023, 4, 6)
# you can use today or a refernece date
CMB = sample_date + relativedelta(day = 1)
CMB
datetime.date(2023, 4, 1)
CME = sample_date + relativedelta(day = 31)
CME
datetime.date(2023, 4, 30)
# Please note: use'months' not 'month'
PMB = (sample_date + relativedelta(months = -1))+ relativedelta(day = 1)
PMB
datetime.date(2023, 3, 1)
PME = (sample_date + relativedelta(months = -1))+ relativedelta(day = 31)
PME
datetime.date(2023, 3, 31)
# Please note: use'years' not 'year'
PYB = date((sample_date + relativedelta(years = -1)).year,1,1)
PYB
datetime.date(2022, 1, 1)
PYE = date((sample_date + relativedelta(years = -1)).year,12,31)
PYE
datetime.date(2022, 12, 31)
def PQB(ref_date,n):
PREVIOUS_QTR_STG1 = (sample_date + relativedelta(months = 1*n*3))+ relativedelta(day = 1)
PREVIOUS_QTR_STG1_Month = PREVIOUS_QTR_STG1.month
PREVIOUS_QTR_STG1_year = PREVIOUS_QTR_STG1.year
if PREVIOUS_QTR_STG1_Month in (1,2,3):
x = 1
elif PREVIOUS_QTR_STG1_Month in (4,5,6):
x = 4
elif PREVIOUS_QTR_STG1_Month in (7,8,9):
x = 7
elif PREVIOUS_QTR_STG1_Month in (10,11,12):
x = 10
PREVIOUS_QTR_STG2 = date(PREVIOUS_QTR_STG1_year,x,1)
return PREVIOUS_QTR_STG2
n = -2
sample_date = date(2023,3,10)
PQB(sample_date,n)
datetime.date(2022, 7, 1)
def PQE(ref_date,n):
PREVIOUS_QTR_STG1 = (sample_date + relativedelta(months = 1*n*3))+ relativedelta(day = 1)
PREVIOUS_QTR_STG1_Month = PREVIOUS_QTR_STG1.month
PREVIOUS_QTR_STG1_year = PREVIOUS_QTR_STG1.year
if PREVIOUS_QTR_STG1_Month in (1,2,3):
x = 3
elif PREVIOUS_QTR_STG1_Month in (4,5,6):
x = 6
elif PREVIOUS_QTR_STG1_Month in (7,8,9):
x = 9
elif PREVIOUS_QTR_STG1_Month in (10,11,12):
x = 12
PREVIOUS_QTR_STG2 = date(PREVIOUS_QTR_STG1_year,x,1)+ relativedelta(day = 31)
return PREVIOUS_QTR_STG2
n = -2
sample_date = date(2023,3,10)
PQE(sample_date,n)
datetime.date(2022, 9, 30)
Let's try to change the time-zones of the date fields¶
import pandas as pd
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv", usecols = ['user_id','signup_time'])
df["signup_time"] = pd.to_datetime(df['signup_time'])
df.head(5)
user_id | signup_time | |
---|---|---|
0 | 22058 | 2015-02-24 22:55:00 |
1 | 333320 | 2015-06-07 20:39:00 |
2 | 1359 | 2015-01-01 18:52:00 |
3 | 150084 | 2015-04-28 21:13:00 |
4 | 221365 | 2015-07-21 07:09:00 |
# set the current time as of Asia
df['EST_timezone'] = df.signup_time.dt.tz_localize('EST')
df['MST_timezone'] = df.EST_timezone.dt.tz_convert('MST')
df['IST_timezone'] = df.EST_timezone.dt.tz_convert('Asia/Calcutta')
df.head(5)
user_id | signup_time | EST_timezone | MST_timezone | IST_timezone | |
---|---|---|---|---|---|
0 | 22058 | 2015-02-24 22:55:00 | 2015-02-24 22:55:00-05:00 | 2015-02-24 20:55:00-07:00 | 2015-02-25 09:25:00+05:30 |
1 | 333320 | 2015-06-07 20:39:00 | 2015-06-07 20:39:00-05:00 | 2015-06-07 18:39:00-07:00 | 2015-06-08 07:09:00+05:30 |
2 | 1359 | 2015-01-01 18:52:00 | 2015-01-01 18:52:00-05:00 | 2015-01-01 16:52:00-07:00 | 2015-01-02 05:22:00+05:30 |
3 | 150084 | 2015-04-28 21:13:00 | 2015-04-28 21:13:00-05:00 | 2015-04-28 19:13:00-07:00 | 2015-04-29 07:43:00+05:30 |
4 | 221365 | 2015-07-21 07:09:00 | 2015-07-21 07:09:00-05:00 | 2015-07-21 05:09:00-07:00 | 2015-07-21 17:39:00+05:30 |
Module 9 : Arithmetic Operations on Pandas DataFrame¶
#importing a csv file with pd.read_csv
import pandas as pd
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head(2)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 732758368.8 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 350311387.9 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
Create a New Columns with Arithmetic Operators : + - * /¶
discount = 0.21
df['discount'] = df['purchase_value'] * discount
df['actual_purchase_value'] = df['purchase_value'] - df['discount']
df[['purchase_value', 'discount', 'actual_purchase_value']].head()
purchase_value | discount | actual_purchase_value | |
---|---|---|---|
0 | 65278 | 13708.38 | 51569.62 |
1 | 96399 | 20243.79 | 76155.21 |
2 | 57296 | 12032.16 | 45263.84 |
3 | 43650 | 9166.50 | 34483.50 |
4 | 45016 | 9453.36 | 35562.64 |
Create lags and leads of a column in a pandas DataFrame¶
#importing a simple csv file with pd.read_csv
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv",usecols = ['user_id','purchase_value','purchase_time'])
df.sort_values(['user_id'],ascending= [True],inplace=True,ignore_index=True)
df.head()
user_id | purchase_time | purchase_value | |
---|---|---|---|
0 | 170 | 8/26/2015 21:25 | 39879 |
1 | 753 | 5/25/2015 21:28 | 67119 |
2 | 1359 | 1/1/2015 18:52 | 57296 |
3 | 1361 | 7/29/2015 0:53 | 73482 |
4 | 2048 | 4/12/2015 12:23 | 30576 |
Lag shifts a column down by a certain number with .shift()¶
df['lag_1_user_id'] = df['user_id'].shift(1)
df['lag_2_user_id'] = df['user_id'].shift(2)
df[['user_id', 'lag_1_user_id', 'lag_2_user_id']].head()
user_id | lag_1_user_id | lag_2_user_id | |
---|---|---|---|
0 | 170 | NaN | NaN |
1 | 753 | 170.0 | NaN |
2 | 1359 | 753.0 | 170.0 |
3 | 1361 | 1359.0 | 753.0 |
4 | 2048 | 1361.0 | 1359.0 |
lead shifts a column up by a certain number with .shift()¶
df['lead_1_user_id'] = df['user_id'].shift(-1)
df['lead_2_user_id'] = df['user_id'].shift(-2)
df[['user_id', 'lead_1_user_id', 'lead_2_user_id']].head()
user_id | lead_1_user_id | lead_2_user_id | |
---|---|---|---|
0 | 170 | 753.0 | 1359.0 |
1 | 753 | 1359.0 | 1361.0 |
2 | 1359 | 1361.0 | 2048.0 |
3 | 1361 | 2048.0 | 2462.0 |
4 | 2048 | 2462.0 | 2474.0 |
with numpy library : .apply(np.ceil) AND .apply(np.floor)¶
with numpy library : .apply(np.ceil) AND .apply(np.floor)¶
import numpy as np
df['purchase_value_USD'] = df['purchase_value']/82
df['purchase_value_USD_Ceil'] = df['purchase_value'].apply(np.ceil)
df['purchase_value_USD_Floor'] = df['purchase_value'].apply(np.floor)
df[['purchase_value_USD', 'purchase_value_USD_Ceil', 'purchase_value_USD_Floor']].head()
purchase_value_USD | purchase_value_USD_Ceil | purchase_value_USD_Floor | |
---|---|---|---|
0 | 486.329268 | 39879.0 | 39879.0 |
1 | 818.524390 | 67119.0 | 67119.0 |
2 | 698.731707 | 57296.0 | 57296.0 |
3 | 896.121951 | 73482.0 | 73482.0 |
4 | 372.878049 | 30576.0 | 30576.0 |
with math library - .apply(math.ceil) AND .apply(math.floor)¶
import math
df['purchase_value_USD'] = df['purchase_value']/82
df['purchase_value_USD_Ceil'] = df['purchase_value_USD'].apply(math.ceil)
df['purchase_value_USD_Floor'] = df['purchase_value_USD'].apply(math.floor)
df[["purchase_value_USD", "purchase_value_USD_Ceil", "purchase_value_USD_Floor"]].head()
purchase_value_USD | purchase_value_USD_Ceil | purchase_value_USD_Floor | |
---|---|---|---|
0 | 486.329268 | 487 | 486 |
1 | 818.524390 | 819 | 818 |
2 | 698.731707 | 699 | 698 |
3 | 896.121951 | 897 | 896 |
4 | 372.878049 | 373 | 372 |
df['rounded_off'] = round(df['purchase_value_USD'])
df['rounded_off_single_decimal'] = round(df['purchase_value_USD'],1)
df['rounded_off_double_decimal'] = round(df['purchase_value_USD'],2)
df[['purchase_value_USD', 'rounded_off', 'rounded_off_single_decimal', 'rounded_off_double_decimal']].head()
purchase_value_USD | rounded_off | rounded_off_single_decimal | rounded_off_double_decimal | |
---|---|---|---|---|
0 | 486.329268 | 486.0 | 486.3 | 486.33 |
1 | 818.524390 | 819.0 | 818.5 | 818.52 |
2 | 698.731707 | 699.0 | 698.7 | 698.73 |
3 | 896.121951 | 896.0 | 896.1 | 896.12 |
4 | 372.878049 | 373.0 | 372.9 | 372.88 |
.round(decimal=) as a method¶
df['rounded_off_single_decimal'] = df['purchase_value_USD'].round(decimals=1)
df['rounded_off_double_decimal'] = df['purchase_value_USD'].round(decimals=2)
df[['purchase_value_USD', 'rounded_off_double_decimal', 'rounded_off_double_decimal']].head()
purchase_value_USD | rounded_off_double_decimal | rounded_off_double_decimal | |
---|---|---|---|
0 | 486.329268 | 486.33 | 486.33 |
1 | 818.524390 | 818.52 | 818.52 |
2 | 698.731707 | 698.73 | 698.73 |
3 | 896.121951 | 896.12 | 896.12 |
4 | 372.878049 | 372.88 | 372.88 |
df["Remainder"] = df['purchase_value_USD'].mod(100)
df[["purchase_value_USD", "Remainder"]].head()
purchase_value_USD | Remainder | |
---|---|---|
0 | 486.329268 | 86.329268 |
1 | 818.524390 | 18.524390 |
2 | 698.731707 | 98.731707 |
3 | 896.121951 | 96.121951 |
4 | 372.878049 | 72.878049 |
df['log_purchase_value'] = df['purchase_value'].apply(math.log)
df['log2_purchase_value'] = df['purchase_value'].apply(math.log2)
df['log10_purchase_value'] = df['purchase_value'].apply(math.log10)
df[['purchase_value', 'log_purchase_value', 'log2_purchase_value', 'log10_purchase_value']].head()
purchase_value | log_purchase_value | log2_purchase_value | log10_purchase_value | |
---|---|---|---|---|
0 | 39879 | 10.593605 | 15.283342 | 4.600744 |
1 | 67119 | 11.114222 | 16.034434 | 4.826845 |
2 | 57296 | 10.955986 | 15.806147 | 4.758124 |
3 | 73482 | 11.204796 | 16.165103 | 4.866181 |
4 | 30576 | 10.327971 | 14.900112 | 4.485381 |
Module 10 : query() Method for Pandas DataFrame¶
#importing a csv file with pd.read_csv
import pandas as pd
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head(3)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
df.query("category == 'electronics'").head(2)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
5 | 159135 | 5/21/2015 6:03 | 7/9/2015 8:05 | 37124 | ALEYXFXINSXLZ | Ads | Chrome | M | 2.809315e+09 | 0 | electronics | 4/8/1997 | Lee Z Chyenz | lee#chyenz@gmail.com |
Multiple conditions with and , or¶
df.query("category == 'electronics' and purchase_value >= 50000").head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
6 | 50116 | 8/1/2015 22:40 | 8/27/2015 3:37 | 75000 | IWKVZHJOCLPUR | Ads | Chrome | F | 3.987484e+09 | 0 | electronics | 5/1/1996 | Dhanya J Skariachano | dhanya.skariachano@gmail.com |
31 | 396746 | 8/5/2015 23:50 | 8/14/2015 12:08 | 83290 | JUWCOHMHRBDCL | SEO | IE | M | 2.937614e+09 | 0 | electronics | 8/16/1959 | Steve U Slatery | steve.slatery@gmail.com |
32 | 243817 | 2/13/2015 18:16 | 4/23/2015 17:44 | 64864 | YIIEFBZUGEXNR | Direct | Chrome | F | 2.521871e+09 | 0 | electronics | 2/8/1983 | ita D Bosed | ita.bosed@yahoo.com |
42 | 55002 | 6/9/2015 0:32 | 9/17/2015 13:45 | 85743 | QLHWTRKJZWMAW | SEO | FireFox | M | 3.184465e+09 | 0 | electronics | 8/1/1996 | Jessica I Toonkelx | jessica.toonkelx@sutletgroup.com |
df.query("category == 'electronics' or purchase_value >= 50000").head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
5 | 159135 | 5/21/2015 6:03 | 7/9/2015 8:05 | 37124 | ALEYXFXINSXLZ | Ads | Chrome | M | 2.809315e+09 | 0 | electronics | 4/8/1997 | Lee Z Chyenz | lee#chyenz@gmail.com |
6 | 50116 | 8/1/2015 22:40 | 8/27/2015 3:37 | 75000 | IWKVZHJOCLPUR | Ads | Chrome | F | 3.987484e+09 | 0 | electronics | 5/1/1996 | Dhanya J Skariachano | dhanya.skariachano@gmail.com |
df.query("category in ['electronics','apparels']").head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
5 | 159135 | 5/21/2015 6:03 | 7/9/2015 8:05 | 37124 | ALEYXFXINSXLZ | Ads | Chrome | M | 2.809315e+09 | 0 | electronics | 4/8/1997 | Lee Z Chyenz | lee#chyenz@gmail.com |
6 | 50116 | 8/1/2015 22:40 | 8/27/2015 3:37 | 75000 | IWKVZHJOCLPUR | Ads | Chrome | F | 3.987484e+09 | 0 | electronics | 5/1/1996 | Dhanya J Skariachano | dhanya.skariachano@gmail.com |
8 | 159045 | 4/21/2015 23:38 | 6/2/2015 14:01 | 66311 | ILXYDOZIHOOHT | SEO | IE | F | 3.719094e+09 | 0 | apparels | 1/22/1972 | Andreas B Cremerf | andreas.cremerf@sutletgroup.com |
# Query between a range of values
df.query("purchase_value >= 80000 and purchase_value <= 90000").head(2)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13 | 299320 | 3/3/2015 19:17 | 4/5/2015 12:32 | 84732 | RMKQNVEWGTWPC | Direct | Safari | M | 3.178510e+09 | 0 | cosmetics | 10/15/1977 | Rachelle K Younglaic | rachelle.younglaic@tntech.edu |
15 | 31383 | 2/1/2015 1:06 | 3/24/2015 10:17 | 84796 | UNUAVQXNWFBMO | SEO | Safari | F | 9.957328e+08 | 0 | home_essentials | 8/26/1991 | Nate F Raymondj | nate.raymondj@gmail.com |
df.query('name.str[1] == "a"').head(2)
# df.query('name.str[1].str.lower() == "a"').head(2)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
.str.slice() to check the portion with-in a text column¶
df.query('name.str.slice(1,4).str.lower() == "arr"').head(2)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
62 | 222633 | 4/18/2015 14:33 | 6/3/2015 0:43 | 35386 | CYULQPBDHLIBY | SEO | FireFox | M | 1.438733e+09 | 0 | home_essentials | 8/4/1974 | Warren B Strobels | warren.strobels@yahoo.com |
List of Operators which can be used : '==' , '!=', '>' , '>=', '<' , '<=', isna(), in(), notin()¶
Module 11 : Running SQL Queries on Pandas DataFrame - PANDASQL¶
We can query pandas DataFrame directly using SQL queries¶
#!pip install pandasql
import pandas as pd
import numpy as np
import pandasql as ps
df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan, 5230],
[1234, 'Customer A', np.nan, '333 Street',8345],
[1233, 'Customer B', '444 Street', '333 Street', 3456],
[1233, 'Customer B', '444 Street', '666 Street', 4578]],
columns = ['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Payment'])
use the 'sqldf' function of pandasql('ps') library - ps.sqldf¶
ps.sqldf('''
select * from df limit 2
''')
ID | Customer | Billing Address | Shipping Address | Payment | |
---|---|---|---|---|---|
0 | 1234 | Customer A | 123 Street | None | 5230 |
1 | 1234 | Customer A | None | 333 Street | 8345 |
ps.sqldf('''select distinct ID from df''')
ID | |
---|---|
0 | 1234 |
1 | 1233 |
SQL GROUP BY¶
ps.sqldf('''
select ID, sum(Payment)
from df
group by ID
''')
ID | sum(Payment) | |
---|---|---|
0 | 1233 | 8034 |
1 | 1234 | 13575 |
SQL Order BY¶
ps.sqldf('''select ID, sum(Payment) as Total_Payment
from df
group by ID
order by Total_Payment ''')
ID | Total_Payment | |
---|---|---|
0 | 1233 | 8034 |
1 | 1234 | 13575 |
SQL having Clause¶
ps.sqldf('''
select ID, sum(Payment) as Total_Payment
from df
group by ID
having Total_Payment > 10000
''')
ID | Total_Payment | |
---|---|---|
0 | 1234 | 13575 |
Create a DataFrmae from sql query¶
df_id = ps.sqldf('''select distinct id, Customer from df''')
df_id
ID | Customer | |
---|---|---|
0 | 1234 | Customer A |
1 | 1233 | Customer B |
Module 11 : Python to Oracle, SQL Server, MySQL, and PostgreSQL¶
1) Connection to Oracle and fetch the first 10 rows from the famous employees table of HR schema¶
Note: how Oracle uses ROWNUM, instead of TOP, or LIMIT
#install cx_Oracle library through pip
#! pip install cx_Oracle
# import cx_Oracle
# dsn = cx_Oracle.makedsn(
# 'localhost',
# '1521',
# service_name='orcl'
# )
# conn = cx_Oracle.connect(
# user='username',
# password='password',
# dsn=dsn
# )
# c = conn.cursor()
# c.execute('SELECT * FROM employees WHERE ROWNUM <= 10')
# for row in c: print(row)
# conn.close()
# Note: The makedsn() function accepts the database hostname, the port number, and the service name.
#The cx_Oracle.makedsn() function can be used to construct a connect descriptor string for
# the data source name parameter 'dsn' of cx_Oracle.connect()
connect(): Now Establish a connection between the Python program and Oracle database by using connect() function.¶
cursor(): To execute a SQL query and to provide results some special object is required that is nothing but cursor() object.¶
cursor.execute('sqlquery') - - - -> to execute a SQL query.¶
commit(): For DML(Data Manipulation Language) queries that comprise operations like update, insert, delete. We need to commit() then only the result reflects in the database.¶
close(): After all done it is mandatory to close all operations.¶
2) SQL Server Connection¶
#install cx_Oracle library through pip
#! pip install pyodbc
# import pyodbc
# conn = pyodbc.connect(
# 'Driver={SQL Server};'
# 'Server=DESKTOP-TLF7IMQ\SQLEXPRESS;'
# 'Database=retail;'
# 'Trusted_Connection=yes;'
# )
# cursor = conn.cursor()
# cursor.execute('SELECT TOP 5 * FROM databasename.tablename')
# for row in cursor: print(row)
# conn.close()
3) MySQL Connection¶
# Python installed through Anaconda already come with necessary libraries built-in,
# so there’s no need for additional installations.
# To connect from Python, use mysql library:
# import mysql.connector
# conn = mysql.connector.connect(
# host='localhost',
# user='root',
# passwd='1234'
# )
# cursor = conn.cursor()
# cursor.execute('SELECT * FROM sakila.actor LIMIT 5')
# for row in cursor: print(row)
# conn.close()
#Note: By default, MySQL instance comes with sakila database
4) PostgreSQL Connection¶
To connect to PostgreSQL we will need to install psycopg2 library:
#!pip install psycopg2
# The database instance contains the earlier mentioned dvdrental database,
# so let's connect to it, once again to the actor table:
# import psycopg2
# conn = psycopg2.connect(
# user='postgres',
# password='1234',
# host='127.0.0.1',
# port='5432',
# database='dvdrental'
# )
# cursor = conn.cursor()
# cursor.execute('SELECT * FROM actor LIMIT 10')
# for row in cursor: print(row)
# conn.close()
Module 12 : User-Defined Functions (UDF) and Lambda Functions¶
def multiply():
a = 2
b = 3
x = a * b
return x
Calling a function¶
multiply()
6
UDF with Parameters/Arguments¶
Example1:¶
def multiply(a, b):
return(a*b)
multiply(125, 7)
875
Example2: A funtion to get (x+y)^2¶
def nsquare(x, y):
a = (x+y)**2
return a
nsquare(2,3)
25
Example3: A function to calculate the area of a circle¶
def calculate_area(radius):
pi = 22/7
area = pi * radius * radius
return area
calculate_area(9)
254.57142857142856
Example 4: A function to check if a number is even¶
def is_even(number):
if number%2 == 0:
return True
else:
return False
is_even(34)
True
Example 5: A function to calculate the sum of a list of numbers¶
def calculate_sum(numbers):
total = 0
for number in numbers:
total = total + number
return total
This function takes a list of numbers as input, iterates over each number in the list, adds it to a running total, and returns the final sum.
listing = [10,20,30,40,50,100]
calculate_sum(listing)
250
Types of arguments¶
1). Positional Arguments – These arguments are position specific. Thus, while calling a a function always pass the arguments in the same order as it is defined while creating the function.
Example 1: Calculating Simple Interest¶
def simple_interest(principal,time,rate):
interest=(principal * time * rate)/100
amount = interest + principal
return interest, amount
simple_interest(100000,3,12)
(36000.0, 136000.0)
Example 2: Calcuating Compound Interest and Maturity Amount¶
def compound_interest(principal, rate, time):
Amount = principal
for i in range(time):
Amount = Amount * (1 + rate/100)
CI = Amount - principal
return CI , Amount
compound_interest(100000,3,12)
(42576.08868461792, 142576.08868461792)
Example 3: Create a function to calculate simple interest over a period of n year and put it in a pandas dataframe.¶
def calculate_simple_interest(principal, rate, n):
# Calculate the interest rate per year
interest_rate = rate / 100
# Calculate the simple interest for each year
interest_list = []
for year in range(1, n+1):
interest = principal * interest_rate * year
interest_list.append(interest)
# Create a pandas DataFrame with the results
df = pd.DataFrame({
'Year': range(1, n+1),
'Interest': interest_list
})
return df
df = calculate_simple_interest(100000, 7.5, 4)
print(df)
Year Interest 0 1 7500.0 1 2 15000.0 2 3 22500.0 3 4 30000.0
2). Default Arguments – In some cases we have a default value that can be used only if the value for that parameter is not passed in the function calling statement. Default parameters must be present after the non-default parameters in the function header otherwise it will give an error.
def simple_interest(principal, time, rate=12):
interest = (principal * time * rate)/100
amount = interest + principal
return interest, amount
simple_interest(100000, 3)
(36000.0, 136000.0)
3). Keywords Arguments – These arguments are name(keyword) specific. Here, in the calling functions we can use the exact name of the parameters to call the function and it need not be in the same order as it is defined while creating the function.
def simple_interest(principal, time, rate):
interest = (principal * time * rate)/100
amount = interest + principal
return interest, amount
simple_interest(time=3, rate=12, principal=100000)
(36000.0, 136000.0)
4). Variable-length arguments – This type of argument allows us to pass any length of argument in the calling statement of the function according to our needs so there is no need to mention a fixed number of parameters in the function header.
def summation(*length):
total=0
for i in length:
total+=i
print("Total=",total)
summation(12, 13, 20)
Total= 45
summation(45,87,78,34,56,19,201)
Total= 520
Some more examples¶
#1. Write a function that inputs a number and prints the multiplication table of that number
def multiplication_table(number):
for i in range(1, 11):
print("{multiplier} * {multiplicand} = {multiplicantion}"
.format(multiplier=number, multiplicand=i, multiplicantion=number * i))
multiplication_table(13)
13 * 1 = 13 13 * 2 = 26 13 * 3 = 39 13 * 4 = 52 13 * 5 = 65 13 * 6 = 78 13 * 7 = 91 13 * 8 = 104 13 * 9 = 117 13 * 10 = 130
'Applying' User defined function(UDF) in pandas dataframes¶
#importing a csv file with pd.read_csv
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head(2)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 732758368.8 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 350311387.9 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
Example 1: create a new categorical(string) variable from a continous(numeric) variable¶
def fun(x):
if x <= 25000:
x = 'Below Normal'
elif x <= 50000:
x = 'Normal'
elif x <= 75000:
x = 'Above Normal'
else:
x = 'High'
return x
df['purchase_value_category'] = df['purchase_value'].apply(fun)
df[["purchase_value", "purchase_value_category"]].head()
purchase_value | purchase_value_category | |
---|---|---|
0 | 65278 | Above Normal |
1 | 96399 | High |
2 | 57296 | Above Normal |
3 | 43650 | Normal |
4 | 45016 | Normal |
df['purchase_value_category'].value_counts()
Above Normal 274 High 257 Below Normal 240 Normal 228 Name: purchase_value_category, dtype: int64
Example 2: Import multiple csv files in pandas datafram using user defined function¶
Create a function that takes a list of file names as an argument and uses the pandas read_csv() function to read each CSV file and concatenate them into a single DataFrame.
def concatenate_csv_files(file_list):
# Initialize an empty list to store the DataFrames
df_list = []
# Loop through each file in the list and read it into a DataFrame
for file in file_list:
df = pd.read_csv(file)
df_list.append(df)
# Concatenate the DataFrames into a single DataFrame
concatenated_df = pd.concat(df_list, ignore_index=True)
return concatenated_df
In this function, we loop over the list of file names and use pd.read_csv() to read each CSV file and create a DataFrame. We then append the DataFrame to a list. Finally, we use pd.concat() to concatenate all the DataFrames in the list into a single DataFrame.
Note: The "axis" parameter of "pd.concat()" is set to "0", which means that the DataFrames are concatenated vertically (i.e., row-wise). If you want to concatenate them horizontally (i.e., column-wise), you can set axis to "1".
We can also modify this function to include additional parameters that are passed to "pd.read_csv()", such as delimiter, header, etc., depending on your specific requirements.
file_list = ['files_3_Concat/file1.csv', 'files_3_Concat/file2.csv', 'files_3_Concat/file3.csv']
df = concatenate_csv_files(file_list)
df.head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 24-02-2015 22:55 | 18-04-2015 02:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 22-02-1976 |
1 | 333320 | 07-06-2015 20:39 | 08-06-2015 01:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 | apparels | 02-01-1962 |
2 | 1359 | 01-01-2015 18:52 | 01-01-2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 05-03-1962 |
3 | 150084 | 28-04-2015 21:13 | 04-05-2015 13:54 | 43650 | ATGTXKYKUDUQN | SEO | Safari | M | 3.840542e+09 | 0 | health_care | 03-07-1974 |
4 | 221365 | 21-07-2015 07:09 | 09-09-2015 18:40 | 45016 | NAUITBZFJKHWW | Ads | Safari | M | 4.155831e+08 | 0 | home_essentials | 25-08-1970 |
def get_csv_file_lengths(file_list):
# Initialize an empty list to store the (file_name, num_rows) tuples
lengths_list = []
# Loop through each file in the list and get the number of rows
for file_name in file_list:
df = pd.read_csv(file_name)
num_rows = len(df)
lengths_list.append((file_name, num_rows))
return lengths_list
get_csv_file_lengths(file_list)
[('files_3_Concat/file1.csv', 471), ('files_3_Concat/file2.csv', 580), ('files_3_Concat/file3.csv', 439)]
df.head(2)
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 24-02-2015 22:55 | 18-04-2015 02:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 22-02-1976 |
1 | 333320 | 07-06-2015 20:39 | 08-06-2015 01:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 | apparels | 02-01-1962 |
Lambda function in pandas¶
- A lambda function in pandas is a small anonymous function that can be used to perform simple operations on pandas DataFrames or Series
- The lambda function is defined using the lambda keyword, followed by the input parameters and the operation to be performed.
- Lambda functions are typically used for one-off operations that do not need to be defined as a separate function.
Unlike user defined function which can we used time and again, it is for one time usage, like OTP (one time password)
Example 1: use of Lamda function to calculates the fee charged on each purchase value in df, and adds the fee values as a new column named "Fee" to the DataFrame¶
df['Fee'] = df['purchase_value']*(1-0.91)
df[['user_id', 'purchase_value', 'Fee']].head()
user_id | purchase_value | Fee | |
---|---|---|---|
0 | 22058 | 65278 | 5875.02 |
1 | 333320 | 96399 | 8675.91 |
2 | 1359 | 57296 | 5156.64 |
3 | 150084 | 43650 | 3928.50 |
4 | 221365 | 45016 | 4051.44 |
df['Fee'] = df['purchase_value'].map(lambda x: x - (x*0.91))
df[['user_id', 'purchase_value', 'Fee']].head()
user_id | purchase_value | Fee | |
---|---|---|---|
0 | 22058 | 65278 | 5875.02 |
1 | 333320 | 96399 | 8675.91 |
2 | 1359 | 57296 | 5156.64 |
3 | 150084 | 43650 | 3928.50 |
4 | 221365 | 45016 | 4051.44 |
Example 2: lambda function to create a new column called 'Bonus', which is 10% of each employee's salary.¶
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 70000, 90000]})
df
Name | Age | Salary | |
---|---|---|---|
0 | Alice | 25 | 50000 |
1 | Bob | 30 | 70000 |
2 | Charlie | 35 | 90000 |
df['Bonus'] = df['Salary']*0.1
df
Name | Age | Salary | Bonus | |
---|---|---|---|---|
0 | Alice | 25 | 50000 | 5000.0 |
1 | Bob | 30 | 70000 | 7000.0 |
2 | Charlie | 35 | 90000 | 9000.0 |
df['Bonus'] = df['Salary'].apply(lambda x: x * 0.1)
df
Name | Age | Salary | Bonus | |
---|---|---|---|---|
0 | Alice | 25 | 50000 | 5000.0 |
1 | Bob | 30 | 70000 | 7000.0 |
2 | Charlie | 35 | 90000 | 9000.0 |
Here, we used the apply method to apply a lambda function to each row of the 'Salary' column. The lambda function takes the input parameter x, which represents the value of each element in the 'Salary' column. The apply method returns a new Series object, which we assign to the 'Bonus' column of the original DataFrame.
Example 3: lambda function with filter in pandas to filter rows based on a condition.¶
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 70000, 90000]})
# Use lambda function with filter to filter rows where Salary > 60000
filtered_df = df.loc[lambda df: df['Salary'] > 60000]
filtered_df
Name | Age | Salary | |
---|---|---|---|
1 | Bob | 30 | 70000 |
2 | Charlie | 35 | 90000 |
Here, we used the .loc with a lambda function to apply the filter to the DataFrame.
Example 4: use groupby and then apply a lambda function to filter the grouped data.¶
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
'Department': ['Sales', 'Marketing', 'Sales', 'Marketing', 'Sales'],
'Salary': [50000, 70000, 90000, 60000, 80000],
'Bonus': [7000, 9000, 4000, 3000, 1000]})
df
# Use groupby with lambda to get average salary by department
avg_salary_by_dept = df.groupby('Department')[['Salary','Bonus']].apply(lambda x: x.mean())
avg_salary_by_dept
Salary | Bonus | |
---|---|---|
Department | ||
Marketing | 65000.000000 | 6000.0 |
Sales | 73333.333333 | 4000.0 |
Example 5: use multiple columns to groupby the data and apply a lambda function.¶
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
'Department': ['Sales', 'Marketing', 'Sales', 'Marketing', 'Sales'],
'Gender': ['F', 'M', 'M', 'M', 'F'],
'Salary': [50000, 70000, 90000, 60000, 80000]})
# Use groupby with lambda to get average salary by department and gender
avg_salary_by_dept_gender = df.groupby(['Department', 'Gender'])['Salary'].apply(lambda x: x.mean())
avg_salary_by_dept_gender
Department Gender Marketing M 65000.0 Sales F 65000.0 M 90000.0 Name: Salary, dtype: float64
Important - We can apply any function inside lambda, even user-defined function, and that makes it quite useful¶
def first_3(x):
y = x[0:3]
return y
df['Dept_short'] = df['Department'].apply(lambda x: first_3(x).upper())
df.head()
Name | Department | Gender | Salary | Dept_short | |
---|---|---|---|---|---|
0 | Alice | Sales | F | 50000 | SAL |
1 | Bob | Marketing | M | 70000 | MAR |
2 | Charlie | Sales | M | 90000 | SAL |
3 | David | Marketing | M | 60000 | MAR |
4 | Emily | Sales | F | 80000 | SAL |
Example 6: use multiple if conditions in a lambda function in pandas by using the ; 'and' and 'or' logical operators.¶
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
'Department': ['Sales', 'Marketing', 'Sales', 'Marketing', 'Sales'],
'Salary': [50000, 70000, 90000, 60000, 80000]})
# Use lambda function with multiple if conditions to create a new column 'Bonus'
df['Bonus'] = df['Salary'].apply(lambda x: 500 if x > 70000
else
200 if x >= 60000 and x <= 70000
else 0)
df
Name | Department | Salary | Bonus | |
---|---|---|---|---|
0 | Alice | Sales | 50000 | 0 |
1 | Bob | Marketing | 70000 | 200 |
2 | Charlie | Sales | 90000 | 500 |
3 | David | Marketing | 60000 | 200 |
4 | Emily | Sales | 80000 | 500 |
In Pandas map(), apply(), and applymap() are methods used to transform data¶
- map() is used to transform values in a Series or a single column of a DataFrame. It takes a dictionary or a function as an argument and applies the function or mapping to the Series. The result is a new Series with the transformed values.
- apply() is used to apply a function to each row or column of a DataFrame. It can be used to perform more complex transformations on the data. It can also be used to apply functions that take multiple arguments, by passing additional arguments as keyword arguments.
- applymap() is used to apply a function element-wise to every element in a DataFrame. It is similar to map, but it applies the function to every element in the DataFrame instead of just a single column.
Example 1: Using map() to transform values in a Series¶
names = pd.Series(['Alice', 'Bob', 'Charlie'])
# using map to transform the values
first_three_letters = names.map(lambda x: x[0:3])
first_three_letters
0 Ali 1 Bob 2 Cha dtype: object
Example 2: Using apply() to apply a function to each row or column of a DataFrame¶
data = {
'store1': [10, 20, 30],
'store2': [5, 10, 15],
'store3': [20, 30, 40],
'store4': [40, 50, None]
}
df = pd.DataFrame(data, index=['product1', 'product2', 'product3'])
df
store1 | store2 | store3 | store4 | |
---|---|---|---|---|
product1 | 10 | 5 | 20 | 40.0 |
product2 | 20 | 10 | 30 | 50.0 |
product3 | 30 | 15 | 40 | NaN |
# using apply to apply a function to each row, try with axis = 0 as well
df_applied= df.apply(lambda row: row.sum(), axis=1)
df_applied
product1 75.0 product2 110.0 product3 85.0 dtype: float64
Example 3: Using applymap() to apply a function element-wise to every element in a DataFrame¶
# using applymap to apply a function to every element
df.applymap(lambda x: x ** 2)
store1 | store2 | store3 | store4 | |
---|---|---|---|---|
product1 | 100 | 25 | 400 | 1600.0 |
product2 | 400 | 100 | 900 | 2500.0 |
product3 | 900 | 225 | 1600 | NaN |
The main differences between these methods are:
- map() is used to transform values in a Series or a single column of a DataFrame.
- apply() is used to apply a function to each OR 1 OR more than 1 row or column of a DataFrame.
- applymap() is used to apply a function element-wise to every element in a DataFrame.
Module 13 : Data management using Python-Pandas¶
import pandas as pd
# create two data frames
df1 = pd.DataFrame({'id': ['1001', '1002', '1003'], 'name': ['abc', 'mno', 'opq']})
df2 = pd.DataFrame({'id': ['1001', '1002', '1004'], 'salary': [1100.25, 2200.50, 3300.75]})
# merge the two data frames based 'on' the key column
merged_df = pd.merge(df1, df2, on='id')
# display the merged data frame
merged_df
id | name | salary | |
---|---|---|---|
0 | 1001 | abc | 1100.25 |
1 | 1002 | mno | 2200.50 |
- Note: The default is inner join only
Merging Data Frames with Different Merge Methods¶
Inner join or Inner Merge returns only the rows that have matching values in both data frames.¶
# merge the two data frames based on the id column using inner merge
merged_df = pd.merge(df1, df2, on='id', how='inner')
merged_df
id | name | salary | |
---|---|---|---|
0 | 1001 | abc | 1100.25 |
1 | 1002 | mno | 2200.50 |
In the "pd.merge()" function, "on" and "how" are keywords used to specify how the merge should be performed.
- "on" keyword is used to specify the column or index on which the merge should be performed
- "how" keyword is used to specify the type of merge to perform.
Left Join returns all the rows from the left data frame and the matching rows from the right data frame, and fills in missing values with NaN.¶
# merge the data frames using a left join
merged_df = pd.merge(df1, df2, on='id', how='left')
merged_df
id | name | salary | |
---|---|---|---|
0 | 1001 | abc | 1100.25 |
1 | 1002 | mno | 2200.50 |
2 | 1003 | opq | NaN |
Right Join is similar to a left join, but returns all the rows from the right data frame instead.¶
# merge the data frames using a right join
merged_df = pd.merge(df1, df2, on='id', how='right')
merged_df
id | name | salary | |
---|---|---|---|
0 | 1001 | abc | 1100.25 |
1 | 1002 | mno | 2200.50 |
2 | 1004 | NaN | 3300.75 |
Outer Join returns all the rows from both data frames, and fills in missing values with NaN where there are no matches.¶
# merge the data frames using a outer join
merged_df = pd.merge(df1,df2, on='id', how='outer')
merged_df
id | name | salary | |
---|---|---|---|
0 | 1001 | abc | 1100.25 |
1 | 1002 | mno | 2200.50 |
2 | 1003 | opq | NaN |
3 | 1004 | NaN | 3300.75 |
The "left_on" and "right_on" parameters are used to specify the names of the columns to be used for the merge operation from the left and right DataFrames respectively.¶¶
# create two data frames
df1 = pd.DataFrame({'id': ['1001', '1002', '1003'], 'name': ['abc', 'mno', 'opq']})
df2 = pd.DataFrame({'customer_id': ['1001', '1002', '1004'], 'salary': [1100.25, 2200.50, 3300.75]})
merged_data = pd.merge(df1, df2, left_on='id', right_on='customer_id', how='outer')
merged_data
id | name | customer_id | salary | |
---|---|---|---|---|
0 | 1001 | abc | 1001 | 1100.25 |
1 | 1002 | mno | 1002 | 2200.50 |
2 | 1003 | opq | NaN | NaN |
3 | NaN | NaN | 1004 | 3300.75 |
I hope you have noticed that forth ID in the last example has come black, so to avoid this issue, better we rename the ID variables and then merge¶
import pandas as pd
# create two data frames
df1 = pd.DataFrame({'id': ['1001', '1001', '1003'], 'name': ['abc', 'mno', 'opq']})
df2 = pd.DataFrame({'id': ['1001', '1001', '1004'], 'salary': [1100.25, 2200.50, 3300.75]})
# merge the two data frames based 'on' the key column
merged_df = pd.merge(df1, df2, on='id',how ='outer')
# display the merged data frame
merged_df
id | name | salary | |
---|---|---|---|
0 | 1001 | abc | 1100.25 |
1 | 1001 | abc | 2200.50 |
2 | 1001 | mno | 1100.25 |
3 | 1001 | mno | 2200.50 |
4 | 1003 | opq | NaN |
5 | 1004 | NaN | 3300.75 |
So, basically it creates a CARETSIAN product, like SQL¶
# Create two DataFrames
df1 = pd.DataFrame({'var1': [1, 2, 3], 'var2': [4, 5, 6]})
df2 = pd.DataFrame({'var3': [7, 8, 9], 'var4': [10, 11, 12]})
# Join the DataFrames on their index
result = df1.join(df2)
result
var1 | var2 | var3 | var4 | |
---|---|---|---|---|
0 | 1 | 4 | 7 | 10 |
1 | 2 | 5 | 8 | 11 |
2 | 3 | 6 | 9 | 12 |
# Create two DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [2, 3], 'D': [7, 8]})
# Join the DataFrames on their index
result = df1.join(df2, how='outer')
result
A | B | C | D | |
---|---|---|---|---|
0 | 1 | 4 | 2.0 | 7.0 |
1 | 2 | 5 | 3.0 | 8.0 |
2 | 3 | 6 | NaN | NaN |
Using the append() method:¶
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['1', '2', '3', '4']})
df2 = pd.DataFrame({'col1': [5, 6, 7], 'col3': [1005, 1006, 1007]})
df = df1.append(df2, ignore_index=True)
df
C:\Users\ragarwal\AppData\Local\Temp\ipykernel_14508\733254066.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead. df = df1.append(df2, ignore_index=True)
col1 | col2 | col3 | |
---|---|---|---|
0 | 1 | 1 | NaN |
1 | 2 | 2 | NaN |
2 | 3 | 3 | NaN |
3 | 4 | 4 | NaN |
4 | 5 | NaN | 1005.0 |
5 | 6 | NaN | 1006.0 |
6 | 7 | NaN | 1007.0 |
df.dtypes
col1 int64 col2 object col3 float64 dtype: object
#SQL Union
- Note: The ignore_index=True parameter is used to reset the index of the appended DataFrame to avoid duplicate index values.
Drop all additional columns from the appended DataFrame by specifying the 'drop()' method with the 'columns' parameter set to the list of additional columns.¶
# Create two DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'C': [10, 11, 12]})
# Append df2 to df1 and drop additional columns
result = df1.append(df2, ignore_index=True).drop(columns=df2.columns.difference(df1.columns))
result
C:\Users\ragarwal\AppData\Local\Temp\ipykernel_14508\397557649.py:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead. result = df1.append(df2, ignore_index=True).drop(columns=df2.columns.difference(df1.columns))
A | B | |
---|---|---|
0 | 1 | 4.0 |
1 | 2 | 5.0 |
2 | 3 | 6.0 |
3 | 7 | NaN |
4 | 8 | NaN |
5 | 9 | NaN |
What happens if same column is Char in one and numeric in another ?¶
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['1', '2', '3', '4']})
df2 = pd.DataFrame({'col1': ['5', '6', '7'], 'col3': [1005, 1006, 1007]})
df = df1.append(df2, ignore_index=True)
df
C:\Users\ragarwal\AppData\Local\Temp\ipykernel_14508\986494925.py:5: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead. df = df1.append(df2, ignore_index=True)
col1 | col2 | col3 | |
---|---|---|---|
0 | 1 | 1 | NaN |
1 | 2 | 2 | NaN |
2 | 3 | 3 | NaN |
3 | 4 | 4 | NaN |
4 | 5 | NaN | 1005.0 |
5 | 6 | NaN | 1006.0 |
6 | 7 | NaN | 1007.0 |
df.dtypes
col1 object col2 object col3 float64 dtype: object
Using the concat() function: This function concatenates two or more DataFrames along a specified axis.¶¶
result = pd.concat([df1, df2], ignore_index=True)
result
col1 | col2 | col3 | |
---|---|---|---|
0 | 1 | 1 | NaN |
1 | 2 | 2 | NaN |
2 | 3 | 3 | NaN |
3 | 4 | 4 | NaN |
4 | 5 | NaN | 1005.0 |
5 | 6 | NaN | 1006.0 |
6 | 7 | NaN | 1007.0 |
Using the insert() method: This method can be used to insert a new column to a DataFrame at a specified position.¶
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df.insert(1, 'new_col', [5, 6])
df
col1 | new_col | col2 | |
---|---|---|---|
0 | 1 | 5 | 3 |
1 | 2 | 6 | 4 |
- Note: The resulting DataFrame has three columns, with the new column inserted at position 1.
Module 14 : Complex data transformation techniques : transform, .pivot() .melt() .transpose()¶
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
'year': [2019, 2019, 2019, 2020, 2020, 2020],
'sales': [100, 200, 150, 120, 180, 220]
})
df
name | year | sales | |
---|---|---|---|
0 | Alice | 2019 | 100 |
1 | Bob | 2019 | 200 |
2 | Charlie | 2019 | 150 |
3 | Alice | 2020 | 120 |
4 | Bob | 2020 | 180 |
5 | Charlie | 2020 | 220 |
df['mean_sales'] = df.groupby(['year'])['sales'].transform('mean')
df
name | year | sales | mean_sales | |
---|---|---|---|---|
0 | Alice | 2019 | 100 | 150.000000 |
1 | Bob | 2019 | 200 | 150.000000 |
2 | Charlie | 2019 | 150 | 150.000000 |
3 | Alice | 2020 | 120 | 173.333333 |
4 | Bob | 2020 | 180 | 173.333333 |
5 | Charlie | 2020 | 220 | 173.333333 |
Basic syntax: df.pivot(index=None, columns=None, values=None)
- index: This is the column or list of columns that will be used to index the rows of the pivoted dataframe.
- columns: This is the column that contains the values to be used as column headers in the pivoted dataframe.
- values: This is the column that contains the values to be used as the values of the pivoted dataframe.
Example 1:¶
pivoted_df = df.pivot(index='name', columns='year', values='sales')
pivoted_df
year | 2019 | 2020 |
---|---|---|
name | ||
Alice | 100 | 120 |
Bob | 200 | 180 |
Charlie | 150 | 220 |
Example 2: Pivot with aggregation function¶
pivoted_df = df.pivot_table(index='year', values='sales', aggfunc='sum')
pivoted_df
sales | |
---|---|
year | |
2019 | 450 |
2020 | 520 |
Example 3: Pivot with multiple index and column values¶
df = pd.DataFrame({
'name': ['Alice', 'Alice', 'Bob', 'Bob', 'Charlie', 'Charlie'],
'year': [2019, 2020, 2019, 2020, 2019, 2020],
'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
'sales': [100, 120, 200, 180, 150, 220] })
df
name | year | quarter | sales | |
---|---|---|---|---|
0 | Alice | 2019 | Q1 | 100 |
1 | Alice | 2020 | Q2 | 120 |
2 | Bob | 2019 | Q1 | 200 |
3 | Bob | 2020 | Q2 | 180 |
4 | Charlie | 2019 | Q1 | 150 |
5 | Charlie | 2020 | Q2 | 220 |
pivoted_df = df.pivot(index=['name', 'quarter'], columns='year', values='sales')
pivoted_df
year | 2019 | 2020 | |
---|---|---|---|
name | quarter | ||
Alice | Q1 | 100.0 | NaN |
Q2 | NaN | 120.0 | |
Bob | Q1 | 200.0 | NaN |
Q2 | NaN | 180.0 | |
Charlie | Q1 | 150.0 | NaN |
Q2 | NaN | 220.0 |
.melt() function : Unpivot a DataFrame from wide to long format.¶
Syntax: pd.melt(DF, id_vars=, value_vars=, var_name=, value_name='value')
Parameters :
id_var: This is index columns. Columns that should remain as-is in the output.
value_var: Columns that we want to convert to rows.
var_name: The name of the new column. How do you want to name your old columns under one column
value_name: The name of the column that will contain the stacked values.
data = {'Customer_id': ['cus_01', 'cus_02', 'cus_03', 'cus_04'],
'Days': ['Saturday','Saturday','Sunday','Sunday'],
'MinDuration':[2,4,4,2],
'MaxDuration':[4,6,6,4]}
df = pd.DataFrame(data)
df
Customer_id | Days | MinDuration | MaxDuration | |
---|---|---|---|---|
0 | cus_01 | Saturday | 2 | 4 |
1 | cus_02 | Saturday | 4 | 6 |
2 | cus_03 | Sunday | 4 | 6 |
3 | cus_04 | Sunday | 2 | 4 |
df_melt=pd.melt(df, id_vars=['Customer_id','Days'],
value_vars=['MinDuration','MaxDuration'],
var_name='DurationType',
value_name='Duration')
df_melt
Customer_id | Days | DurationType | Duration | |
---|---|---|---|---|
0 | cus_01 | Saturday | MinDuration | 2 |
1 | cus_02 | Saturday | MinDuration | 4 |
2 | cus_03 | Sunday | MinDuration | 4 |
3 | cus_04 | Sunday | MinDuration | 2 |
4 | cus_01 | Saturday | MaxDuration | 4 |
5 | cus_02 | Saturday | MaxDuration | 6 |
6 | cus_03 | Sunday | MaxDuration | 6 |
7 | cus_04 | Sunday | MaxDuration | 4 |
Example 1 : The shortcut for the transpose() function is .T¶
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [20, 30, 35],
'city': ['Boston', 'Oakland', 'San Francisco']
})
df
name | age | city | |
---|---|---|---|
0 | Alice | 20 | Boston |
1 | Bob | 30 | Oakland |
2 | Charlie | 35 | San Francisco |
transposed_df = df.T
transposed_df
0 | 1 | 2 | |
---|---|---|---|
name | Alice | Bob | Charlie |
age | 20 | 30 | 35 |
city | Boston | Oakland | San Francisco |
The resulting dataframe has an index made up of the column names from the original dataframe, and columns made up of the original row values.
Example 2 : Transpose a DataFrame with a DateTimeIndex¶
idx = pd.date_range('2021-01-01', periods=3, freq='D')
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=idx)
df
A | B | C | |
---|---|---|---|
2021-01-01 | 1 | 4 | 7 |
2021-01-02 | 2 | 5 | 8 |
2021-01-03 | 3 | 6 | 9 |
new_df = df.T
new_df
2021-01-01 | 2021-01-02 | 2021-01-03 | |
---|---|---|---|
A | 1 | 2 | 3 |
B | 4 | 5 | 6 |
C | 7 | 8 | 9 |
Module 15 : Miscellaneous¶
# how to sort a data
#importing a csv file with pd.read_csv
df = pd.read_csv("input_datsets/Sample_Fraud_Detection.csv")
df.head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2/24/2015 22:55 | 4/18/2015 2:47 | 65278 | QVPSPJUOCKZAR | SEO | Chrome | M | 7.327584e+08 | 0 | home_essentials | 2/22/1976 | Aaron P Maashoh | aaron.maashoh@yahoo.com |
1 | 333320 | 6/7/2015 20:39 | 6/8/2015 1:38 | 96399 | EOGFQPIZPYXFZ | Ads | Chrome | F | 3.503114e+08 | 0 | apparels | 1/2/1962 | Rick D Rothackerj | rick_rothackerj@yahoo.com |
2 | 1359 | 1/1/2015 18:52 | 1/1/2015 18:52 | 57296 | YSSKYOSJHPPLJ | SEO | Opera | M | 2.621474e+09 | 1 | electronics | 3/5/1962 | Harriet M McLeodd | harriet.mcleodd@gmail.com |
3 | 150084 | 4/28/2015 21:13 | 5/4/2015 13:54 | 43650 | ATGTXKYKUDUQN | SEO | Safari | M | 3.840542e+09 | 0 | health_care | 7/3/1974 | Sinead M Carews | sinead.carews@gmail.com |
4 | 221365 | 7/21/2015 7:09 | 9/9/2015 18:40 | 45016 | NAUITBZFJKHWW | Ads | Safari | M | 4.155831e+08 | 0 | home_essentials | 8/25/1970 | Laurence H Frosty | laurence.frosty@tntech.edu |
df.sort_values(['source','browser'],ascending= [True,False],inplace=True,ignore_index=True)
df.head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 221365 | 7/21/2015 7:09 | 9/9/2015 18:40 | 45016 | NAUITBZFJKHWW | Ads | Safari | M | 4.155831e+08 | 0 | home_essentials | 8/25/1970 | Laurence H Frosty | laurence.frosty@tntech.edu |
1 | 199700 | 7/11/2015 18:26 | 10/28/2015 21:59 | 7603 | TEPSJVVXGNTYR | Ads | Safari | F | 1.819009e+09 | 0 | health_care | 6/17/1980 | Matt U Falloonz | matt.falloonz@brandermill.com |
2 | 119824 | 3/20/2015 0:31 | 4/5/2015 7:31 | 51106 | WFIIFCPIOGMHT | Ads | Safari | M | 1.314238e+05 | 0 | cosmetics | 1/16/1977 | Herbert P Lashb | herbert.lashb@gmail.com |
3 | 192767 | 1/8/2015 12:17 | 4/28/2015 2:09 | 85643 | ZTORJKTDMYTJH | Ads | Safari | M | 2.568640e+09 | 0 | electronics | 2/21/1983 | Costa P Alisterq | costa.alisterq@flossresearch.com |
4 | 310147 | 7/6/2015 5:45 | 7/22/2015 4:19 | 70400 | GAPQDLEFDZYCO | Ads | Safari | F | 2.626781e+09 | 0 | apparels | 8/20/1995 | Aileen E Wangy | aileen.wangy@brandermill.com |
# How to delete all the data from a Dataframe
df.drop(df.index , inplace=True)
df.head()
user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | ip_address | class | category | dob | name |
---|
No comments:
Post a Comment
Do provide us your feedback, it would help us serve your better.