Module 0: Before jumping to Pyspark, basics of Python as Pandas should be explored¶
Python basics, concepts of User-Defined Functions (UDF) and Lambda Functions , connection of to Python to Oracle, SQL Server, MySQL, and PostgreSQL can be explored from the following link:¶
Module 1: Importing data in as PySpark - with read.csv and read_excel¶
In [1]:
#if not already installed, use below to install PySpark:
# !pip install pyspark
In [2]:
import pyspark.sql
In [3]:
import warnings
warnings.filterwarnings('ignore')
# to eanble warning
# import warnings
# warnings.filterwarnings('default')
In [4]:
# Following commands are optional, these command install necessary packages required to run PySpark in your system
# It might be possible that your system already have these packages installed.
In [5]:
#conda install -c conda-forge openjdk
In [6]:
#conda update -n base -c defaults conda
In [7]:
#conda install -c conda-forge findspark
In [8]:
#In the spark session, we can customize the RAM consumption
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName('test').config("spark.driver.memory", "16g").getOrCreate()
In [9]:
#Lets start reading our first data
df1 = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
df1.show(2)
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ |user_id| signup_time| purchase_time|purchase_value| device_id|source|browser|sex| ip_address|class| category| dob| name| email| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ | 22058|2/24/2015 22:55|4/18/2015 2:47| 65278|QVPSPJUOCKZAR| SEO| Chrome| M|732758368.8| 0|home_essentials|2/22/1976| Aaron P Maashoh|aaron.maashoh@yah...| | 333320| 6/7/2015 20:39| 6/8/2015 1:38| 96399|EOGFQPIZPYXFZ| Ads| Chrome| F|350311387.9| 0| apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ only showing top 2 rows
In [10]:
df1.dtypes #Will give the columns' names and data type
Out[10]:
[('user_id', 'string'), ('signup_time', 'string'), ('purchase_time', 'string'), ('purchase_value', 'string'), ('device_id', 'string'), ('source', 'string'), ('browser', 'string'), ('sex', 'string'), ('ip_address', 'string'), ('class', 'string'), ('category', 'string'), ('dob', 'string'), ('name', 'string'), ('email', 'string')]
In [11]:
# Importing all functions form PySpark SQL at once
from pyspark.sql.functions import *
Suppose the data has few rows blank in the top of file, or with some narration, we can skip rows¶
In [12]:
# Step 1 - We read few records on the top to get column names
df2 = spark.read.format('csv').load('input_datsets/Sample_Fraud_Detection _with_comments.csv'
,header=False,inferSchema=True).withColumn("index",monotonically_increasing_id()
).filter((col("index") >=4) & (col("index") <=6)).drop("index")
df2.show()
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ | _c0| _c1| _c2| _c3| _c4| _c5| _c6|_c7| _c8| _c9| _c10| _c11| _c12| _c13| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ |user_id| signup_time| purchase_time|purchase_value| device_id|source|browser|sex| ip_address|class| category| dob| name| email| | 22058|2/24/2015 22:55|4/18/2015 2:47| 65278|QVPSPJUOCKZAR| SEO| Chrome| M|732758368.8| 0|home_essentials|2/22/1976| Aaron P Maashoh|aaron.maashoh@yah...| | 333320| 6/7/2015 20:39| 6/8/2015 1:38| 96399|EOGFQPIZPYXFZ| Ads| Chrome| F|350311387.9| 0| apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
In [13]:
# Step 2 - Let's read the data where it begins at
df_main = spark.read.format('csv').load('input_datsets/Sample_Fraud_Detection _with_comments.csv'
,header=False,inferSchema=True).withColumn("index", monotonically_increasing_id()
).filter(col("index") >= 5).drop("index")
df_main.show(2)
+------+---------------+--------------+-----+-------------+---+------+---+-----------+---+---------------+---------+-----------------+--------------------+ | _c0| _c1| _c2| _c3| _c4|_c5| _c6|_c7| _c8|_c9| _c10| _c11| _c12| _c13| +------+---------------+--------------+-----+-------------+---+------+---+-----------+---+---------------+---------+-----------------+--------------------+ | 22058|2/24/2015 22:55|4/18/2015 2:47|65278|QVPSPJUOCKZAR|SEO|Chrome| M|732758368.8| 0|home_essentials|2/22/1976| Aaron P Maashoh|aaron.maashoh@yah...| |333320| 6/7/2015 20:39| 6/8/2015 1:38|96399|EOGFQPIZPYXFZ|Ads|Chrome| F|350311387.9| 0| apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...| +------+---------------+--------------+-----+-------------+---+------+---+-----------+---+---------------+---------+-----------------+--------------------+ only showing top 2 rows
In [14]:
# Saving column names in a separate object
df3 = df2.head().asDict().values()
print(df3)
dict_values(['user_id', 'signup_time', 'purchase_time', 'purchase_value', 'device_id', 'source', 'browser', 'sex', 'ip_address', 'class', 'category', 'dob', 'name', 'email'])
In [15]:
my_list = list(df3)
print(my_list)
['user_id', 'signup_time', 'purchase_time', 'purchase_value', 'device_id', 'source', 'browser', 'sex', 'ip_address', 'class', 'category', 'dob', 'name', 'email']
In [16]:
# toDF can be used to rename variables, it requires a "list with new variable names" as input
df_final = df_main.toDF(*my_list)
df_final.show(2)
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ |user_id| signup_time| purchase_time|purchase_value| device_id|source|browser|sex| ip_address|class| category| dob| name| email| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ | 22058|2/24/2015 22:55|4/18/2015 2:47| 65278|QVPSPJUOCKZAR| SEO| Chrome| M|732758368.8| 0|home_essentials|2/22/1976| Aaron P Maashoh|aaron.maashoh@yah...| | 333320| 6/7/2015 20:39| 6/8/2015 1:38| 96399|EOGFQPIZPYXFZ| Ads| Chrome| F|350311387.9| 0| apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ only showing top 2 rows
In [17]:
# let's try modifying the list and then use it for renaming the variables
my_list[11] = 'DATE_OF_BIRTH'
print(my_list)
['user_id', 'signup_time', 'purchase_time', 'purchase_value', 'device_id', 'source', 'browser', 'sex', 'ip_address', 'class', 'category', 'DATE_OF_BIRTH', 'name', 'email']
In [18]:
df_final = df_final.toDF(*my_list)
df_final.show(2)
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-------------+-----------------+--------------------+ |user_id| signup_time| purchase_time|purchase_value| device_id|source|browser|sex| ip_address|class| category|DATE_OF_BIRTH| name| email| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-------------+-----------------+--------------------+ | 22058|2/24/2015 22:55|4/18/2015 2:47| 65278|QVPSPJUOCKZAR| SEO| Chrome| M|732758368.8| 0|home_essentials| 2/22/1976| Aaron P Maashoh|aaron.maashoh@yah...| | 333320| 6/7/2015 20:39| 6/8/2015 1:38| 96399|EOGFQPIZPYXFZ| Ads| Chrome| F|350311387.9| 0| apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-------------+-----------------+--------------------+ only showing top 2 rows
In [19]:
#Let's delete unnecessary datasets, equivalent to Proc delete of SAS
del df_main , df2
In [20]:
# df_main.show() # please un-comment and run to check that objects have been dropped. This will thorw error!
Module 2: Let's do some basic check of data¶
In [21]:
# check top 3 rows
df1 = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
df1.head(3)
Out[21]:
[Row(user_id='22058', signup_time='2/24/2015 22:55', purchase_time='4/18/2015 2:47', purchase_value='65278', device_id='QVPSPJUOCKZAR', source='SEO', browser='Chrome', sex='M', ip_address='732758368.8', class='0', category='home_essentials', dob='2/22/1976', name='Aaron P Maashoh', email='aaron.maashoh@yahoo.com'), Row(user_id='333320', signup_time='6/7/2015 20:39', purchase_time='6/8/2015 1:38', purchase_value='96399', device_id='EOGFQPIZPYXFZ', source='Ads', browser='Chrome', sex='F', ip_address='350311387.9', class='0', category='apparels', dob='1/2/1962', name='Rick D Rothackerj', email='rick_rothackerj@yahoo.com'), Row(user_id='1359', signup_time='1/1/2015 18:52', purchase_time='1/1/2015 18:52', purchase_value='57296', device_id='YSSKYOSJHPPLJ', source='SEO', browser='Opera', sex='M', ip_address='2621473820', class='1', category='electronics', dob='3/5/1962', name='Harriet M McLeodd', email='harriet.mcleodd@gmail.com')]
In [22]:
# check bottom 4 rows
df1.tail(4)
Out[22]:
[Row(user_id='328579', signup_time='4/10/2015 1:02', purchase_time='6/25/2015 2:05', purchase_value='54775', device_id='UQCGEFBSOAVIJ', source='SEO', browser='Chrome', sex='M', ip_address='1038384255', class='0', category='electronics', dob='4/6/1994', name='Praveen L Menonw', email='praveen.menonw@gmail.com'), Row(user_id='263174', signup_time='7/14/2015 2:13', purchase_time='7/24/2015 1:25', purchase_value='84302', device_id='HVSGFVZSDBKEH', source='Ads', browser='Safari', sex='M', ip_address='4169079617', class='0', category='electronics', dob='9/1/1979', name='Ra T Alpern', email='ra.alpern@flossresearch.com'), Row(user_id='117191', signup_time='2/28/2015 17:14', purchase_time='4/14/2015 15:39', purchase_value='20193', device_id='GDHCTKIKPHENW', source='Direct', browser='IE', sex='M', ip_address='1411318028', class='0', category='apparels', dob='11/17/1979', name='Julia D Edwardsj', email='julia.edwardsj@sutletgroup.com'), Row(user_id='65732', signup_time='1/11/2015 11:10', purchase_time='1/22/2015 4:16', purchase_value='26225', device_id='VSMNAOFPSEQOL', source='Ads', browser='IE', sex='M', ip_address='3765208398', class='0', category='apparels', dob='3/23/1988', name='Conor V Humphriesk', email='conor.humphriesk@gmail.com')]
In [23]:
# Columns and data types
print(df1)
DataFrame[user_id: string, signup_time: string, purchase_time: string, purchase_value: string, device_id: string, source: string, browser: string, sex: string, ip_address: string, class: string, category: string, dob: string, name: string, email: string]
In [24]:
# Convert the "numeric value" column from string to integer/float
df1 = df1.withColumn("purchase_value", col("purchase_value").cast("float"))
df1 = df1.withColumn("class", col("class").cast("integer"))
In [25]:
print(df1)
DataFrame[user_id: string, signup_time: string, purchase_time: string, purchase_value: float, device_id: string, source: string, browser: string, sex: string, ip_address: string, class: int, category: string, dob: string, name: string, email: string]
In [26]:
mean_value = df1.select(mean("purchase_value").alias("avg_value"))
mean_value.show()
+-----------------+ | avg_value| +-----------------+ |50992.49149149149| +-----------------+
In [27]:
# Basic summarization of a variable
mean_value = df1.select(mean("purchase_value")).collect()[0][0]
max1 = df1.select(max("purchase_value")).collect()[0][0]
min1 = df1.select(min("purchase_value")).collect()[0][0]
sum1 = df1.select(sum("purchase_value")).collect()[0][0]
print("Mean = " , mean_value)
print("Max = " , max1)
print("Min = " , min1)
print("Sum = " , sum1)
Mean = 50992.49149149149 Max = 99743.0 Min = 1176.0 Sum = 50941499.0
In [28]:
# Let's find the mean of all the numeric variables in a dataframe
nunumeric_cols = [c[0] for c in df1.dtypes if c[1] in ("double", "float", "int", "long")]
print (nunumeric_cols)
['purchase_value', 'class']
In [29]:
length = len(nunumeric_cols)
print(length)
2
In [30]:
for i in range(len(nunumeric_cols)):
mean_value = df1.select(mean(nunumeric_cols[i])).collect()[0][0]
max1 = df1.select(max(nunumeric_cols[i])).collect()[0][0]
min1 = df1.select(min(nunumeric_cols[i])).collect()[0][0]
sum1 = df1.select(sum(nunumeric_cols[i])).collect()[0][0]
print('Mean Value of ', nunumeric_cols[i], 'is' ,mean_value)
print("Max Value of ", nunumeric_cols[i], "is" ,max1)
print("Min Value of " ,nunumeric_cols[i], "is" ,min1)
print("Sum Value of ", nunumeric_cols[i] ,"is" ,sum1)
Mean Value of purchase_value is 50992.49149149149 Max Value of purchase_value is 99743.0 Min Value of purchase_value is 1176.0 Sum Value of purchase_value is 50941499.0 Mean Value of class is 0.1011011011011011 Max Value of class is 1 Min Value of class is 0 Sum Value of class is 101
In [31]:
# Use describe function to find the basis stats of a numeric variable
df1.select('purchase_value').describe().show()
+-------+------------------+ |summary| purchase_value| +-------+------------------+ | count| 999| | mean| 50992.49149149149| | stddev|28681.076653460594| | min| 1176.0| | max| 99743.0| +-------+------------------+
In [32]:
# To check the basis stats of all numeric variables,
# we can do it using for loop and iterating it over all numeric variables
for i in range(len(nunumeric_cols)):
df1.select(nunumeric_cols[i]).describe().show()
+-------+------------------+ |summary| purchase_value| +-------+------------------+ | count| 999| | mean| 50992.49149149149| | stddev|28681.076653460594| | min| 1176.0| | max| 99743.0| +-------+------------------+ +-------+-------------------+ |summary| class| +-------+-------------------+ | count| 999| | mean| 0.1011011011011011| | stddev|0.30161354453884753| | min| 0| | max| 1| +-------+-------------------+
In [33]:
# To know the frequency distribution of a string/categorical vairable, we can use following groupBy function
# The output is similar to Proc Freq of SAS
df_freq = df1.groupBy('browser').agg(count('browser').alias('frequency'))
df_freq.show()
+-------+---------+ |browser|frequency| +-------+---------+ |FireFox| 155| | Safari| 166| | IE| 260| | Chrome| 397| | Opera| 21| +-------+---------+
In [34]:
# We can sort the output based on frequency, "desc" is an optional argument for sort data in descending way
df_freq = df1.groupBy('browser').agg(count('browser').alias('frequency')).orderBy(desc('frequency'))
df_freq.show()
+-------+---------+ |browser|frequency| +-------+---------+ | Chrome| 397| | IE| 260| | Safari| 166| |FireFox| 155| | Opera| 21| +-------+---------+
In [35]:
# Count gives number of records in a data
df1.count()
Out[35]:
999
In [36]:
# We can use following method to get the frequency and %age distribution
df_freq = df1.groupBy('browser').agg(count('browser').alias('frequency'),(count('browser')
/ df1.count() *100).alias('percentage')).orderBy(desc('frequency'))
df_freq.show()
+-------+---------+------------------+ |browser|frequency| percentage| +-------+---------+------------------+ | Chrome| 397| 39.73973973973974| | IE| 260|26.026026026026027| | Safari| 166|16.616616616616618| |FireFox| 155|15.515515515515515| | Opera| 21|2.1021021021021022| +-------+---------+------------------+
In [37]:
df_freq_1 = df_freq.select('browser','frequency', bround('percentage',1).alias('percent'))
df_freq_1.show()
+-------+---------+-------+ |browser|frequency|percent| +-------+---------+-------+ | Chrome| 397| 39.7| | IE| 260| 26.0| | Safari| 166| 16.6| |FireFox| 155| 15.5| | Opera| 21| 2.1| +-------+---------+-------+
In [38]:
# let's iterate the above method for all char /categorical variables
categorical_cols = [c[0] for c in df1.dtypes if c[1] in ('category', 'object' , 'string')]
print(categorical_cols)
['user_id', 'signup_time', 'purchase_time', 'device_id', 'source', 'browser', 'sex', 'ip_address', 'category', 'dob', 'name', 'email']
In [39]:
# Let's delete few columns from the list, that need not be analyzed
to_delete = ['user_id','device_id','signup_time','purchase_time','name','dob','email','ip_address']
for x in to_delete:
if x in categorical_cols:
categorical_cols.remove(x)
print(categorical_cols)
['source', 'browser', 'sex', 'category']
In [40]:
for col in categorical_cols:
df_freq = df1.groupBy(col).agg(
count(col).alias('frequency'),
(count(col)/df1.count() * 100).alias('percentage')
).orderBy(col)
df_freq.show()
+------+---------+-----------------+ |source|frequency| percentage| +------+---------+-----------------+ | Ads| 408|40.84084084084084| |Direct| 192|19.21921921921922| | SEO| 399|39.93993993993994| +------+---------+-----------------+ +-------+---------+------------------+ |browser|frequency| percentage| +-------+---------+------------------+ | Chrome| 397| 39.73973973973974| |FireFox| 155|15.515515515515515| | IE| 260|26.026026026026027| | Opera| 21|2.1021021021021022| | Safari| 166|16.616616616616618| +-------+---------+------------------+ +----+---------+-----------------+ | sex|frequency| percentage| +----+---------+-----------------+ |null| 0| 0.0| | F| 414|41.44144144144144| | M| 560|56.05605605605606| +----+---------+-----------------+ +---------------+---------+-----------------+ | category|frequency| percentage| +---------------+---------+-----------------+ | apparels| 204|20.42042042042042| | cosmetics| 210|21.02102102102102| | electronics| 204|20.42042042042042| | health_care| 187|18.71871871871872| |home_essentials| 194|19.41941941941942| +---------------+---------+-----------------+
In [41]:
# Find number of missing/NULL values in a field
null_counts = df1.agg(sum(df1.sex.isNull().cast("int")).alias('missing_sex'))
null_counts.show()
+-----------+ |missing_sex| +-----------+ | 25| +-----------+
In [42]:
#Use following codes to identify how many NULL values are there in each field
# This is equivalent to df.isna().sum() of Pandas
from pyspark.sql.functions import col, sum
null_counts = df1.agg(*[sum(col(c).isNull().cast("int")).alias(c) for c in df1.columns])
null_counts.show(vertical = True)
-RECORD 0------------- user_id | 0 signup_time | 0 purchase_time | 0 purchase_value | 0 device_id | 0 source | 0 browser | 0 sex | 25 ip_address | 0 class | 0 category | 0 dob | 29 name | 0 email | 0
In [43]:
# to find distinct values of a single column
distinct_values = df1.select("browser").distinct().collect()
print(distinct_values)
[Row(browser='FireFox'), Row(browser='Safari'), Row(browser='IE'), Row(browser='Chrome'), Row(browser='Opera')]
In [44]:
# To find count of unique value of each column
unique_values_count = df1.agg(countDistinct('signup_time').alias('unique_count'))
unique_values_count.show()
+------------+ |unique_count| +------------+ | 995| +------------+
In [45]:
distinct_counts = df1.select(*[countDistinct(c).alias(c) for c in df1.columns])
distinct_counts.show(vertical = True)
-RECORD 0------------- user_id | 999 signup_time | 995 purchase_time | 998 purchase_value | 994 device_id | 994 source | 3 browser | 5 sex | 2 ip_address | 995 class | 2 category | 5 dob | 917 name | 999 email | 998
In [46]:
# Sometimes with a very large data, we dont need a precise result, rather a "best BUT quick guess"
# for that we can use approx_count_distinct, which is pretty fast in processing
distinct_counts = df1.select(*[approx_count_distinct(c).alias(c) for c in df1.columns])
distinct_counts.show(vertical = True)
-RECORD 0-------------- user_id | 1073 signup_time | 1098 purchase_time | 1124 purchase_value | 925 device_id | 890 source | 3 browser | 5 sex | 2 ip_address | 983 class | 2 category | 5 dob | 950 name | 1012 email | 1073
Following is equivalent to proc Freq functionality in SAS¶
In [47]:
value_counts = df1.groupBy("source","browser").count()
value_counts.show()
+------+-------+-----+ |source|browser|count| +------+-------+-----+ | Ads| Opera| 9| | SEO| Opera| 9| | Ads| Safari| 75| |Direct| IE| 48| | Ads| Chrome| 151| | SEO| Safari| 58| |Direct|FireFox| 27| | SEO| Chrome| 165| |Direct| Chrome| 81| |Direct| Opera| 3| | Ads| IE| 109| | SEO| IE| 103| | Ads|FireFox| 64| | SEO|FireFox| 64| |Direct| Safari| 33| +------+-------+-----+
In [48]:
# we can use count distinct as well
unique_values_count3 = df1.groupBy("source","browser").agg(countDistinct('signup_time').alias('unique_count'))
unique_values_count3.show()
+------+-------+------------+ |source|browser|unique_count| +------+-------+------------+ | Ads| Opera| 9| | SEO| Opera| 9| | Ads| Safari| 73| |Direct| IE| 47| | Ads| Chrome| 151| | SEO| Safari| 58| |Direct|FireFox| 27| | SEO| Chrome| 165| |Direct| Chrome| 81| |Direct| Opera| 3| | Ads| IE| 109| | SEO| IE| 102| | Ads|FireFox| 64| | SEO|FireFox| 64| |Direct| Safari| 33| +------+-------+------------+
Module 3: Useful fuctions to get basic information about the data¶
.printSchema()- will also give information about missing values¶
In [49]:
df1.printSchema()
root |-- user_id: string (nullable = true) |-- signup_time: string (nullable = true) |-- purchase_time: string (nullable = true) |-- purchase_value: float (nullable = true) |-- device_id: string (nullable = true) |-- source: string (nullable = true) |-- browser: string (nullable = true) |-- sex: string (nullable = true) |-- ip_address: string (nullable = true) |-- class: integer (nullable = true) |-- category: string (nullable = true) |-- dob: string (nullable = true) |-- name: string (nullable = true) |-- email: string (nullable = true)
In [50]:
#To get all the columns of data frame
df1.dtypes
Out[50]:
[('user_id', 'string'), ('signup_time', 'string'), ('purchase_time', 'string'), ('purchase_value', 'float'), ('device_id', 'string'), ('source', 'string'), ('browser', 'string'), ('sex', 'string'), ('ip_address', 'string'), ('class', 'int'), ('category', 'string'), ('dob', 'string'), ('name', 'string'), ('email', 'string')]
In [51]:
# gives numbers of records
num_rows = df1.count()
print("Number of rows: {}".format(num_rows))
Number of rows: 999
In [52]:
# Gives numbers of records and number of fields in data
num_rows = df1.count()
num_cols = len(df1.columns)
print("Shape of DataFrame: ({}, {})".format(num_rows, num_cols))
Shape of DataFrame: (999, 14)
Module 4 : Subsetting data Vertically and Horizontally (selecting subset of a DataFrame)¶
In [53]:
# Selecting only few columns from a DataFrame
df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
sample = df.select("name")
sample.show(2)
+-----------------+ | name| +-----------------+ | Aaron P Maashoh| |Rick D Rothackerj| +-----------------+ only showing top 2 rows
In [54]:
sample_1 = df.select("name","dob","sex")
sample_1.show(2)
+-----------------+---------+---+ | name| dob|sex| +-----------------+---------+---+ | Aaron P Maashoh|2/22/1976| M| |Rick D Rothackerj| 1/2/1962| F| +-----------------+---------+---+ only showing top 2 rows
In [55]:
# We can create a monotonic - row id (_N_ equivalent of SAS) following way
sample_2 = sample_1.withColumn("index", monotonically_increasing_id())
sample_2.show(5)
+-----------------+---------+---+-----+ | name| dob|sex|index| +-----------------+---------+---+-----+ | Aaron P Maashoh|2/22/1976| M| 0| |Rick D Rothackerj| 1/2/1962| F| 1| |Harriet M McLeodd| 3/5/1962| M| 2| | Sinead M Carews| 7/3/1974| M| 3| |Laurence H Frosty|8/25/1970| M| 4| +-----------------+---------+---+-----+ only showing top 5 rows
In [56]:
# Selecting few rows from DataFrame
sample_2=sample_1.withColumn("index", monotonically_increasing_id()).filter(col("index") >=995).drop("index")
sample_2.show()
+------------------+----------+---+ | name| dob|sex| +------------------+----------+---+ | Praveen L Menonw| 4/6/1994| M| | Ra T Alpern| 9/1/1979| M| | Julia D Edwardsj|11/17/1979| M| |Conor V Humphriesk| 3/23/1988| M| +------------------+----------+---+
In [57]:
# Selecting range of rows from DataFrame - 3rd to 10the record
sample_2=sample_1.withColumn("index", monotonically_increasing_id()
).filter((col("index") >=3) & (col("index") <=10) )
sample_2.show()
+--------------------+----------+---+-----+ | name| dob|sex|index| +--------------------+----------+---+-----+ | Sinead M Carews| 7/3/1974| M| 3| | Laurence H Frosty| 8/25/1970| M| 4| | Lee Z Chyenz| 4/8/1997| M| 5| |Dhanya J Skariachano| 5/1/1996| F| 6| | Lisa E Baertleinu|11/24/1981| M| 7| | Andreas B Cremerf| 1/22/1972| F| 8| | Ra J Alperx| 2/6/1984| M| 9| | Matt U Falloonz| 6/17/1980| F| 10| +--------------------+----------+---+-----+
How to Reaname the Variables¶
In [58]:
# Method 1 - Basic method
method_1 = df.withColumnRenamed('sex','gender') \
.withColumnRenamed('dob','birth_date') \
.withColumnRenamed('email','email_id')
method_1.show(2,vertical = False, truncate = False )
+-------+---------------+--------------+--------------+-------------+------+-------+------+-----------+-----+---------------+----------+-----------------+-------------------------+ |user_id|signup_time |purchase_time |purchase_value|device_id |source|browser|gender|ip_address |class|category |birth_date|name |email_id | +-------+---------------+--------------+--------------+-------------+------+-------+------+-----------+-----+---------------+----------+-----------------+-------------------------+ |22058 |2/24/2015 22:55|4/18/2015 2:47|65278 |QVPSPJUOCKZAR|SEO |Chrome |M |732758368.8|0 |home_essentials|2/22/1976 |Aaron P Maashoh |aaron.maashoh@yahoo.com | |333320 |6/7/2015 20:39 |6/8/2015 1:38 |96399 |EOGFQPIZPYXFZ|Ads |Chrome |F |350311387.9|0 |apparels |1/2/1962 |Rick D Rothackerj|rick_rothackerj@yahoo.com| +-------+---------------+--------------+--------------+-------------+------+-------+------+-----------+-----+---------------+----------+-----------------+-------------------------+ only showing top 2 rows
In [59]:
# Method 2 - Little complex
df.createOrReplaceTempView("temp")
method_2 = spark.sql("SELECT * FROM temp").withColumnRenamed("Name", "Name_full").select("Name_full", "dob", "sex")
# Show the DataFrame with index
method_2.show(2)
+-----------------+---------+---+ | Name_full| dob|sex| +-----------------+---------+---+ | Aaron P Maashoh|2/22/1976| M| |Rick D Rothackerj| 1/2/1962| F| +-----------------+---------+---+ only showing top 2 rows
In [60]:
# Method 3 - Simpler for legacy SQL users
df.createOrReplaceTempView("temp")
# Or we can simply do renaming within SQL query
method_3 = spark.sql("SELECT name as Name_full, dob, sex FROM temp")
# Show the DataFrame with index
method_3.show(2)
+-----------------+---------+---+ | Name_full| dob|sex| +-----------------+---------+---+ | Aaron P Maashoh|2/22/1976| M| |Rick D Rothackerj| 1/2/1962| F| +-----------------+---------+---+ only showing top 2 rows
In [61]:
# We have used it above as well, this is the method to create the row index - similar to _N_ of SAS
df_index = df.withColumn("index", monotonically_increasing_id())
df_index.show(2)
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+-----+ |user_id| signup_time| purchase_time|purchase_value| device_id|source|browser|sex| ip_address|class| category| dob| name| email|index| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+-----+ | 22058|2/24/2015 22:55|4/18/2015 2:47| 65278|QVPSPJUOCKZAR| SEO| Chrome| M|732758368.8| 0|home_essentials|2/22/1976| Aaron P Maashoh|aaron.maashoh@yah...| 0| | 333320| 6/7/2015 20:39| 6/8/2015 1:38| 96399|EOGFQPIZPYXFZ| Ads| Chrome| F|350311387.9| 0| apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...| 1| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+-----+ only showing top 2 rows
In [62]:
# Select columns based on position i.e. Index
df2 = df[df.columns[1:5]]
df2.show(2)
+---------------+--------------+--------------+-------------+ | signup_time| purchase_time|purchase_value| device_id| +---------------+--------------+--------------+-------------+ |2/24/2015 22:55|4/18/2015 2:47| 65278|QVPSPJUOCKZAR| | 6/7/2015 20:39| 6/8/2015 1:38| 96399|EOGFQPIZPYXFZ| +---------------+--------------+--------------+-------------+ only showing top 2 rows
In [63]:
# Select a set of Columns between Labels(column-names), in following example we need columns from user_id to dob
df3 = df.select(df.columns[df.columns.index('signup_time'):df.columns.index('device_id') + 1])
# show the resulting DataFrame
df3.show(2)
+---------------+--------------+--------------+-------------+ | signup_time| purchase_time|purchase_value| device_id| +---------------+--------------+--------------+-------------+ |2/24/2015 22:55|4/18/2015 2:47| 65278|QVPSPJUOCKZAR| | 6/7/2015 20:39| 6/8/2015 1:38| 96399|EOGFQPIZPYXFZ| +---------------+--------------+--------------+-------------+ only showing top 2 rows
In [64]:
# Add a unique ID column to the DataFrame
df_with_id = df.withColumn("id", monotonically_increasing_id())
# Filter out the rows with few selected IDs
df_skipped_rows = df_with_id.filter(df_with_id.id.isin([0, 2, 4, 7]))
df_skipped_rows.show(2)
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+---+ |user_id| signup_time| purchase_time|purchase_value| device_id|source|browser|sex| ip_address|class| category| dob| name| email| id| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+---+ | 22058|2/24/2015 22:55|4/18/2015 2:47| 65278|QVPSPJUOCKZAR| SEO| Chrome| M|732758368.8| 0|home_essentials|2/22/1976| Aaron P Maashoh|aaron.maashoh@yah...| 0| | 1359| 1/1/2015 18:52|1/1/2015 18:52| 57296|YSSKYOSJHPPLJ| SEO| Opera| M| 2621473820| 1| electronics| 3/5/1962|Harriet M McLeodd|harriet.mcleodd@g...| 2| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+---+ only showing top 2 rows
How to drop variables?¶
In [65]:
s1 = df.drop("dob")
s1.show(2)
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+--------------------+ |user_id| signup_time| purchase_time|purchase_value| device_id|source|browser|sex| ip_address|class| category| name| email| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+--------------------+ | 22058|2/24/2015 22:55|4/18/2015 2:47| 65278|QVPSPJUOCKZAR| SEO| Chrome| M|732758368.8| 0|home_essentials| Aaron P Maashoh|aaron.maashoh@yah...| | 333320| 6/7/2015 20:39| 6/8/2015 1:38| 96399|EOGFQPIZPYXFZ| Ads| Chrome| F|350311387.9| 0| apparels|Rick D Rothackerj|rick_rothackerj@y...| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+--------------------+ only showing top 2 rows
In [66]:
s2 = df.drop("dob","email","purchase_time")
s2.show(2)
+-------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+ |user_id| signup_time|purchase_value| device_id|source|browser|sex| ip_address|class| category| name| +-------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+ | 22058|2/24/2015 22:55| 65278|QVPSPJUOCKZAR| SEO| Chrome| M|732758368.8| 0|home_essentials| Aaron P Maashoh| | 333320| 6/7/2015 20:39| 96399|EOGFQPIZPYXFZ| Ads| Chrome| F|350311387.9| 0| apparels|Rick D Rothackerj| +-------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+ only showing top 2 rows
In [67]:
# Remove all columns between column index 1 to 4
s3 = df.drop(*df.columns[1:5])
s3.show(2)
+-------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ |user_id|source|browser|sex| ip_address|class| category| dob| name| email| +-------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ | 22058| SEO| Chrome| M|732758368.8| 0|home_essentials|2/22/1976| Aaron P Maashoh|aaron.maashoh@yah...| | 333320| Ads| Chrome| F|350311387.9| 0| apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...| +-------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ only showing top 2 rows
Module 5: Conditional filter and LOGICAL operators similar to where clause in SAS/SQL¶
Extract data with a single condition¶
In [68]:
df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
df.show(2)
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ |user_id| signup_time| purchase_time|purchase_value| device_id|source|browser|sex| ip_address|class| category| dob| name| email| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ | 22058|2/24/2015 22:55|4/18/2015 2:47| 65278|QVPSPJUOCKZAR| SEO| Chrome| M|732758368.8| 0|home_essentials|2/22/1976| Aaron P Maashoh|aaron.maashoh@yah...| | 333320| 6/7/2015 20:39| 6/8/2015 1:38| 96399|EOGFQPIZPYXFZ| Ads| Chrome| F|350311387.9| 0| apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...| +-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+ only showing top 2 rows
In [69]:
# We should convert number value column to Integer/float in order to do such operations
# , but even if variable is string, it will work fine
df = df.withColumn("purchase_value", col("purchase_value").cast("integer"))
above = df.filter(col("purchase_value") > 90000)
min1 = above.select(min("purchase_value")).collect()[0][0]
print(min1)
90023
Extract rows with multiple conditions using operators " | ". It filters data similar to OR operater in where caluse of SAS/SQL.¶
In [70]:
chrm_ie = df.filter((col("browser") == "Chrome") | (col("browser") == "IE"))
c1=chrm_ie.groupBy("browser").count()
c1.show()
+-------+-----+ |browser|count| +-------+-----+ | IE| 260| | Chrome| 397| +-------+-----+
.isin() : multiple "OR" conditions over a single column¶
In [71]:
chrm_ie = df[df.browser.isin(["Chrome", "IE"])] #The code here is exactly as same as that of pandas
c1=chrm_ie.groupBy("browser").count()
c1.show()
+-------+-----+ |browser|count| +-------+-----+ | IE| 260| | Chrome| 397| +-------+-----+
In [72]:
chrm_ie = df.filter((col("browser") == "Chrome") & (col("purchase_value") > 90000))
max1= chrm_ie.select(max("purchase_value")).collect()[0][0]
print(max1)
99743
use "between" to filter a numerical column between a range of values¶
In [73]:
btw1 = df[df['purchase_value'].between(80000, 90000)] #The code here is exactly as same as that of pandas
btw1.count()
Out[73]:
106
In [74]:
#Another way to write this
btw2 = df.filter(col("purchase_value").between(80000, 90000))
btw2.count()
Out[74]:
106
In [75]:
#One more way to write this
btw = df.where(df.purchase_value.between(80000, 90000))
btw.count()
Out[75]:
106
How to use "contains" condition in where clause¶
In [76]:
above = df.filter(col("source").rlike("SE"))
c1=above.groupBy("source").count()
c1.show()
+------+-----+ |source|count| +------+-----+ | SEO| 399| +------+-----+
Unlike pandas, in PySpark we can use Like %% operators, YAY!!!¶
In [77]:
data_subset = df.filter(df.source.like("%SE%"))
data_subset.count()
Out[77]:
399
Module 6 : Derive a new column(s)¶
SELECT *, purchase_value/82.52 as convt FROM tips;¶
In [78]:
df = df.withColumn("Value_in_USD", format_string("$%.2f",col("purchase_value")/82.52))
df.select("purchase_value","Value_in_USD").show(5)
# How to read $%.2f : $ will add a dollar sign
# % means value
# .2f makes it round to two decimal places
+--------------+------------+ |purchase_value|Value_in_USD| +--------------+------------+ | 65278| $791.06| | 96399| $1168.19| | 57296| $694.33| | 43650| $528.96| | 45016| $545.52| +--------------+------------+ only showing top 5 rows
Add counter : column with incremental values from 1 onwards with range(start, stop, step)¶
In [79]:
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window
start1 = 1 # starting value for increment_by_1 column
start2 = 0 # starting value for increment_by_5 column
window = Window.orderBy("user_id") # order by a unique column
df2 = df.withColumn("increment_by_1", row_number().over(window) + start1 - 1
).withColumn("increment_by_5", ((row_number().over(window) - 1) * 5) + start2)
df2.select("user_id","increment_by_1","increment_by_5").show(5)
+-------+--------------+--------------+ |user_id|increment_by_1|increment_by_5| +-------+--------------+--------------+ | 10055| 1| 0| | 10079| 2| 5| | 100886| 3| 10| | 100998| 4| 15| | 101564| 5| 20| +-------+--------------+--------------+ only showing top 5 rows
Module 6.1 :Text functions¶
use length function -- to finds the length of a character string¶
In [80]:
#from pyspark.sql.functions import length
#or better
from pyspark.sql.functions import *
df = df.withColumn('email_length',length('email'))
df.select(['email','email_length']).show(2, truncate = False)
+-------------------------+------------+ |email |email_length| +-------------------------+------------+ |aaron.maashoh@yahoo.com |23 | |rick_rothackerj@yahoo.com|25 | +-------------------------+------------+ only showing top 2 rows
How to change the case of a string¶
In [81]:
df = df.withColumn('email_all_caps',upper ('email'))
df = df.withColumn('email_all_lower',lower('email_all_caps'))
df = df.withColumn('email_propcase',initcap('email_all_caps'))
df.select(['email','email_all_caps','email_all_lower','email_propcase']).show(2, truncate = False)
+-------------------------+-------------------------+-------------------------+-------------------------+ |email |email_all_caps |email_all_lower |email_propcase | +-------------------------+-------------------------+-------------------------+-------------------------+ |aaron.maashoh@yahoo.com |AARON.MAASHOH@YAHOO.COM |aaron.maashoh@yahoo.com |Aaron.maashoh@yahoo.com | |rick_rothackerj@yahoo.com|RICK_ROTHACKERJ@YAHOO.COM|rick_rothackerj@yahoo.com|Rick_rothackerj@yahoo.com| +-------------------------+-------------------------+-------------------------+-------------------------+ only showing top 2 rows
instr and locate functions - finds the first occurenece position of a character in a column of strings¶
In [82]:
df = df.withColumn('Position_of_@',instr('email_propcase','@'))
df = df.withColumn('Position_of_a',instr('email_propcase','a'))
df = df.withColumn('Position_of_A_CAPS',instr('email_propcase','A'))
df.select(['email_propcase','Position_of_@','Position_of_a','Position_of_A_CAPS']).show(2,truncate = False)
+-------------------------+-------------+-------------+------------------+ |email_propcase |Position_of_@|Position_of_a|Position_of_A_CAPS| +-------------------------+-------------+-------------+------------------+ |Aaron.maashoh@yahoo.com |14 |2 |1 | |Rick_rothackerj@yahoo.com|16 |10 |0 | +-------------------------+-------------+-------------+------------------+ only showing top 2 rows
In [83]:
df = df.withColumn('Location_of_@',locate('@','email_propcase'))
df = df.withColumn('Location_of_a',locate('a','email_propcase'))
df = df.withColumn('Location_of_A_CAPS',locate('A','email_propcase'))
df.select(['email_propcase','Location_of_@','Location_of_a','Location_of_A_CAPS']).show(2,truncate = False)
+-------------------------+-------------+-------------+------------------+ |email_propcase |Location_of_@|Location_of_a|Location_of_A_CAPS| +-------------------------+-------------+-------------+------------------+ |Aaron.maashoh@yahoo.com |14 |2 |1 | |Rick_rothackerj@yahoo.com|16 |10 |0 | +-------------------------+-------------+-------------+------------------+ only showing top 2 rows
How to split a string on the basis of a delimiter, example portion in email before and after @¶
In [84]:
df = df.withColumn('email_part1',substring_index(df.email, '@', 1))
df = df.withColumn('email_part2',substring_index(df.email, '@', -1))
df.select(['email','email_part1','email_part2']).show(2,truncate = False)
+-------------------------+---------------+-----------+ |email |email_part1 |email_part2| +-------------------------+---------------+-----------+ |aaron.maashoh@yahoo.com |aaron.maashoh |yahoo.com | |rick_rothackerj@yahoo.com|rick_rothackerj|yahoo.com | +-------------------------+---------------+-----------+ only showing top 2 rows
Another way is using a split function¶
In [85]:
df.select('email',split(df.email,"@").getItem(0).alias("email_part1")
,split(df.email,"@").getItem(1).alias("email_part2")).show(5,truncate = False)
+--------------------------+---------------+-----------+ |email |email_part1 |email_part2| +--------------------------+---------------+-----------+ |aaron.maashoh@yahoo.com |aaron.maashoh |yahoo.com | |rick_rothackerj@yahoo.com |rick_rothackerj|yahoo.com | |harriet.mcleodd@gmail.com |harriet.mcleodd|gmail.com | |sinead.carews@gmail.com |sinead.carews |gmail.com | |laurence.frosty@tntech.edu|laurence.frosty|tntech.edu | +--------------------------+---------------+-----------+ only showing top 5 rows
let's now CONCATENATE the strings¶
In [86]:
df.select('email_part1','email_part2'
,concat(df.email_part1,lit("@"),df.email_part2).alias("FULL_EMAIL")).show(2,truncate = False)
+---------------+-----------+-------------------------+ |email_part1 |email_part2|FULL_EMAIL | +---------------+-----------+-------------------------+ |aaron.maashoh |yahoo.com |aaron.maashoh@yahoo.com | |rick_rothackerj|yahoo.com |rick_rothackerj@yahoo.com| +---------------+-----------+-------------------------+ only showing top 2 rows
In [87]:
df.select('email_part1','email_part2',concat_ws('@',df.email_part1,df.email_part2).alias("FULL_EMAIL")).show(2,truncate = False)
+---------------+-----------+-------------------------+ |email_part1 |email_part2|FULL_EMAIL | +---------------+-----------+-------------------------+ |aaron.maashoh |yahoo.com |aaron.maashoh@yahoo.com | |rick_rothackerj|yahoo.com |rick_rothackerj@yahoo.com| +---------------+-----------+-------------------------+ only showing top 2 rows
Using a substr function in Pyspark¶
In [88]:
df = df.withColumn('email_first_3',df.email.substr(1,3))
df.select(['email','email_first_3']).show(2,truncate = False)
+-------------------------+-------------+ |email |email_first_3| +-------------------------+-------------+ |aaron.maashoh@yahoo.com |aar | |rick_rothackerj@yahoo.com|ric | +-------------------------+-------------+ only showing top 2 rows
Another method to use these functions is with a 'select' statement, which is quite simple and SQL style¶
In [89]:
df1 = df.select('email',substring(df.email, 1, 3).alias('email_first_3'))
df1.show(5,truncate = False)
+--------------------------+-------------+ |email |email_first_3| +--------------------------+-------------+ |aaron.maashoh@yahoo.com |aar | |rick_rothackerj@yahoo.com |ric | |harriet.mcleodd@gmail.com |har | |sinead.carews@gmail.com |sin | |laurence.frosty@tntech.edu|lau | +--------------------------+-------------+ only showing top 5 rows
Translate function works same as it works in SAS¶
In [90]:
df1 = df.select('email',translate(df.email, 'a', 'X').alias('email_translated'))
df1.show(5,truncate = False)
+--------------------------+--------------------------+ |email |email_translated | +--------------------------+--------------------------+ |aaron.maashoh@yahoo.com |XXron.mXXshoh@yXhoo.com | |rick_rothackerj@yahoo.com |rick_rothXckerj@yXhoo.com | |harriet.mcleodd@gmail.com |hXrriet.mcleodd@gmXil.com | |sinead.carews@gmail.com |sineXd.cXrews@gmXil.com | |laurence.frosty@tntech.edu|lXurence.frosty@tntech.edu| +--------------------------+--------------------------+ only showing top 5 rows
Following is the SAS Tranwrd equivalent, but with regular expression it becomes much more versatile.¶
In [91]:
df1 = df.select('email',regexp_replace(df.email, ('@\w+\.\w+'), '@some_domain').alias('regex_replaced'))
df1.show(5,truncate = False)
+--------------------------+---------------------------+ |email |regex_replaced | +--------------------------+---------------------------+ |aaron.maashoh@yahoo.com |aaron.maashoh@some_domain | |rick_rothackerj@yahoo.com |rick_rothackerj@some_domain| |harriet.mcleodd@gmail.com |harriet.mcleodd@some_domain| |sinead.carews@gmail.com |sinead.carews@some_domain | |laurence.frosty@tntech.edu|laurence.frosty@some_domain| +--------------------------+---------------------------+ only showing top 5 rows
In [92]:
import pyspark.pandas as ps
# We can convert a pyspark data to Pandas dataframe, and vice versa as well
PYARROW_IGNORE_TIMEZONE = 0
result_pdf = df1.toPandas()
In [93]:
result_pdf.head(5)
Out[93]:
regex_replaced | ||
---|---|---|
0 | aaron.maashoh@yahoo.com | aaron.maashoh@some_domain |
1 | rick_rothackerj@yahoo.com | rick_rothackerj@some_domain |
2 | harriet.mcleodd@gmail.com | harriet.mcleodd@some_domain |
3 | sinead.carews@gmail.com | sinead.carews@some_domain |
4 | laurence.frosty@tntech.edu | laurence.frosty@some_domain |
Module 7 : Regular Expressions (popular by name "regex")¶
To tackle textual-data-related tasks such as web scraping, sentiment analysis, and string manipulation, regex is a crucial tool to learn¶
Regular Expression is a sequence of characters that make up a pattern to find, replace, and extract textual data.¶
The built-in library in Python for handling regular expression is "re"¶
In [94]:
text_1 = ['There was further spike in negative sentiments with the outbreak.']
In [95]:
print(text_1)
['There was further spike in negative sentiments with the outbreak.']
In [96]:
# Please install if not already installed
# !pip install python-docx
import docx
In [97]:
#Convert the Word file into a text file using a library like python-docx:
doc = docx.Document('input_datsets/word1.docx')
# extract the text from the Word file
text = '\n'.join([paragraph.text for paragraph in doc.paragraphs])
# write the text to a file
with open('my_text_file.txt', 'w') as f:
f.write(text)
In [98]:
#Read the text file into a PySpark DataFrame using spark.read.text().
df = spark.read.text('my_text_file.txt')
# show the contents of the DataFrame
df.show(truncate=False)
+-----------------------------------------------------------------+ |value | +-----------------------------------------------------------------+ |There was further spike in negative sentiments with the outbreak.| +-----------------------------------------------------------------+
In [99]:
# We have already explained "Regex" in detail in our Panda's blog.
# Regex can also be used in pyspark, basic remnaining same.
# We can use the functions such as regexp_extract, regex_replace (already explained in the Pandas blog)
In [100]:
import pandas as pd
data = {'Customer_id': ['cus_01', 'cus_02', 'cus_03', 'cus_04'],
'email': ['name.surname@gmail.com','anonymous123@yahoo.co.ukk','anonymous123@.com','a..@gmail.com']
}
pd_df = pd.DataFrame(data)
pd_df
Out[100]:
Customer_id | ||
---|---|---|
0 | cus_01 | name.surname@gmail.com |
1 | cus_02 | anonymous123@yahoo.co.ukk |
2 | cus_03 | anonymous123@.com |
3 | cus_04 | a..@gmail.com |
In [101]:
# let's convert the Pandas data to Pyspark
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_py = spark.createDataFrame(pd_df)
In [102]:
df_py.show(truncate= False)
+-----------+-------------------------+ |Customer_id|email | +-----------+-------------------------+ |cus_01 |name.surname@gmail.com | |cus_02 |anonymous123@yahoo.co.ukk| |cus_03 |anonymous123@.com | |cus_04 |a..@gmail.com | +-----------+-------------------------+
In [103]:
df_py.select('Customer_id', 'email'
, regexp_extract(df_py.email, '(\w+[._-]*\w*)', 1).alias('first_portion')).show(truncate= False)
+-----------+-------------------------+-------------+ |Customer_id|email |first_portion| +-----------+-------------------------+-------------+ |cus_01 |name.surname@gmail.com |name.surname | |cus_02 |anonymous123@yahoo.co.ukk|anonymous123 | |cus_03 |anonymous123@.com |anonymous123 | |cus_04 |a..@gmail.com |a.. | +-----------+-------------------------+-------------+
In [104]:
df_py.select('Customer_id', 'email'
, regexp_extract(df_py.email, '@(\w+)',1).alias('second_portion')).show(truncate= False)
+-----------+-------------------------+--------------+ |Customer_id|email |second_portion| +-----------+-------------------------+--------------+ |cus_01 |name.surname@gmail.com |gmail | |cus_02 |anonymous123@yahoo.co.ukk|yahoo | |cus_03 |anonymous123@.com | | |cus_04 |a..@gmail.com |gmail | +-----------+-------------------------+--------------+
Module 8 : Handling the Date Time¶
In [105]:
# from pyspark.sql.functions import current_date, current_timestamp
df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
df_dt = df.select( 'signup_time'
, 'purchase_time'
, current_date().alias("todays_date")
, current_timestamp().alias("whats_time_now")
)
In [106]:
df_dt.show(4,truncate= False)
+---------------+--------------+-----------+-----------------------+ |signup_time |purchase_time |todays_date|whats_time_now | +---------------+--------------+-----------+-----------------------+ |2/24/2015 22:55|4/18/2015 2:47|2023-04-28 |2023-04-28 08:45:39.276| |6/7/2015 20:39 |6/8/2015 1:38 |2023-04-28 |2023-04-28 08:45:39.276| |1/1/2015 18:52 |1/1/2015 18:52|2023-04-28 |2023-04-28 08:45:39.276| |4/28/2015 21:13|5/4/2015 13:54|2023-04-28 |2023-04-28 08:45:39.276| +---------------+--------------+-----------+-----------------------+ only showing top 4 rows
In [107]:
df_dt.dtypes
# Please notice the variable types below
Out[107]:
[('signup_time', 'string'), ('purchase_time', 'string'), ('todays_date', 'date'), ('whats_time_now', 'timestamp')]
get date, time, date elements, time elements from date¶
In [108]:
df_dt1 = df_dt.select('*', to_timestamp('signup_time','M/d/y H:m').alias('converted_date'))
df_dt1.show(2,truncate = False)
+---------------+--------------+-----------+-----------------------+-------------------+ |signup_time |purchase_time |todays_date|whats_time_now |converted_date | +---------------+--------------+-----------+-----------------------+-------------------+ |2/24/2015 22:55|4/18/2015 2:47|2023-04-28 |2023-04-28 08:45:39.408|2015-02-24 22:55:00| |6/7/2015 20:39 |6/8/2015 1:38 |2023-04-28 |2023-04-28 08:45:39.408|2015-06-07 20:39:00| +---------------+--------------+-----------+-----------------------+-------------------+ only showing top 2 rows
In [109]:
# from pyspark.sql.functions import *
df_dt1.select('converted_date'
, to_date('converted_date').alias('datepart')
, year('converted_date').alias('year')
, month('converted_date').alias('month')
, dayofmonth('converted_date').alias('day')
, dayofweek('converted_date').alias("dayofweek")
, hour('converted_date').alias('hour')
, minute('converted_date').alias('min')
, second('converted_date').alias('sec')
).show(2)
+-------------------+----------+----+-----+---+---------+----+---+---+ | converted_date| datepart|year|month|day|dayofweek|hour|min|sec| +-------------------+----------+----+-----+---+---------+----+---+---+ |2015-02-24 22:55:00|2015-02-24|2015| 2| 24| 3| 22| 55| 0| |2015-06-07 20:39:00|2015-06-07|2015| 6| 7| 1| 20| 39| 0| +-------------------+----------+----+-----+---+---------+----+---+---+ only showing top 2 rows
Start and end date of the time series data¶
In [110]:
conv_dt = df_dt1.agg(min("converted_date")
, max("converted_date")
, (max("converted_date") - min("converted_date")).alias("date_differece"))
conv_dt.show(truncate = False)
+-------------------+-------------------+-------------------------------------+ |min(converted_date)|max(converted_date)|date_differece | +-------------------+-------------------+-------------------------------------+ |2015-01-01 04:43:00|2015-08-18 02:36:00|INTERVAL '228 21:53:00' DAY TO SECOND| +-------------------+-------------------+-------------------------------------+
Check if year is leap year¶
In [111]:
def is_leap_year(year_col):
return when((year(year_col) % 4 == 0) & ((year(year_col) % 100 != 0) | (year(year_col) % 400 == 0)), True).otherwise(False)
leap_year = df_dt1.withColumn("is_leap_year", is_leap_year(("converted_date")))
leap_year.show(5, truncate = False)
+---------------+--------------+-----------+-----------------------+-------------------+------------+ |signup_time |purchase_time |todays_date|whats_time_now |converted_date |is_leap_year| +---------------+--------------+-----------+-----------------------+-------------------+------------+ |2/24/2015 22:55|4/18/2015 2:47|2023-04-28 |2023-04-28 08:45:39.916|2015-02-24 22:55:00|false | |6/7/2015 20:39 |6/8/2015 1:38 |2023-04-28 |2023-04-28 08:45:39.916|2015-06-07 20:39:00|false | |1/1/2015 18:52 |1/1/2015 18:52|2023-04-28 |2023-04-28 08:45:39.916|2015-01-01 18:52:00|false | |4/28/2015 21:13|5/4/2015 13:54|2023-04-28 |2023-04-28 08:45:39.916|2015-04-28 21:13:00|false | |7/21/2015 7:09 |9/9/2015 18:40|2023-04-28 |2023-04-28 08:45:39.916|2015-07-21 07:09:00|false | +---------------+--------------+-----------+-----------------------+-------------------+------------+ only showing top 5 rows
In [112]:
# Let's calculate AGE
df1 = df.withColumn("signup_date",to_date(to_timestamp("signup_time", "M/d/y H:m"), "yyyy-MM-dd"))
df1 = df1.withColumn("dob_date", to_date("dob", "M/d/y"))
df2 = df1.withColumn("age_at_signup_days", datediff("signup_date", "dob_date")/365.25)
df2.select("signup_time","signup_date", "dob", "dob_date", "age_at_signup_days").show(5, truncate = False)
+---------------+-----------+---------+----------+------------------+ |signup_time |signup_date|dob |dob_date |age_at_signup_days| +---------------+-----------+---------+----------+------------------+ |2/24/2015 22:55|2015-02-24 |2/22/1976|1976-02-22|39.006160164271044| |6/7/2015 20:39 |2015-06-07 |1/2/1962 |1962-01-02|53.426420260095824| |1/1/2015 18:52 |2015-01-01 |3/5/1962 |1962-03-05|52.82683093771389 | |4/28/2015 21:13|2015-04-28 |7/3/1974 |1974-07-03|40.818617385352496| |7/21/2015 7:09 |2015-07-21 |8/25/1970|1970-08-25|44.903490759753595| +---------------+-----------+---------+----------+------------------+ only showing top 5 rows
In [113]:
df_dt1
Out[113]:
DataFrame[signup_time: string, purchase_time: string, todays_date: date, whats_time_now: timestamp, converted_date: timestamp]
Subset data between two dates¶
In [114]:
subset = df.filter((col('dob') >= '1985-02-15') & (col('dob') <= '2010-02-28'))
subset.show(2)
+-------+---------------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+---------------+--------------------+ |user_id| signup_time| purchase_time|purchase_value| device_id|source|browser|sex| ip_address|class| category| dob| name| email| +-------+---------------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+---------------+--------------------+ | 22058|2/24/2015 22:55| 4/18/2015 2:47| 65278|QVPSPJUOCKZAR| SEO| Chrome| M|732758368.8| 0|home_essentials|2/22/1976|Aaron P Maashoh|aaron.maashoh@yah...| | 182338|1/25/2015 17:49|3/23/2015 23:05| 37926|NRFFPPHZYFUVC| Ads| IE| M|341674739.6| 0| apparels| 2/6/1984| Ra J Alperx| ra&alperx@gmail.com| +-------+---------------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+---------------+--------------------+ only showing top 2 rows
Start/End of previous quarter/month/year¶
In [115]:
import pandas as pd
from datetime import *
# create a list of dates
dates = [date(2022, 1, 1), date(2022, 1, 2), date(2022, 1, 3)]
# create a DataFrame with a date column
df = pd.DataFrame({'date_col': dates})
In [116]:
df
Out[116]:
date_col | |
---|---|
0 | 2022-01-01 |
1 | 2022-01-02 |
2 | 2022-01-03 |
In [117]:
df1 = spark.createDataFrame(df)
df1 = df1.withColumn('date', to_timestamp('date_col', 'yyyy-MM-dd'))
# print the PySpark DataFrame schema and contents
df1.printSchema()
df1.show()
root |-- date_col: date (nullable = true) |-- date: timestamp (nullable = true) +----------+-------------------+ | date_col| date| +----------+-------------------+ |2022-01-01|2022-01-01 00:00:00| |2022-01-02|2022-01-02 00:00:00| |2022-01-03|2022-01-03 00:00:00| +----------+-------------------+
We can move the date - intnx funcationality in SAS or Trunc in Oracle SQL?¶
In [118]:
# Let's derive:
# CMB >>>> Current month beginning
# CME >>>> Current month ending
# PMB >>>> Previous month beginning
# PME >>>> Previous month ending
# PYB >>>> Previous year beginning
# PYE >>>> Previous year ending
# PQB >>>> Previous quarter beginning
# PQE >>>> Previous quarter ending
# It is very much Oracle SQL style.
In [119]:
df_dt1.show(2)
+---------------+--------------+-----------+--------------------+-------------------+ | signup_time| purchase_time|todays_date| whats_time_now| converted_date| +---------------+--------------+-----------+--------------------+-------------------+ |2/24/2015 22:55|4/18/2015 2:47| 2023-04-28|2023-04-28 08:45:...|2015-02-24 22:55:00| | 6/7/2015 20:39| 6/8/2015 1:38| 2023-04-28|2023-04-28 08:45:...|2015-06-07 20:39:00| +---------------+--------------+-----------+--------------------+-------------------+ only showing top 2 rows
In [120]:
df_dt1.select('converted_date'
, trunc('converted_date','Month').alias('CMB')
, last_day('converted_date').alias('CME')
, trunc(add_months('converted_date',-1),'Month').alias('PMB')
, (trunc('converted_date','Month')-1).alias('PME')
, trunc((trunc('converted_date','year')-1),'year').alias('PYB')
, (trunc('converted_date','year')-1).alias('PYE')
, trunc((trunc('converted_date','quarter')-1),'quarter').alias('PQB')
, (trunc('converted_date','quarter')-1).alias('PQE')
).show(2)
+-------------------+----------+----------+----------+----------+----------+----------+----------+----------+ | converted_date| CMB| CME| PMB| PME| PYB| PYE| PQB| PQE| +-------------------+----------+----------+----------+----------+----------+----------+----------+----------+ |2015-02-24 22:55:00|2015-02-01|2015-02-28|2015-01-01|2015-01-31|2014-01-01|2014-12-31|2014-10-01|2014-12-31| |2015-06-07 20:39:00|2015-06-01|2015-06-30|2015-05-01|2015-05-31|2014-01-01|2014-12-31|2015-01-01|2015-03-31| +-------------------+----------+----------+----------+----------+----------+----------+----------+----------+ only showing top 2 rows
In [121]:
#Applying current date
df_dt1.select( trunc(current_date(),'Month').alias('CMB')
, last_day(current_date()).alias('CME')
, trunc(add_months(current_date(),-1),'Month').alias('PMB')
, (trunc(current_date(),'Month')-1).alias('PME')
, trunc((trunc(current_date(),'year')-1),'year').alias('PYB')
, (trunc(current_date(),'year')-1).alias('PYE')
, trunc((trunc(current_date(),'quarter')-1),'quarter').alias('PQB')
, (trunc(current_date(),'quarter')-1).alias('PQE')
).show(2)
+----------+----------+----------+----------+----------+----------+----------+----------+ | CMB| CME| PMB| PME| PYB| PYE| PQB| PQE| +----------+----------+----------+----------+----------+----------+----------+----------+ |2023-04-01|2023-04-30|2023-03-01|2023-03-31|2022-01-01|2022-12-31|2023-01-01|2023-03-31| |2023-04-01|2023-04-30|2023-03-01|2023-03-31|2022-01-01|2022-12-31|2023-01-01|2023-03-31| +----------+----------+----------+----------+----------+----------+----------+----------+ only showing top 2 rows
Let's convert a date to charcter and back to date¶
In [122]:
df_dt2 = df_dt1.select('converted_date'
, concat(year('converted_date') ,lit('-'), month('converted_date') ,lit('-'), dayofmonth('converted_date')
).alias('char_date'))
In [123]:
df_dt2.show(2)
+-------------------+---------+ | converted_date|char_date| +-------------------+---------+ |2015-02-24 22:55:00|2015-2-24| |2015-06-07 20:39:00| 2015-6-7| +-------------------+---------+ only showing top 2 rows
In [124]:
df_dt2.dtypes
Out[124]:
[('converted_date', 'timestamp'), ('char_date', 'string')]
In [125]:
df_dt2 = df_dt2.select('*'
,to_date('char_date').alias('back_to_date'))
In [126]:
df_dt2.show(2)
+-------------------+---------+------------+ | converted_date|char_date|back_to_date| +-------------------+---------+------------+ |2015-02-24 22:55:00|2015-2-24| 2015-02-24| |2015-06-07 20:39:00| 2015-6-7| 2015-06-07| +-------------------+---------+------------+ only showing top 2 rows
In [127]:
df_dt2.dtypes
Out[127]:
[('converted_date', 'timestamp'), ('char_date', 'string'), ('back_to_date', 'date')]
Module 9 : Arithmetic Operations on PySpark DataFrame¶
In [128]:
df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
In [129]:
sample_1 = df.select('user_id','browser','sex','class')
sample_1 = sample_1.withColumn('class_plus_10',col('class') + 10)
sample_1 = sample_1.withColumn('class_inverse',1 - col('class') )
sample_1 = sample_1.withColumn('class_multiplied_2',col('class') * 10)
sample_1 = sample_1.withColumn('class_divided',col('class') / 2)
sample_1 = sample_1.withColumn('mod_remainder',(col('class')+3) % 2)
sample_1 = sample_1.select('*',pow('class_plus_10',2).alias('powered'))
sample_1.show(5)
+-------+-------+---+-----+-------------+-------------+------------------+-------------+-------------+-------+ |user_id|browser|sex|class|class_plus_10|class_inverse|class_multiplied_2|class_divided|mod_remainder|powered| +-------+-------+---+-----+-------------+-------------+------------------+-------------+-------------+-------+ | 22058| Chrome| M| 0| 10.0| 1.0| 0.0| 0.0| 1.0| 100.0| | 333320| Chrome| F| 0| 10.0| 1.0| 0.0| 0.0| 1.0| 100.0| | 1359| Opera| M| 1| 11.0| 0.0| 10.0| 0.5| 0.0| 121.0| | 150084| Safari| M| 0| 10.0| 1.0| 0.0| 0.0| 1.0| 100.0| | 221365| Safari| M| 0| 10.0| 1.0| 0.0| 0.0| 1.0| 100.0| +-------+-------+---+-----+-------------+-------------+------------------+-------------+-------------+-------+ only showing top 5 rows
Module 10 : GroupBy() Method for PySpark DataFrame¶
In [130]:
df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
df = df.withColumn("class", col("class").cast("integer"))
In [131]:
sample_1 = df.select('user_id','browser','sex','class')
sample_1.show(5)
+-------+-------+---+-----+ |user_id|browser|sex|class| +-------+-------+---+-----+ | 22058| Chrome| M| 0| | 333320| Chrome| F| 0| | 1359| Opera| M| 1| | 150084| Safari| M| 0| | 221365| Safari| M| 0| +-------+-------+---+-----+ only showing top 5 rows
In [132]:
sample_1.dtypes
Out[132]:
[('user_id', 'string'), ('browser', 'string'), ('sex', 'string'), ('class', 'int')]
In [133]:
summary_data= df.groupBy('browser','sex').sum("class")
summary_data.show()
+-------+----+----------+ |browser| sex|sum(class)| +-------+----+----------+ | Opera| M| 2| | Safari| F| 4| |FireFox|null| 1| | Opera| F| 2| | Chrome| M| 23| |FireFox| M| 11| | Chrome| F| 11| | IE|null| 1| | Chrome|null| 1| | IE| F| 12| | IE| M| 17| |FireFox| F| 3| | Safari| M| 13| | Opera|null| 0| +-------+----+----------+
In [134]:
# here is a better way:
from pyspark.sql.functions import sum,avg,max,min,mean,count
summary_data = sample_1.groupBy('browser','sex').agg(sum('class').alias('sum_class') \
, mean('class').alias('avg_class') \
, min('class').alias('min_class') \
, max('class').alias('max_class') \
, count('class').alias('count_class')
, countDistinct('class').alias('count_distinct')
)
summary_data.show()
+-------+----+---------+--------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+--------------------+---------+---------+-----------+--------------+ | Opera| M| 2| 0.18181818181818182| 0| 1| 11| 2| | Safari| F| 4|0.058823529411764705| 0| 1| 68| 2| |FireFox|null| 1| 0.2| 0| 1| 5| 2| | Opera| F| 2| 0.25| 0| 1| 8| 2| | Chrome| M| 23| 0.10550458715596331| 0| 1| 218| 2| |FireFox| M| 11| 0.12359550561797752| 0| 1| 89| 2| | Chrome| F| 11| 0.0658682634730539| 0| 1| 167| 2| | IE|null| 1| 0.16666666666666666| 0| 1| 6| 2| | Chrome|null| 1| 0.08333333333333333| 0| 1| 12| 2| | IE| F| 12| 0.10909090909090909| 0| 1| 110| 2| | IE| M| 17| 0.11805555555555555| 0| 1| 144| 2| |FireFox| F| 3| 0.04918032786885246| 0| 1| 61| 2| | Safari| M| 13| 0.1326530612244898| 0| 1| 98| 2| | Opera|null| 0| 0.0| 0| 0| 2| 1| +-------+----+---------+--------------------+---------+---------+-----------+--------------+
Module 11: Another way to use where condition, first method was explained in Module 5¶
In [135]:
# Here we are learning filtering based on:
# > one criterion
# > AND/OR and NOT condition and filter based on multiple criteria
# > NULL, NOT NULL
# > in (list)
# > startwith and endwith
# > contains
# > like
# > rlike ( for regular expression)
# We can use both keywords >> filter() and where()
In [136]:
summary_data.filter("sex == 'M'").show()
+-------+---+---------+-------------------+---------+---------+-----------+--------------+ |browser|sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+---+---------+-------------------+---------+---------+-----------+--------------+ | Opera| M| 2|0.18181818181818182| 0| 1| 11| 2| | Chrome| M| 23|0.10550458715596331| 0| 1| 218| 2| |FireFox| M| 11|0.12359550561797752| 0| 1| 89| 2| | IE| M| 17|0.11805555555555555| 0| 1| 144| 2| | Safari| M| 13| 0.1326530612244898| 0| 1| 98| 2| +-------+---+---------+-------------------+---------+---------+-----------+--------------+
In [137]:
summary_data.filter((summary_data.sex == "M") & (summary_data.browser == "Chrome") ).show()
+-------+---+---------+-------------------+---------+---------+-----------+--------------+ |browser|sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+---+---------+-------------------+---------+---------+-----------+--------------+ | Chrome| M| 23|0.10550458715596331| 0| 1| 218| 2| +-------+---+---------+-------------------+---------+---------+-----------+--------------+
In [138]:
summary_data.filter((summary_data.sex == "M") | (summary_data.browser == "Chrome") ).show()
+-------+----+---------+-------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+-------------------+---------+---------+-----------+--------------+ | Opera| M| 2|0.18181818181818182| 0| 1| 11| 2| | Chrome| M| 23|0.10550458715596331| 0| 1| 218| 2| |FireFox| M| 11|0.12359550561797752| 0| 1| 89| 2| | Chrome| F| 11| 0.0658682634730539| 0| 1| 167| 2| | Chrome|null| 1|0.08333333333333333| 0| 1| 12| 2| | IE| M| 17|0.11805555555555555| 0| 1| 144| 2| | Safari| M| 13| 0.1326530612244898| 0| 1| 98| 2| +-------+----+---------+-------------------+---------+---------+-----------+--------------+
In [139]:
summary_data.filter((summary_data.sex != "M")).show()
+-------+---+---------+--------------------+---------+---------+-----------+--------------+ |browser|sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+---+---------+--------------------+---------+---------+-----------+--------------+ | Safari| F| 4|0.058823529411764705| 0| 1| 68| 2| | Opera| F| 2| 0.25| 0| 1| 8| 2| | Chrome| F| 11| 0.0658682634730539| 0| 1| 167| 2| | IE| F| 12| 0.10909090909090909| 0| 1| 110| 2| |FireFox| F| 3| 0.04918032786885246| 0| 1| 61| 2| +-------+---+---------+--------------------+---------+---------+-----------+--------------+
In [140]:
summary_data.filter( "sex is NULL AND browser is not NULL").show()
+-------+----+---------+-------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+-------------------+---------+---------+-----------+--------------+ |FireFox|null| 1| 0.2| 0| 1| 5| 2| | IE|null| 1|0.16666666666666666| 0| 1| 6| 2| | Chrome|null| 1|0.08333333333333333| 0| 1| 12| 2| | Opera|null| 0| 0.0| 0| 0| 2| 1| +-------+----+---------+-------------------+---------+---------+-----------+--------------+
In [141]:
summary_data.filter( "sex is NULL OR browser is not NULL").show()
+-------+----+---------+--------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+--------------------+---------+---------+-----------+--------------+ | Opera| M| 2| 0.18181818181818182| 0| 1| 11| 2| | Safari| F| 4|0.058823529411764705| 0| 1| 68| 2| |FireFox|null| 1| 0.2| 0| 1| 5| 2| | Opera| F| 2| 0.25| 0| 1| 8| 2| | Chrome| M| 23| 0.10550458715596331| 0| 1| 218| 2| |FireFox| M| 11| 0.12359550561797752| 0| 1| 89| 2| | Chrome| F| 11| 0.0658682634730539| 0| 1| 167| 2| | IE|null| 1| 0.16666666666666666| 0| 1| 6| 2| | Chrome|null| 1| 0.08333333333333333| 0| 1| 12| 2| | IE| F| 12| 0.10909090909090909| 0| 1| 110| 2| | IE| M| 17| 0.11805555555555555| 0| 1| 144| 2| |FireFox| F| 3| 0.04918032786885246| 0| 1| 61| 2| | Safari| M| 13| 0.1326530612244898| 0| 1| 98| 2| | Opera|null| 0| 0.0| 0| 0| 2| 1| +-------+----+---------+--------------------+---------+---------+-----------+--------------+
In [142]:
list_1 = ["Chrome","FireFox"]
summary_data.filter(summary_data.browser.isin(list_1)).show()
+-------+----+---------+-------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+-------------------+---------+---------+-----------+--------------+ |FireFox|null| 1| 0.2| 0| 1| 5| 2| | Chrome| M| 23|0.10550458715596331| 0| 1| 218| 2| |FireFox| M| 11|0.12359550561797752| 0| 1| 89| 2| | Chrome| F| 11| 0.0658682634730539| 0| 1| 167| 2| | Chrome|null| 1|0.08333333333333333| 0| 1| 12| 2| |FireFox| F| 3|0.04918032786885246| 0| 1| 61| 2| +-------+----+---------+-------------------+---------+---------+-----------+--------------+
In [143]:
list_1 = ["Chrome","FireFox"]
summary_data.filter(summary_data.browser.isin(list_1) == False).show()
+-------+----+---------+--------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+--------------------+---------+---------+-----------+--------------+ | Opera| M| 2| 0.18181818181818182| 0| 1| 11| 2| | Safari| F| 4|0.058823529411764705| 0| 1| 68| 2| | Opera| F| 2| 0.25| 0| 1| 8| 2| | IE|null| 1| 0.16666666666666666| 0| 1| 6| 2| | IE| F| 12| 0.10909090909090909| 0| 1| 110| 2| | IE| M| 17| 0.11805555555555555| 0| 1| 144| 2| | Safari| M| 13| 0.1326530612244898| 0| 1| 98| 2| | Opera|null| 0| 0.0| 0| 0| 2| 1| +-------+----+---------+--------------------+---------+---------+-----------+--------------+
In [144]:
list_1 = ["Chrome","FireFox"]
summary_data.filter(~summary_data.browser.isin(list_1)).show()
+-------+----+---------+--------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+--------------------+---------+---------+-----------+--------------+ | Opera| M| 2| 0.18181818181818182| 0| 1| 11| 2| | Safari| F| 4|0.058823529411764705| 0| 1| 68| 2| | Opera| F| 2| 0.25| 0| 1| 8| 2| | IE|null| 1| 0.16666666666666666| 0| 1| 6| 2| | IE| F| 12| 0.10909090909090909| 0| 1| 110| 2| | IE| M| 17| 0.11805555555555555| 0| 1| 144| 2| | Safari| M| 13| 0.1326530612244898| 0| 1| 98| 2| | Opera|null| 0| 0.0| 0| 0| 2| 1| +-------+----+---------+--------------------+---------+---------+-----------+--------------+
In [145]:
summary_data.filter(summary_data.browser.startswith("O")).show()
+-------+----+---------+-------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+-------------------+---------+---------+-----------+--------------+ | Opera| M| 2|0.18181818181818182| 0| 1| 11| 2| | Opera| F| 2| 0.25| 0| 1| 8| 2| | Opera|null| 0| 0.0| 0| 0| 2| 1| +-------+----+---------+-------------------+---------+---------+-----------+--------------+
In [146]:
summary_data.filter(summary_data.browser.endswith("e")).show()
+-------+----+---------+-------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+-------------------+---------+---------+-----------+--------------+ | Chrome| M| 23|0.10550458715596331| 0| 1| 218| 2| | Chrome| F| 11| 0.0658682634730539| 0| 1| 167| 2| | Chrome|null| 1|0.08333333333333333| 0| 1| 12| 2| +-------+----+---------+-------------------+---------+---------+-----------+--------------+
In [147]:
summary_data.filter(summary_data.browser.contains("hr")).show()
+-------+----+---------+-------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+-------------------+---------+---------+-----------+--------------+ | Chrome| M| 23|0.10550458715596331| 0| 1| 218| 2| | Chrome| F| 11| 0.0658682634730539| 0| 1| 167| 2| | Chrome|null| 1|0.08333333333333333| 0| 1| 12| 2| +-------+----+---------+-------------------+---------+---------+-----------+--------------+
In [148]:
summary_data.filter(summary_data.browser.like("%hr%")).show()
+-------+----+---------+-------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+-------------------+---------+---------+-----------+--------------+ | Chrome| M| 23|0.10550458715596331| 0| 1| 218| 2| | Chrome| F| 11| 0.0658682634730539| 0| 1| 167| 2| | Chrome|null| 1|0.08333333333333333| 0| 1| 12| 2| +-------+----+---------+-------------------+---------+---------+-----------+--------------+
In [149]:
summary_data.filter(summary_data.browser.rlike("\w+Fox")).show()
+-------+----+---------+-------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+-------------------+---------+---------+-----------+--------------+ |FireFox|null| 1| 0.2| 0| 1| 5| 2| |FireFox| M| 11|0.12359550561797752| 0| 1| 89| 2| |FireFox| F| 3|0.04918032786885246| 0| 1| 61| 2| +-------+----+---------+-------------------+---------+---------+-----------+--------------+
Module 12 : WHEN function in PySpark¶
In [150]:
df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
In [151]:
from pyspark.sql.functions import sum,avg,max,min,mean,count
summary_data = df.groupBy('browser','sex').agg(sum('class').alias('sum_class') \
, mean('class').alias('avg_class') \
, min('class').alias('min_class') \
, max('class').alias('max_class') \
, count('class').alias('count_class')
, countDistinct('class').alias('count_distinct')
)
summary_data.show(5)
+-------+----+---------+-------------------+---------+---------+-----------+--------------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct| +-------+----+---------+-------------------+---------+---------+-----------+--------------+ | Chrome|null| 1.0|0.08333333333333333| 0| 1| 12| 2| | Chrome| F| 11.0| 0.0658682634730539| 0| 1| 167| 2| | Chrome| M| 23.0|0.10550458715596331| 0| 1| 218| 2| |FireFox|null| 1.0| 0.2| 0| 1| 5| 2| |FireFox| F| 3.0|0.04918032786885246| 0| 1| 61| 2| +-------+----+---------+-------------------+---------+---------+-----------+--------------+ only showing top 5 rows
In [152]:
summary_data = summary_data.withColumn('gender', when(summary_data.sex == "M", "Male")
.when(summary_data.sex == "F", "Female")
.when(summary_data.sex.isNull(), "Other")
.otherwise("check") )
summary_data.show(5)
+-------+----+---------+-------------------+---------+---------+-----------+--------------+------+ |browser| sex|sum_class| avg_class|min_class|max_class|count_class|count_distinct|gender| +-------+----+---------+-------------------+---------+---------+-----------+--------------+------+ | Chrome|null| 1.0|0.08333333333333333| 0| 1| 12| 2| Other| | Chrome| F| 11.0| 0.0658682634730539| 0| 1| 167| 2|Female| | Chrome| M| 23.0|0.10550458715596331| 0| 1| 218| 2| Male| |FireFox|null| 1.0| 0.2| 0| 1| 5| 2| Other| |FireFox| F| 3.0|0.04918032786885246| 0| 1| 61| 2|Female| +-------+----+---------+-------------------+---------+---------+-----------+--------------+------+ only showing top 5 rows
Module 13 : rank(), dense_rank(), row_number(), percent_rank() in PySpark¶
In [153]:
# These three are quite pouplar functions in SQL, which are used to create rank/counter variables
In [154]:
data_rank= spark.read.option('header','true').csv('input_datsets/Rank_data.csv')
data_rank.show()
+------+---------+-----+ | Name| Subject|Marks| +------+---------+-----+ | Sam| Physics| 76| | Sam|Chemistry| 90| | Sam| Maths| 90| | Robby| Physics| 89| | Robby|Chemistry| 90| | Robby| Maths| 87| |Merlin| Physics| 84| |Merlin|Chemistry| 95| |Merlin| Maths| 90| +------+---------+-----+
In [155]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec = Window.partitionBy("Subject").orderBy(col("Marks").desc())
data_rank = data_rank.withColumn('Row_Number',row_number().over(windowSpec))
data_rank.show()
+------+---------+-----+----------+ | Name| Subject|Marks|Row_Number| +------+---------+-----+----------+ |Merlin|Chemistry| 95| 1| | Sam|Chemistry| 90| 2| | Robby|Chemistry| 90| 3| | Sam| Maths| 90| 1| |Merlin| Maths| 90| 2| | Robby| Maths| 87| 3| | Robby| Physics| 89| 1| |Merlin| Physics| 84| 2| | Sam| Physics| 76| 3| +------+---------+-----+----------+
In [156]:
data_rank = data_rank.withColumn('RANK',rank().over(windowSpec))
data_rank.show()
+------+---------+-----+----------+----+ | Name| Subject|Marks|Row_Number|RANK| +------+---------+-----+----------+----+ |Merlin|Chemistry| 95| 1| 1| | Sam|Chemistry| 90| 2| 2| | Robby|Chemistry| 90| 3| 2| | Sam| Maths| 90| 1| 1| |Merlin| Maths| 90| 2| 1| | Robby| Maths| 87| 3| 3| | Robby| Physics| 89| 1| 1| |Merlin| Physics| 84| 2| 2| | Sam| Physics| 76| 3| 3| +------+---------+-----+----------+----+
In [157]:
data_rank = data_rank.withColumn('DENSE_RANK',dense_rank().over(windowSpec))
data_rank.show()
+------+---------+-----+----------+----+----------+ | Name| Subject|Marks|Row_Number|RANK|DENSE_RANK| +------+---------+-----+----------+----+----------+ |Merlin|Chemistry| 95| 1| 1| 1| | Sam|Chemistry| 90| 2| 2| 2| | Robby|Chemistry| 90| 3| 2| 2| | Sam| Maths| 90| 1| 1| 1| |Merlin| Maths| 90| 2| 1| 1| | Robby| Maths| 87| 3| 3| 2| | Robby| Physics| 89| 1| 1| 1| |Merlin| Physics| 84| 2| 2| 2| | Sam| Physics| 76| 3| 3| 3| +------+---------+-----+----------+----+----------+
In [158]:
data_rank = data_rank.withColumn('PERCENT_RANK',percent_rank().over(windowSpec))
data_rank.show()
+------+---------+-----+----------+----+----------+------------+ | Name| Subject|Marks|Row_Number|RANK|DENSE_RANK|PERCENT_RANK| +------+---------+-----+----------+----+----------+------------+ |Merlin|Chemistry| 95| 1| 1| 1| 0.0| | Sam|Chemistry| 90| 2| 2| 2| 0.5| | Robby|Chemistry| 90| 3| 2| 2| 0.5| | Sam| Maths| 90| 1| 1| 1| 0.0| |Merlin| Maths| 90| 2| 1| 1| 0.0| | Robby| Maths| 87| 3| 3| 2| 1.0| | Robby| Physics| 89| 1| 1| 1| 0.0| |Merlin| Physics| 84| 2| 2| 2| 0.5| | Sam| Physics| 76| 3| 3| 3| 1.0| +------+---------+-----+----------+----+----------+------------+
In [159]:
data_rank = data_rank.withColumn('NTILE',ntile(2).over(windowSpec))
data_rank.show()
+------+---------+-----+----------+----+----------+------------+-----+ | Name| Subject|Marks|Row_Number|RANK|DENSE_RANK|PERCENT_RANK|NTILE| +------+---------+-----+----------+----+----------+------------+-----+ |Merlin|Chemistry| 95| 1| 1| 1| 0.0| 1| | Sam|Chemistry| 90| 2| 2| 2| 0.5| 1| | Robby|Chemistry| 90| 3| 2| 2| 0.5| 2| | Sam| Maths| 90| 1| 1| 1| 0.0| 1| |Merlin| Maths| 90| 2| 1| 1| 0.0| 1| | Robby| Maths| 87| 3| 3| 2| 1.0| 2| | Robby| Physics| 89| 1| 1| 1| 0.0| 1| |Merlin| Physics| 84| 2| 2| 2| 0.5| 1| | Sam| Physics| 76| 3| 3| 3| 1.0| 2| +------+---------+-----+----------+----+----------+------------+-----+
In [160]:
data_rank = data_rank.withColumn('CUME_DIST',cume_dist().over(windowSpec))
data_rank.show()
+------+---------+-----+----------+----+----------+------------+-----+------------------+ | Name| Subject|Marks|Row_Number|RANK|DENSE_RANK|PERCENT_RANK|NTILE| CUME_DIST| +------+---------+-----+----------+----+----------+------------+-----+------------------+ |Merlin|Chemistry| 95| 1| 1| 1| 0.0| 1|0.3333333333333333| | Sam|Chemistry| 90| 2| 2| 2| 0.5| 1| 1.0| | Robby|Chemistry| 90| 3| 2| 2| 0.5| 2| 1.0| | Sam| Maths| 90| 1| 1| 1| 0.0| 1|0.6666666666666666| |Merlin| Maths| 90| 2| 1| 1| 0.0| 1|0.6666666666666666| | Robby| Maths| 87| 3| 3| 2| 1.0| 2| 1.0| | Robby| Physics| 89| 1| 1| 1| 0.0| 1|0.3333333333333333| |Merlin| Physics| 84| 2| 2| 2| 0.5| 1|0.6666666666666666| | Sam| Physics| 76| 3| 3| 3| 1.0| 2| 1.0| +------+---------+-----+----------+----+----------+------------+-----+------------------+
Module 14 : Data management using PySpark¶
In [161]:
import pandas as pd
# create two data frames
df1 = pd.DataFrame({'id': ['1001', '1002', '1003'], 'name': ['abc', 'mno', 'opq']})
df2 = pd.DataFrame({'id': ['1001', '1002', '1004'], 'salary': [1100.25, 2200.50, 3300.20]})
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_py1 = spark.createDataFrame(df1)
df_py2 = spark.createDataFrame(df2)
In [162]:
df_py1.show()
+----+----+ | id|name| +----+----+ |1001| abc| |1002| mno| |1003| opq| +----+----+
In [163]:
df_py2.show()
+----+-------+ | id| salary| +----+-------+ |1001|1100.25| |1002| 2200.5| |1004| 3300.2| +----+-------+
In [164]:
inner_data = df_py1.join(df_py2,df_py1.id == df_py2.id,"inner")
inner_data.show()
+----+----+----+-------+ | id|name| id| salary| +----+----+----+-------+ |1001| abc|1001|1100.25| |1002| mno|1002| 2200.5| +----+----+----+-------+
In [165]:
left_data = df_py1.join(df_py2,df_py1.id == df_py2.id,"left")
left_data.show()
+----+----+----+-------+ | id|name| id| salary| +----+----+----+-------+ |1003| opq|null| null| |1002| mno|1002| 2200.5| |1001| abc|1001|1100.25| +----+----+----+-------+
In [166]:
left_data = df_py1.join(df_py2,df_py1.id == df_py2.id,"full")
left_data.show()
+----+----+----+-------+ | id|name| id| salary| +----+----+----+-------+ |1001| abc|1001|1100.25| |1002| mno|1002| 2200.5| |1003| opq|null| null| |null|null|1004| 3300.2| +----+----+----+-------+
In [167]:
#First dataframes should be registed to SQL
df_py1.createOrReplaceTempView("SQL_1")
df_py2.createOrReplaceTempView("SQL_2")
#Now data tables can be joined usng classical SQL queries:
joined_data1 = spark.sql("select coalesce(a.id, b.id) as ID \
, a.name \
, b.salary \
from SQL_1 a \
full join SQL_2 b \
on a.id == b.id" \
)
In [168]:
joined_data1.show()
+----+----+-------+ | ID|name| salary| +----+----+-------+ |1001| abc|1100.25| |1002| mno| 2200.5| |1003| opq| null| |1004|null| 3300.2| +----+----+-------+
In [169]:
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['1', '2', '3', '4']})
df2 = pd.DataFrame({'col1': [5, 6, 7], 'col3': [1005, 1006, 1007]})
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_py1 = spark.createDataFrame(df1)
df_py2 = spark.createDataFrame(df2)
In [170]:
df_py1.show()
+----+----+ |col1|col2| +----+----+ | 1| 1| | 2| 2| | 3| 3| | 4| 4| +----+----+
In [171]:
df_py2.show()
+----+----+ |col1|col3| +----+----+ | 5|1005| | 6|1006| | 7|1007| +----+----+
In [172]:
union_DF = df_py2.union(df_py1)
union_DF.show()
+----+----+ |col1|col3| +----+----+ | 5|1005| | 6|1006| | 7|1007| | 1| 1| | 2| 2| | 3| 3| | 4| 4| +----+----+
In [173]:
union_DF.dtypes
Out[173]:
[('col1', 'bigint'), ('col3', 'string')]
In [174]:
union_DF = df_py1.union(df_py2)
union_DF.show()
+----+----+ |col1|col2| +----+----+ | 1| 1| | 2| 2| | 3| 3| | 4| 4| | 5|1005| | 6|1006| | 7|1007| +----+----+
In [175]:
union_DF.dtypes
Out[175]:
[('col1', 'bigint'), ('col2', 'string')]
Module 15 : Let's revise - how to write SQL Qery in pyspark¶
In [176]:
df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
# First we need to register DataFrame as a SQL temporary view
sample_1 = df.select('user_id','browser','sex','class')
sample_1.createOrReplaceTempView("sample_sql")
# You can write query in multiple lines, using a \, which defines End of the Line
new_data = spark.sql(
"SELECT sex \
, class \
, class+10 as class_plus_10 \
, class * 2 as class_multiplied_2 \
, 1- class as class_inversed \
, class/2 as class_divided \
, pow(class+2, 2) as class_power \
, (class+3) % 2 as mod_class \
FROM sample_sql"
)
In [177]:
new_data.show(5)
+---+-----+-------------+------------------+--------------+-------------+-----------+---------+ |sex|class|class_plus_10|class_multiplied_2|class_inversed|class_divided|class_power|mod_class| +---+-----+-------------+------------------+--------------+-------------+-----------+---------+ | M| 0| 10.0| 0.0| 1.0| 0.0| 4.0| 1.0| | F| 0| 10.0| 0.0| 1.0| 0.0| 4.0| 1.0| | M| 1| 11.0| 2.0| 0.0| 0.5| 9.0| 0.0| | M| 0| 10.0| 0.0| 1.0| 0.0| 4.0| 1.0| | M| 0| 10.0| 0.0| 1.0| 0.0| 4.0| 1.0| +---+-----+-------------+------------------+--------------+-------------+-----------+---------+ only showing top 5 rows
In [178]:
new_data = spark.sql(
"SELECT sex \
, browser \
, sum(class) as class_sum \
FROM sample_sql \
where sex = 'M' \
group by sex , browser \
having class_sum > 13 \
"
)
new_data.show()
+---+-------+---------+ |sex|browser|class_sum| +---+-------+---------+ | M| IE| 17.0| | M| Chrome| 23.0| +---+-------+---------+
Module 16 : Miscellaneous¶
In [179]:
# Summarize data in SAS Proc Transpose or Pivot Table of Microsoft Excel
In [180]:
df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
sample_1 = df.select('user_id','browser','sex','class')
sample_1.show(5)
+-------+-------+---+-----+ |user_id|browser|sex|class| +-------+-------+---+-----+ | 22058| Chrome| M| 0| | 333320| Chrome| F| 0| | 1359| Opera| M| 1| | 150084| Safari| M| 0| | 221365| Safari| M| 0| +-------+-------+---+-----+ only showing top 5 rows
In [181]:
sample_1 = sample_1.withColumn("class", col("class").cast("integer"))
In [182]:
transpose = sample_1.groupBy("browser").pivot("sex").sum("class")
transpose.show()
+-------+----+---+---+ |browser|null| F| M| +-------+----+---+---+ |FireFox| 1| 3| 11| | Safari|null| 4| 13| | IE| 1| 12| 17| | Chrome| 1| 11| 23| | Opera| 0| 2| 2| +-------+----+---+---+
In [183]:
transpose = sample_1.groupBy("browser").pivot("sex",["F","M"]).sum("class")
transpose.show()
+-------+---+---+ |browser| F| M| +-------+---+---+ |FireFox| 3| 11| | Safari| 4| 13| | IE| 12| 17| | Chrome| 11| 23| | Opera| 2| 2| +-------+---+---+
In [184]:
transpose = sample_1[(df.browser == 'Chrome') | (df.browser == "Safari")].groupBy("browser").pivot("sex",["F","M"]).sum("class")
transpose.show()
+-------+---+---+ |browser| F| M| +-------+---+---+ | Safari| 4| 13| | Chrome| 11| 23| +-------+---+---+
In [185]:
transpose = sample_1.groupBy("browser").pivot("sex",["F","M"]).sum("class")
transpose.show()
#Sorting data in descending order of crowser and asceding order of F,
# in this case second one is redundant, but it is to demonstrated that we can use multiple columns at once
transpose = transpose.orderBy(transpose.browser.desc(), transpose.F)
transpose.show()
+-------+---+---+ |browser| F| M| +-------+---+---+ |FireFox| 3| 11| | Safari| 4| 13| | IE| 12| 17| | Chrome| 11| 23| | Opera| 2| 2| +-------+---+---+ +-------+---+---+ |browser| F| M| +-------+---+---+ | Safari| 4| 13| | Opera| 2| 2| | IE| 12| 17| |FireFox| 3| 11| | Chrome| 11| 23| +-------+---+---+
No comments:
Post a Comment
Do provide us your feedback, it would help us serve your better.