Pyspark PJ FINAL PySpark is a Python library for working with very large datasets in a distributed computing environment, this makes computing very fast.

Module 0: Before jumping to Pyspark, basics of Python as Pandas should be explored¶

Python basics, concepts of User-Defined Functions (UDF) and Lambda Functions , connection of to Python to Oracle, SQL Server, MySQL, and PostgreSQL can be explored from the following link:¶

http://www.askanalytics.in/p/python-application-part-1.html ¶

Module 1: Importing data in as PySpark - with read.csv and read_excel¶

In [1]:

#if not already installed, use below to install PySpark:
# !pip install pyspark

In [2]:

import pyspark.sql

In [3]:

import warnings
warnings.filterwarnings('ignore')

# to eanble warning
# import warnings
# warnings.filterwarnings('default')

In [4]:

# Following commands are optional, these command install necessary packages required to run PySpark in your system
# It might be possible that your system already have these packages installed.

In [5]:

#conda install -c conda-forge openjdk

In [6]:

#conda update -n base -c defaults conda

In [7]:

#conda install -c conda-forge findspark

In [8]:

#In the spark session, we can customize the RAM consumption
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName('test').config("spark.driver.memory", "16g").getOrCreate() 

In [9]:

#Lets start reading our first data 
df1 = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
df1.show(2)

+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
|user_id|    signup_time| purchase_time|purchase_value|    device_id|source|browser|sex| ip_address|class|       category|      dob|             name|               email|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
|  22058|2/24/2015 22:55|4/18/2015 2:47|         65278|QVPSPJUOCKZAR|   SEO| Chrome|  M|732758368.8|    0|home_essentials|2/22/1976|  Aaron P Maashoh|aaron.maashoh@yah...|
| 333320| 6/7/2015 20:39| 6/8/2015 1:38|         96399|EOGFQPIZPYXFZ|   Ads| Chrome|  F|350311387.9|    0|       apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
only showing top 2 rows

In [10]:

df1.dtypes  #Will give the columns' names and data type

Out[10]:

[('user_id', 'string'),
 ('signup_time', 'string'),
 ('purchase_time', 'string'),
 ('purchase_value', 'string'),
 ('device_id', 'string'),
 ('source', 'string'),
 ('browser', 'string'),
 ('sex', 'string'),
 ('ip_address', 'string'),
 ('class', 'string'),
 ('category', 'string'),
 ('dob', 'string'),
 ('name', 'string'),
 ('email', 'string')]

In [11]:

# Importing all functions form PySpark SQL at once
from pyspark.sql.functions import *

Suppose the data has few rows blank in the top of file, or with some narration, we can skip rows¶

In [12]:

# Step 1 - We read few records on the top to get column names
df2 = spark.read.format('csv').load('input_datsets/Sample_Fraud_Detection _with_comments.csv'
      ,header=False,inferSchema=True).withColumn("index",monotonically_increasing_id()
                                                ).filter((col("index") >=4) & (col("index") <=6)).drop("index")
df2.show()

+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
|    _c0|            _c1|           _c2|           _c3|          _c4|   _c5|    _c6|_c7|        _c8|  _c9|           _c10|     _c11|             _c12|                _c13|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
|user_id|    signup_time| purchase_time|purchase_value|    device_id|source|browser|sex| ip_address|class|       category|      dob|             name|               email|
|  22058|2/24/2015 22:55|4/18/2015 2:47|         65278|QVPSPJUOCKZAR|   SEO| Chrome|  M|732758368.8|    0|home_essentials|2/22/1976|  Aaron P Maashoh|aaron.maashoh@yah...|
| 333320| 6/7/2015 20:39| 6/8/2015 1:38|         96399|EOGFQPIZPYXFZ|   Ads| Chrome|  F|350311387.9|    0|       apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+

In [13]:

# Step 2 - Let's read the data where it begins at

df_main = spark.read.format('csv').load('input_datsets/Sample_Fraud_Detection _with_comments.csv'
         ,header=False,inferSchema=True).withColumn("index", monotonically_increasing_id()
                                                   ).filter(col("index") >= 5).drop("index")
df_main.show(2)

+------+---------------+--------------+-----+-------------+---+------+---+-----------+---+---------------+---------+-----------------+--------------------+
|   _c0|            _c1|           _c2|  _c3|          _c4|_c5|   _c6|_c7|        _c8|_c9|           _c10|     _c11|             _c12|                _c13|
+------+---------------+--------------+-----+-------------+---+------+---+-----------+---+---------------+---------+-----------------+--------------------+
| 22058|2/24/2015 22:55|4/18/2015 2:47|65278|QVPSPJUOCKZAR|SEO|Chrome|  M|732758368.8|  0|home_essentials|2/22/1976|  Aaron P Maashoh|aaron.maashoh@yah...|
|333320| 6/7/2015 20:39| 6/8/2015 1:38|96399|EOGFQPIZPYXFZ|Ads|Chrome|  F|350311387.9|  0|       apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...|
+------+---------------+--------------+-----+-------------+---+------+---+-----------+---+---------------+---------+-----------------+--------------------+
only showing top 2 rows

In [14]:

# Saving column names in a separate object
df3 = df2.head().asDict().values()
print(df3)

dict_values(['user_id', 'signup_time', 'purchase_time', 'purchase_value', 'device_id', 'source', 'browser', 'sex', 'ip_address', 'class', 'category', 'dob', 'name', 'email'])

In [15]:

my_list = list(df3)
print(my_list)

['user_id', 'signup_time', 'purchase_time', 'purchase_value', 'device_id', 'source', 'browser', 'sex', 'ip_address', 'class', 'category', 'dob', 'name', 'email']

In [16]:

# toDF can be used to rename variables, it requires a "list with new variable names" as input

df_final = df_main.toDF(*my_list)
df_final.show(2)

+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
|user_id|    signup_time| purchase_time|purchase_value|    device_id|source|browser|sex| ip_address|class|       category|      dob|             name|               email|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
|  22058|2/24/2015 22:55|4/18/2015 2:47|         65278|QVPSPJUOCKZAR|   SEO| Chrome|  M|732758368.8|    0|home_essentials|2/22/1976|  Aaron P Maashoh|aaron.maashoh@yah...|
| 333320| 6/7/2015 20:39| 6/8/2015 1:38|         96399|EOGFQPIZPYXFZ|   Ads| Chrome|  F|350311387.9|    0|       apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
only showing top 2 rows

In [17]:

# let's try modifying the list and then use it for renaming the variables

my_list[11] = 'DATE_OF_BIRTH'
print(my_list)

['user_id', 'signup_time', 'purchase_time', 'purchase_value', 'device_id', 'source', 'browser', 'sex', 'ip_address', 'class', 'category', 'DATE_OF_BIRTH', 'name', 'email']

In [18]:

df_final = df_final.toDF(*my_list)
df_final.show(2)

+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-------------+-----------------+--------------------+
|user_id|    signup_time| purchase_time|purchase_value|    device_id|source|browser|sex| ip_address|class|       category|DATE_OF_BIRTH|             name|               email|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-------------+-----------------+--------------------+
|  22058|2/24/2015 22:55|4/18/2015 2:47|         65278|QVPSPJUOCKZAR|   SEO| Chrome|  M|732758368.8|    0|home_essentials|    2/22/1976|  Aaron P Maashoh|aaron.maashoh@yah...|
| 333320| 6/7/2015 20:39| 6/8/2015 1:38|         96399|EOGFQPIZPYXFZ|   Ads| Chrome|  F|350311387.9|    0|       apparels|     1/2/1962|Rick D Rothackerj|rick_rothackerj@y...|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-------------+-----------------+--------------------+
only showing top 2 rows

In [19]:

#Let's delete unnecessary datasets, equivalent to Proc delete of SAS

del df_main , df2

In [20]:

# df_main.show() # please un-comment and run to check that objects have been dropped. This will thorw error!

Module 2: Let's do some basic check of data¶

In [21]:

# check top 3 rows
df1 = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
df1.head(3)

Out[21]:

[Row(user_id='22058', signup_time='2/24/2015 22:55', purchase_time='4/18/2015 2:47', purchase_value='65278', device_id='QVPSPJUOCKZAR', source='SEO', browser='Chrome', sex='M', ip_address='732758368.8', class='0', category='home_essentials', dob='2/22/1976', name='Aaron P Maashoh', email='aaron.maashoh@yahoo.com'),
 Row(user_id='333320', signup_time='6/7/2015 20:39', purchase_time='6/8/2015 1:38', purchase_value='96399', device_id='EOGFQPIZPYXFZ', source='Ads', browser='Chrome', sex='F', ip_address='350311387.9', class='0', category='apparels', dob='1/2/1962', name='Rick D Rothackerj', email='rick_rothackerj@yahoo.com'),
 Row(user_id='1359', signup_time='1/1/2015 18:52', purchase_time='1/1/2015 18:52', purchase_value='57296', device_id='YSSKYOSJHPPLJ', source='SEO', browser='Opera', sex='M', ip_address='2621473820', class='1', category='electronics', dob='3/5/1962', name='Harriet M McLeodd', email='harriet.mcleodd@gmail.com')]

In [22]:

# check bottom 4 rows
df1.tail(4)

Out[22]:

[Row(user_id='328579', signup_time='4/10/2015 1:02', purchase_time='6/25/2015 2:05', purchase_value='54775', device_id='UQCGEFBSOAVIJ', source='SEO', browser='Chrome', sex='M', ip_address='1038384255', class='0', category='electronics', dob='4/6/1994', name='Praveen L Menonw', email='praveen.menonw@gmail.com'),
 Row(user_id='263174', signup_time='7/14/2015 2:13', purchase_time='7/24/2015 1:25', purchase_value='84302', device_id='HVSGFVZSDBKEH', source='Ads', browser='Safari', sex='M', ip_address='4169079617', class='0', category='electronics', dob='9/1/1979', name='Ra T Alpern', email='ra.alpern@flossresearch.com'),
 Row(user_id='117191', signup_time='2/28/2015 17:14', purchase_time='4/14/2015 15:39', purchase_value='20193', device_id='GDHCTKIKPHENW', source='Direct', browser='IE', sex='M', ip_address='1411318028', class='0', category='apparels', dob='11/17/1979', name='Julia D Edwardsj', email='julia.edwardsj@sutletgroup.com'),
 Row(user_id='65732', signup_time='1/11/2015 11:10', purchase_time='1/22/2015 4:16', purchase_value='26225', device_id='VSMNAOFPSEQOL', source='Ads', browser='IE', sex='M', ip_address='3765208398', class='0', category='apparels', dob='3/23/1988', name='Conor V Humphriesk', email='conor.humphriesk@gmail.com')]

In [23]:

# Columns and data types
print(df1)

DataFrame[user_id: string, signup_time: string, purchase_time: string, purchase_value: string, device_id: string, source: string, browser: string, sex: string, ip_address: string, class: string, category: string, dob: string, name: string, email: string]

In [24]:

# Convert the "numeric value" column from string to integer/float
df1 = df1.withColumn("purchase_value", col("purchase_value").cast("float"))
df1 = df1.withColumn("class", col("class").cast("integer"))

In [25]:

print(df1)

DataFrame[user_id: string, signup_time: string, purchase_time: string, purchase_value: float, device_id: string, source: string, browser: string, sex: string, ip_address: string, class: int, category: string, dob: string, name: string, email: string]

In [26]:

mean_value = df1.select(mean("purchase_value").alias("avg_value"))
mean_value.show()

+-----------------+
|        avg_value|
+-----------------+
|50992.49149149149|
+-----------------+

In [27]:

# Basic summarization of a variable

mean_value = df1.select(mean("purchase_value")).collect()[0][0]
max1 = df1.select(max("purchase_value")).collect()[0][0]
min1 = df1.select(min("purchase_value")).collect()[0][0]
sum1 = df1.select(sum("purchase_value")).collect()[0][0]

print("Mean = " , mean_value)
print("Max = " , max1)
print("Min = " , min1)
print("Sum = " , sum1)

Mean =  50992.49149149149
Max =  99743.0
Min =  1176.0
Sum =  50941499.0

In [28]:

# Let's find the mean of all the numeric variables in a dataframe
nunumeric_cols = [c[0] for c in df1.dtypes if c[1] in ("double", "float", "int", "long")]
print (nunumeric_cols)

['purchase_value', 'class']

In [29]:

length = len(nunumeric_cols)
print(length)

In [30]:

for i in range(len(nunumeric_cols)):
    mean_value = df1.select(mean(nunumeric_cols[i])).collect()[0][0]
    max1 = df1.select(max(nunumeric_cols[i])).collect()[0][0]
    min1 = df1.select(min(nunumeric_cols[i])).collect()[0][0]
    sum1 = df1.select(sum(nunumeric_cols[i])).collect()[0][0]
   
    print('Mean Value of ', nunumeric_cols[i], 'is' ,mean_value)
    print("Max Value of ", nunumeric_cols[i], "is" ,max1)
    print("Min Value of " ,nunumeric_cols[i], "is" ,min1)
    print("Sum Value of ", nunumeric_cols[i] ,"is" ,sum1)

Mean Value of  purchase_value is 50992.49149149149
Max Value of  purchase_value is 99743.0
Min Value of  purchase_value is 1176.0
Sum Value of  purchase_value is 50941499.0
Mean Value of  class is 0.1011011011011011
Max Value of  class is 1
Min Value of  class is 0
Sum Value of  class is 101

In [31]:

# Use describe function to find the basis stats of a numeric variable
df1.select('purchase_value').describe().show()

+-------+------------------+
|summary|    purchase_value|
+-------+------------------+
|  count|               999|
|   mean| 50992.49149149149|
| stddev|28681.076653460594|
|    min|            1176.0|
|    max|           99743.0|
+-------+------------------+

In [32]:

# To check the basis stats of all numeric variables,
    # we can do it using for loop and iterating it over all numeric variables
    
for i in range(len(nunumeric_cols)):
    df1.select(nunumeric_cols[i]).describe().show()

+-------+------------------+
|summary|    purchase_value|
+-------+------------------+
|  count|               999|
|   mean| 50992.49149149149|
| stddev|28681.076653460594|
|    min|            1176.0|
|    max|           99743.0|
+-------+------------------+

+-------+-------------------+
|summary|              class|
+-------+-------------------+
|  count|                999|
|   mean| 0.1011011011011011|
| stddev|0.30161354453884753|
|    min|                  0|
|    max|                  1|
+-------+-------------------+

In [33]:

# To know the frequency distribution of a string/categorical vairable, we can use following groupBy function
# The output is similar to Proc Freq of SAS

df_freq = df1.groupBy('browser').agg(count('browser').alias('frequency'))
df_freq.show()

+-------+---------+
|browser|frequency|
+-------+---------+
|FireFox|      155|
| Safari|      166|
|     IE|      260|
| Chrome|      397|
|  Opera|       21|
+-------+---------+

In [34]:

# We can sort the output based on frequency, "desc" is an optional argument for sort data in descending way

df_freq = df1.groupBy('browser').agg(count('browser').alias('frequency')).orderBy(desc('frequency'))
df_freq.show()

+-------+---------+
|browser|frequency|
+-------+---------+
| Chrome|      397|
|     IE|      260|
| Safari|      166|
|FireFox|      155|
|  Opera|       21|
+-------+---------+

In [35]:

# Count gives number of records in a data
df1.count()

Out[35]:

In [36]:

# We can use following method to get the frequency and %age distribution 

df_freq = df1.groupBy('browser').agg(count('browser').alias('frequency'),(count('browser')
                                       / df1.count() *100).alias('percentage')).orderBy(desc('frequency'))
df_freq.show()

+-------+---------+------------------+
|browser|frequency|        percentage|
+-------+---------+------------------+
| Chrome|      397| 39.73973973973974|
|     IE|      260|26.026026026026027|
| Safari|      166|16.616616616616618|
|FireFox|      155|15.515515515515515|
|  Opera|       21|2.1021021021021022|
+-------+---------+------------------+

In [37]:

df_freq_1 = df_freq.select('browser','frequency', bround('percentage',1).alias('percent'))
df_freq_1.show()

+-------+---------+-------+
|browser|frequency|percent|
+-------+---------+-------+
| Chrome|      397|   39.7|
|     IE|      260|   26.0|
| Safari|      166|   16.6|
|FireFox|      155|   15.5|
|  Opera|       21|    2.1|
+-------+---------+-------+

In [38]:

# let's iterate the above method for all char /categorical variables

categorical_cols = [c[0] for c in df1.dtypes if c[1] in ('category', 'object' , 'string')]
print(categorical_cols)

['user_id', 'signup_time', 'purchase_time', 'device_id', 'source', 'browser', 'sex', 'ip_address', 'category', 'dob', 'name', 'email']

In [39]:

# Let's delete few columns from the list, that need not be analyzed 
to_delete = ['user_id','device_id','signup_time','purchase_time','name','dob','email','ip_address']

for x in to_delete:
        if x in categorical_cols:
                categorical_cols.remove(x)

print(categorical_cols)

['source', 'browser', 'sex', 'category']

In [40]:

for col in categorical_cols:
    df_freq = df1.groupBy(col).agg(
        count(col).alias('frequency'),
        (count(col)/df1.count() * 100).alias('percentage')
    ).orderBy(col)
    df_freq.show()

+------+---------+-----------------+
|source|frequency|       percentage|
+------+---------+-----------------+
|   Ads|      408|40.84084084084084|
|Direct|      192|19.21921921921922|
|   SEO|      399|39.93993993993994|
+------+---------+-----------------+

+-------+---------+------------------+
|browser|frequency|        percentage|
+-------+---------+------------------+
| Chrome|      397| 39.73973973973974|
|FireFox|      155|15.515515515515515|
|     IE|      260|26.026026026026027|
|  Opera|       21|2.1021021021021022|
| Safari|      166|16.616616616616618|
+-------+---------+------------------+

+----+---------+-----------------+
| sex|frequency|       percentage|
+----+---------+-----------------+
|null|        0|              0.0|
|   F|      414|41.44144144144144|
|   M|      560|56.05605605605606|
+----+---------+-----------------+

+---------------+---------+-----------------+
|       category|frequency|       percentage|
+---------------+---------+-----------------+
|       apparels|      204|20.42042042042042|
|      cosmetics|      210|21.02102102102102|
|    electronics|      204|20.42042042042042|
|    health_care|      187|18.71871871871872|
|home_essentials|      194|19.41941941941942|
+---------------+---------+-----------------+

In [41]:

# Find number of missing/NULL values in a field

null_counts = df1.agg(sum(df1.sex.isNull().cast("int")).alias('missing_sex'))
null_counts.show()

+-----------+
|missing_sex|
+-----------+
|         25|
+-----------+

In [42]:

#Use following codes to identify how many NULL values are there in each field
# This is equivalent to df.isna().sum() of Pandas

from pyspark.sql.functions import col, sum
null_counts = df1.agg(*[sum(col(c).isNull().cast("int")).alias(c) for c in df1.columns])
null_counts.show(vertical = True)

-RECORD 0-------------
 user_id        | 0   
 signup_time    | 0   
 purchase_time  | 0   
 purchase_value | 0   
 device_id      | 0   
 source         | 0   
 browser        | 0   
 sex            | 25  
 ip_address     | 0   
 class          | 0   
 category       | 0   
 dob            | 29  
 name           | 0   
 email          | 0

In [43]:

# to find distinct values of a single column
distinct_values = df1.select("browser").distinct().collect()
print(distinct_values)

[Row(browser='FireFox'), Row(browser='Safari'), Row(browser='IE'), Row(browser='Chrome'), Row(browser='Opera')]

In [44]:

# To find count of unique value of each column
unique_values_count = df1.agg(countDistinct('signup_time').alias('unique_count'))
unique_values_count.show()

+------------+
|unique_count|
+------------+
|         995|
+------------+

In [45]:

distinct_counts = df1.select(*[countDistinct(c).alias(c) for c in df1.columns])
distinct_counts.show(vertical = True)

-RECORD 0-------------
 user_id        | 999 
 signup_time    | 995 
 purchase_time  | 998 
 purchase_value | 994 
 device_id      | 994 
 source         | 3   
 browser        | 5   
 sex            | 2   
 ip_address     | 995 
 class          | 2   
 category       | 5   
 dob            | 917 
 name           | 999 
 email          | 998

In [46]:

# Sometimes with a very large data, we dont need a precise result, rather a "best BUT quick guess"
# for that we can use approx_count_distinct, which is pretty fast in processing

distinct_counts = df1.select(*[approx_count_distinct(c).alias(c) for c in df1.columns])
distinct_counts.show(vertical = True)

-RECORD 0--------------
 user_id        | 1073 
 signup_time    | 1098 
 purchase_time  | 1124 
 purchase_value | 925  
 device_id      | 890  
 source         | 3    
 browser        | 5    
 sex            | 2    
 ip_address     | 983  
 class          | 2    
 category       | 5    
 dob            | 950  
 name           | 1012 
 email          | 1073

Following is equivalent to proc Freq functionality in SAS¶

In [47]:

value_counts = df1.groupBy("source","browser").count()
value_counts.show()

+------+-------+-----+
|source|browser|count|
+------+-------+-----+
|   Ads|  Opera|    9|
|   SEO|  Opera|    9|
|   Ads| Safari|   75|
|Direct|     IE|   48|
|   Ads| Chrome|  151|
|   SEO| Safari|   58|
|Direct|FireFox|   27|
|   SEO| Chrome|  165|
|Direct| Chrome|   81|
|Direct|  Opera|    3|
|   Ads|     IE|  109|
|   SEO|     IE|  103|
|   Ads|FireFox|   64|
|   SEO|FireFox|   64|
|Direct| Safari|   33|
+------+-------+-----+

In [48]:

# we can use count distinct as well
unique_values_count3 = df1.groupBy("source","browser").agg(countDistinct('signup_time').alias('unique_count'))
unique_values_count3.show()

+------+-------+------------+
|source|browser|unique_count|
+------+-------+------------+
|   Ads|  Opera|           9|
|   SEO|  Opera|           9|
|   Ads| Safari|          73|
|Direct|     IE|          47|
|   Ads| Chrome|         151|
|   SEO| Safari|          58|
|Direct|FireFox|          27|
|   SEO| Chrome|         165|
|Direct| Chrome|          81|
|Direct|  Opera|           3|
|   Ads|     IE|         109|
|   SEO|     IE|         102|
|   Ads|FireFox|          64|
|   SEO|FireFox|          64|
|Direct| Safari|          33|
+------+-------+------------+

Module 3: Useful fuctions to get basic information about the data¶

.printSchema()- will also give information about missing values¶

In [49]:

df1.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- signup_time: string (nullable = true)
 |-- purchase_time: string (nullable = true)
 |-- purchase_value: float (nullable = true)
 |-- device_id: string (nullable = true)
 |-- source: string (nullable = true)
 |-- browser: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- class: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- name: string (nullable = true)
 |-- email: string (nullable = true)

In [50]:

#To get all the columns of data frame
df1.dtypes

Out[50]:

[('user_id', 'string'),
 ('signup_time', 'string'),
 ('purchase_time', 'string'),
 ('purchase_value', 'float'),
 ('device_id', 'string'),
 ('source', 'string'),
 ('browser', 'string'),
 ('sex', 'string'),
 ('ip_address', 'string'),
 ('class', 'int'),
 ('category', 'string'),
 ('dob', 'string'),
 ('name', 'string'),
 ('email', 'string')]

In [51]:

 # gives numbers of records
num_rows = df1.count()
print("Number of rows: {}".format(num_rows))

Number of rows: 999

In [52]:

# Gives numbers of records and number of fields in data

num_rows = df1.count()
num_cols = len(df1.columns)
print("Shape of DataFrame: ({}, {})".format(num_rows, num_cols))

Shape of DataFrame: (999, 14)

Module 4 : Subsetting data Vertically and Horizontally (selecting subset of a DataFrame)¶

In [53]:

# Selecting only few columns from a DataFrame

df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
sample = df.select("name")
sample.show(2)

+-----------------+
|             name|
+-----------------+
|  Aaron P Maashoh|
|Rick D Rothackerj|
+-----------------+
only showing top 2 rows

In [54]:

sample_1 = df.select("name","dob","sex")
sample_1.show(2)

+-----------------+---------+---+
|             name|      dob|sex|
+-----------------+---------+---+
|  Aaron P Maashoh|2/22/1976|  M|
|Rick D Rothackerj| 1/2/1962|  F|
+-----------------+---------+---+
only showing top 2 rows

In [55]:

# We can create a monotonic - row id (_N_ equivalent of SAS) following way

sample_2 = sample_1.withColumn("index", monotonically_increasing_id())
sample_2.show(5)

+-----------------+---------+---+-----+
|             name|      dob|sex|index|
+-----------------+---------+---+-----+
|  Aaron P Maashoh|2/22/1976|  M|    0|
|Rick D Rothackerj| 1/2/1962|  F|    1|
|Harriet M McLeodd| 3/5/1962|  M|    2|
|  Sinead M Carews| 7/3/1974|  M|    3|
|Laurence H Frosty|8/25/1970|  M|    4|
+-----------------+---------+---+-----+
only showing top 5 rows

In [56]:

# Selecting few rows from DataFrame

sample_2=sample_1.withColumn("index", monotonically_increasing_id()).filter(col("index") >=995).drop("index")
sample_2.show()

+------------------+----------+---+
|              name|       dob|sex|
+------------------+----------+---+
|  Praveen L Menonw|  4/6/1994|  M|
|       Ra T Alpern|  9/1/1979|  M|
|  Julia D Edwardsj|11/17/1979|  M|
|Conor V Humphriesk| 3/23/1988|  M|
+------------------+----------+---+

In [57]:

# Selecting range of rows from DataFrame - 3rd to 10the record

sample_2=sample_1.withColumn("index", monotonically_increasing_id()
                            ).filter((col("index") >=3) & (col("index") <=10) )
sample_2.show()

+--------------------+----------+---+-----+
|                name|       dob|sex|index|
+--------------------+----------+---+-----+
|     Sinead M Carews|  7/3/1974|  M|    3|
|   Laurence H Frosty| 8/25/1970|  M|    4|
|        Lee Z Chyenz|  4/8/1997|  M|    5|
|Dhanya J Skariachano|  5/1/1996|  F|    6|
|   Lisa E Baertleinu|11/24/1981|  M|    7|
|   Andreas B Cremerf| 1/22/1972|  F|    8|
|         Ra J Alperx|  2/6/1984|  M|    9|
|     Matt U Falloonz| 6/17/1980|  F|   10|
+--------------------+----------+---+-----+

How to Reaname the Variables¶

In [58]:

# Method 1 - Basic method
method_1 = df.withColumnRenamed('sex','gender') \
          .withColumnRenamed('dob','birth_date') \
          .withColumnRenamed('email','email_id')
method_1.show(2,vertical = False, truncate = False )

+-------+---------------+--------------+--------------+-------------+------+-------+------+-----------+-----+---------------+----------+-----------------+-------------------------+
|user_id|signup_time    |purchase_time |purchase_value|device_id    |source|browser|gender|ip_address |class|category       |birth_date|name             |email_id                 |
+-------+---------------+--------------+--------------+-------------+------+-------+------+-----------+-----+---------------+----------+-----------------+-------------------------+
|22058  |2/24/2015 22:55|4/18/2015 2:47|65278         |QVPSPJUOCKZAR|SEO   |Chrome |M     |732758368.8|0    |home_essentials|2/22/1976 |Aaron P Maashoh  |aaron.maashoh@yahoo.com  |
|333320 |6/7/2015 20:39 |6/8/2015 1:38 |96399         |EOGFQPIZPYXFZ|Ads   |Chrome |F     |350311387.9|0    |apparels       |1/2/1962  |Rick D Rothackerj|rick_rothackerj@yahoo.com|
+-------+---------------+--------------+--------------+-------------+------+-------+------+-----------+-----+---------------+----------+-----------------+-------------------------+
only showing top 2 rows

In [59]:

# Method 2 - Little complex
df.createOrReplaceTempView("temp")
method_2 = spark.sql("SELECT * FROM temp").withColumnRenamed("Name", "Name_full").select("Name_full", "dob", "sex")
# Show the DataFrame with index
method_2.show(2)

+-----------------+---------+---+
|        Name_full|      dob|sex|
+-----------------+---------+---+
|  Aaron P Maashoh|2/22/1976|  M|
|Rick D Rothackerj| 1/2/1962|  F|
+-----------------+---------+---+
only showing top 2 rows

In [60]:

# Method 3 - Simpler for legacy SQL users

df.createOrReplaceTempView("temp")
# Or we can simply do renaming within SQL query
method_3 = spark.sql("SELECT name as Name_full, dob, sex FROM temp")
# Show the DataFrame with index
method_3.show(2)

+-----------------+---------+---+
|        Name_full|      dob|sex|
+-----------------+---------+---+
|  Aaron P Maashoh|2/22/1976|  M|
|Rick D Rothackerj| 1/2/1962|  F|
+-----------------+---------+---+
only showing top 2 rows

In [61]:

# We have used it above as well, this is the method to create the row index - similar to _N_ of SAS
df_index = df.withColumn("index", monotonically_increasing_id())
df_index.show(2)

+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+-----+
|user_id|    signup_time| purchase_time|purchase_value|    device_id|source|browser|sex| ip_address|class|       category|      dob|             name|               email|index|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+-----+
|  22058|2/24/2015 22:55|4/18/2015 2:47|         65278|QVPSPJUOCKZAR|   SEO| Chrome|  M|732758368.8|    0|home_essentials|2/22/1976|  Aaron P Maashoh|aaron.maashoh@yah...|    0|
| 333320| 6/7/2015 20:39| 6/8/2015 1:38|         96399|EOGFQPIZPYXFZ|   Ads| Chrome|  F|350311387.9|    0|       apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...|    1|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+-----+
only showing top 2 rows

In [62]:

# Select columns based on position i.e. Index

df2 = df[df.columns[1:5]]
df2.show(2)

+---------------+--------------+--------------+-------------+
|    signup_time| purchase_time|purchase_value|    device_id|
+---------------+--------------+--------------+-------------+
|2/24/2015 22:55|4/18/2015 2:47|         65278|QVPSPJUOCKZAR|
| 6/7/2015 20:39| 6/8/2015 1:38|         96399|EOGFQPIZPYXFZ|
+---------------+--------------+--------------+-------------+
only showing top 2 rows

In [63]:

# Select a set of Columns between Labels(column-names), in following example we need columns from user_id to dob

df3 = df.select(df.columns[df.columns.index('signup_time'):df.columns.index('device_id') + 1])
# show the resulting DataFrame
df3.show(2)

+---------------+--------------+--------------+-------------+
|    signup_time| purchase_time|purchase_value|    device_id|
+---------------+--------------+--------------+-------------+
|2/24/2015 22:55|4/18/2015 2:47|         65278|QVPSPJUOCKZAR|
| 6/7/2015 20:39| 6/8/2015 1:38|         96399|EOGFQPIZPYXFZ|
+---------------+--------------+--------------+-------------+
only showing top 2 rows

In [64]:

# Add a unique ID column to the DataFrame
df_with_id = df.withColumn("id", monotonically_increasing_id())

# Filter out the rows with few selected IDs
df_skipped_rows = df_with_id.filter(df_with_id.id.isin([0, 2, 4, 7]))
df_skipped_rows.show(2)

+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+---+
|user_id|    signup_time| purchase_time|purchase_value|    device_id|source|browser|sex| ip_address|class|       category|      dob|             name|               email| id|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+---+
|  22058|2/24/2015 22:55|4/18/2015 2:47|         65278|QVPSPJUOCKZAR|   SEO| Chrome|  M|732758368.8|    0|home_essentials|2/22/1976|  Aaron P Maashoh|aaron.maashoh@yah...|  0|
|   1359| 1/1/2015 18:52|1/1/2015 18:52|         57296|YSSKYOSJHPPLJ|   SEO|  Opera|  M| 2621473820|    1|    electronics| 3/5/1962|Harriet M McLeodd|harriet.mcleodd@g...|  2|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+---+
only showing top 2 rows

How to drop variables?¶

In [65]:

s1 = df.drop("dob")
s1.show(2)

+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+--------------------+
|user_id|    signup_time| purchase_time|purchase_value|    device_id|source|browser|sex| ip_address|class|       category|             name|               email|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+--------------------+
|  22058|2/24/2015 22:55|4/18/2015 2:47|         65278|QVPSPJUOCKZAR|   SEO| Chrome|  M|732758368.8|    0|home_essentials|  Aaron P Maashoh|aaron.maashoh@yah...|
| 333320| 6/7/2015 20:39| 6/8/2015 1:38|         96399|EOGFQPIZPYXFZ|   Ads| Chrome|  F|350311387.9|    0|       apparels|Rick D Rothackerj|rick_rothackerj@y...|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+--------------------+
only showing top 2 rows

In [66]:

s2 = df.drop("dob","email","purchase_time")
s2.show(2)

+-------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+
|user_id|    signup_time|purchase_value|    device_id|source|browser|sex| ip_address|class|       category|             name|
+-------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+
|  22058|2/24/2015 22:55|         65278|QVPSPJUOCKZAR|   SEO| Chrome|  M|732758368.8|    0|home_essentials|  Aaron P Maashoh|
| 333320| 6/7/2015 20:39|         96399|EOGFQPIZPYXFZ|   Ads| Chrome|  F|350311387.9|    0|       apparels|Rick D Rothackerj|
+-------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+-----------------+
only showing top 2 rows

In [67]:

# Remove all columns between column index 1 to 4
s3 = df.drop(*df.columns[1:5])
s3.show(2)

+-------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
|user_id|source|browser|sex| ip_address|class|       category|      dob|             name|               email|
+-------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
|  22058|   SEO| Chrome|  M|732758368.8|    0|home_essentials|2/22/1976|  Aaron P Maashoh|aaron.maashoh@yah...|
| 333320|   Ads| Chrome|  F|350311387.9|    0|       apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...|
+-------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
only showing top 2 rows

Module 5: Conditional filter and LOGICAL operators similar to where clause in SAS/SQL¶

Extract data with a single condition¶

In [68]:

df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
df.show(2)

+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
|user_id|    signup_time| purchase_time|purchase_value|    device_id|source|browser|sex| ip_address|class|       category|      dob|             name|               email|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
|  22058|2/24/2015 22:55|4/18/2015 2:47|         65278|QVPSPJUOCKZAR|   SEO| Chrome|  M|732758368.8|    0|home_essentials|2/22/1976|  Aaron P Maashoh|aaron.maashoh@yah...|
| 333320| 6/7/2015 20:39| 6/8/2015 1:38|         96399|EOGFQPIZPYXFZ|   Ads| Chrome|  F|350311387.9|    0|       apparels| 1/2/1962|Rick D Rothackerj|rick_rothackerj@y...|
+-------+---------------+--------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+-----------------+--------------------+
only showing top 2 rows

In [69]:

# We should convert number value column to Integer/float in order to do such operations
# , but even if variable is string, it will work fine

df = df.withColumn("purchase_value", col("purchase_value").cast("integer"))
above = df.filter(col("purchase_value") > 90000)
min1 = above.select(min("purchase_value")).collect()[0][0]
print(min1)

Extract rows with multiple conditions using operators " | ". It filters data similar to OR operater in where caluse of SAS/SQL.¶

In [70]:

chrm_ie = df.filter((col("browser") == "Chrome") | (col("browser") == "IE"))
c1=chrm_ie.groupBy("browser").count()
c1.show()

+-------+-----+
|browser|count|
+-------+-----+
|     IE|  260|
| Chrome|  397|
+-------+-----+

.isin() : multiple "OR" conditions over a single column¶

In [71]:

chrm_ie = df[df.browser.isin(["Chrome", "IE"])]       #The code here is exactly as same as that of pandas
c1=chrm_ie.groupBy("browser").count()
c1.show()

+-------+-----+
|browser|count|
+-------+-----+
|     IE|  260|
| Chrome|  397|
+-------+-----+

In [72]:

chrm_ie = df.filter((col("browser") == "Chrome") & (col("purchase_value") > 90000))
max1= chrm_ie.select(max("purchase_value")).collect()[0][0]
print(max1)

use "between" to filter a numerical column between a range of values¶

In [73]:

btw1 = df[df['purchase_value'].between(80000, 90000)]       #The code here is exactly as same as that of pandas
btw1.count()

Out[73]:

In [74]:

#Another way to write this
btw2 = df.filter(col("purchase_value").between(80000, 90000))
btw2.count()

Out[74]:

In [75]:

#One more way to write this
btw = df.where(df.purchase_value.between(80000, 90000))
btw.count()

Out[75]:

How to use "contains" condition in where clause¶

In [76]:

above = df.filter(col("source").rlike("SE"))
c1=above.groupBy("source").count()
c1.show()

+------+-----+
|source|count|
+------+-----+
|   SEO|  399|
+------+-----+

Unlike pandas, in PySpark we can use Like %% operators, YAY!!!¶

In [77]:

data_subset = df.filter(df.source.like("%SE%")) 
data_subset.count()

Out[77]:

Module 6 : Derive a new column(s)¶

SELECT *, purchase_value/82.52 as convt FROM tips;¶

In [78]:

df = df.withColumn("Value_in_USD", format_string("$%.2f",col("purchase_value")/82.52))
df.select("purchase_value","Value_in_USD").show(5)

# How to read $%.2f : $ will add a dollar sign
#                     % means value 
#                    .2f makes it round to two decimal places

+--------------+------------+
|purchase_value|Value_in_USD|
+--------------+------------+
|         65278|     $791.06|
|         96399|    $1168.19|
|         57296|     $694.33|
|         43650|     $528.96|
|         45016|     $545.52|
+--------------+------------+
only showing top 5 rows

Add counter : column with incremental values from 1 onwards with range(start, stop, step)¶

In [79]:

from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window

start1 = 1  # starting value for increment_by_1 column
start2 = 0  # starting value for increment_by_5 column

window = Window.orderBy("user_id")  # order by a unique column

df2 = df.withColumn("increment_by_1", row_number().over(window) + start1 - 1
                   ).withColumn("increment_by_5", ((row_number().over(window) - 1) * 5) + start2)
df2.select("user_id","increment_by_1","increment_by_5").show(5)

+-------+--------------+--------------+
|user_id|increment_by_1|increment_by_5|
+-------+--------------+--------------+
|  10055|             1|             0|
|  10079|             2|             5|
| 100886|             3|            10|
| 100998|             4|            15|
| 101564|             5|            20|
+-------+--------------+--------------+
only showing top 5 rows

Module 6.1 :Text functions¶

use length function -- to finds the length of a character string¶

In [80]:

#from pyspark.sql.functions import length
#or better 
from pyspark.sql.functions import *
df = df.withColumn('email_length',length('email'))
df.select(['email','email_length']).show(2, truncate = False)

+-------------------------+------------+
|email                    |email_length|
+-------------------------+------------+
|aaron.maashoh@yahoo.com  |23          |
|rick_rothackerj@yahoo.com|25          |
+-------------------------+------------+
only showing top 2 rows

How to change the case of a string¶

In [81]:

df = df.withColumn('email_all_caps',upper ('email'))
df = df.withColumn('email_all_lower',lower('email_all_caps'))
df = df.withColumn('email_propcase',initcap('email_all_caps'))

df.select(['email','email_all_caps','email_all_lower','email_propcase']).show(2, truncate = False)

+-------------------------+-------------------------+-------------------------+-------------------------+
|email                    |email_all_caps           |email_all_lower          |email_propcase           |
+-------------------------+-------------------------+-------------------------+-------------------------+
|aaron.maashoh@yahoo.com  |AARON.MAASHOH@YAHOO.COM  |aaron.maashoh@yahoo.com  |Aaron.maashoh@yahoo.com  |
|rick_rothackerj@yahoo.com|RICK_ROTHACKERJ@YAHOO.COM|rick_rothackerj@yahoo.com|Rick_rothackerj@yahoo.com|
+-------------------------+-------------------------+-------------------------+-------------------------+
only showing top 2 rows

instr and locate functions - finds the first occurenece position of a character in a column of strings¶

In [82]:

df = df.withColumn('Position_of_@',instr('email_propcase','@'))
df = df.withColumn('Position_of_a',instr('email_propcase','a'))
df = df.withColumn('Position_of_A_CAPS',instr('email_propcase','A'))

df.select(['email_propcase','Position_of_@','Position_of_a','Position_of_A_CAPS']).show(2,truncate = False)

+-------------------------+-------------+-------------+------------------+
|email_propcase           |Position_of_@|Position_of_a|Position_of_A_CAPS|
+-------------------------+-------------+-------------+------------------+
|Aaron.maashoh@yahoo.com  |14           |2            |1                 |
|Rick_rothackerj@yahoo.com|16           |10           |0                 |
+-------------------------+-------------+-------------+------------------+
only showing top 2 rows

In [83]:

df = df.withColumn('Location_of_@',locate('@','email_propcase'))
df = df.withColumn('Location_of_a',locate('a','email_propcase'))
df = df.withColumn('Location_of_A_CAPS',locate('A','email_propcase'))
df.select(['email_propcase','Location_of_@','Location_of_a','Location_of_A_CAPS']).show(2,truncate = False)

+-------------------------+-------------+-------------+------------------+
|email_propcase           |Location_of_@|Location_of_a|Location_of_A_CAPS|
+-------------------------+-------------+-------------+------------------+
|Aaron.maashoh@yahoo.com  |14           |2            |1                 |
|Rick_rothackerj@yahoo.com|16           |10           |0                 |
+-------------------------+-------------+-------------+------------------+
only showing top 2 rows

How to split a string on the basis of a delimiter, example portion in email before and after @¶

In [84]:

df = df.withColumn('email_part1',substring_index(df.email, '@', 1))
df = df.withColumn('email_part2',substring_index(df.email, '@', -1))
df.select(['email','email_part1','email_part2']).show(2,truncate = False)

+-------------------------+---------------+-----------+
|email                    |email_part1    |email_part2|
+-------------------------+---------------+-----------+
|aaron.maashoh@yahoo.com  |aaron.maashoh  |yahoo.com  |
|rick_rothackerj@yahoo.com|rick_rothackerj|yahoo.com  |
+-------------------------+---------------+-----------+
only showing top 2 rows

Another way is using a split function¶

In [85]:

df.select('email',split(df.email,"@").getItem(0).alias("email_part1")
                 ,split(df.email,"@").getItem(1).alias("email_part2")).show(5,truncate = False)

+--------------------------+---------------+-----------+
|email                     |email_part1    |email_part2|
+--------------------------+---------------+-----------+
|aaron.maashoh@yahoo.com   |aaron.maashoh  |yahoo.com  |
|rick_rothackerj@yahoo.com |rick_rothackerj|yahoo.com  |
|harriet.mcleodd@gmail.com |harriet.mcleodd|gmail.com  |
|sinead.carews@gmail.com   |sinead.carews  |gmail.com  |
|laurence.frosty@tntech.edu|laurence.frosty|tntech.edu |
+--------------------------+---------------+-----------+
only showing top 5 rows

let's now CONCATENATE the strings¶

In [86]:

df.select('email_part1','email_part2'
           ,concat(df.email_part1,lit("@"),df.email_part2).alias("FULL_EMAIL")).show(2,truncate = False)

+---------------+-----------+-------------------------+
|email_part1    |email_part2|FULL_EMAIL               |
+---------------+-----------+-------------------------+
|aaron.maashoh  |yahoo.com  |aaron.maashoh@yahoo.com  |
|rick_rothackerj|yahoo.com  |rick_rothackerj@yahoo.com|
+---------------+-----------+-------------------------+
only showing top 2 rows

In the above example, we also used function "lit", which is used to create a column with one static value in a column, lit is abbreviation of literal.¶

Following method is close of "CATX" function in SAS¶

In [87]:

df.select('email_part1','email_part2',concat_ws('@',df.email_part1,df.email_part2).alias("FULL_EMAIL")).show(2,truncate = False)

+---------------+-----------+-------------------------+
|email_part1    |email_part2|FULL_EMAIL               |
+---------------+-----------+-------------------------+
|aaron.maashoh  |yahoo.com  |aaron.maashoh@yahoo.com  |
|rick_rothackerj|yahoo.com  |rick_rothackerj@yahoo.com|
+---------------+-----------+-------------------------+
only showing top 2 rows

Using a substr function in Pyspark¶

In [88]:

df = df.withColumn('email_first_3',df.email.substr(1,3))
df.select(['email','email_first_3']).show(2,truncate = False)

+-------------------------+-------------+
|email                    |email_first_3|
+-------------------------+-------------+
|aaron.maashoh@yahoo.com  |aar          |
|rick_rothackerj@yahoo.com|ric          |
+-------------------------+-------------+
only showing top 2 rows

Another method to use these functions is with a 'select' statement, which is quite simple and SQL style¶

In [89]:

df1 = df.select('email',substring(df.email, 1, 3).alias('email_first_3'))
df1.show(5,truncate = False)

+--------------------------+-------------+
|email                     |email_first_3|
+--------------------------+-------------+
|aaron.maashoh@yahoo.com   |aar          |
|rick_rothackerj@yahoo.com |ric          |
|harriet.mcleodd@gmail.com |har          |
|sinead.carews@gmail.com   |sin          |
|laurence.frosty@tntech.edu|lau          |
+--------------------------+-------------+
only showing top 5 rows

Translate function works same as it works in SAS¶

In [90]:

df1 = df.select('email',translate(df.email, 'a', 'X').alias('email_translated'))
df1.show(5,truncate = False)

+--------------------------+--------------------------+
|email                     |email_translated          |
+--------------------------+--------------------------+
|aaron.maashoh@yahoo.com   |XXron.mXXshoh@yXhoo.com   |
|rick_rothackerj@yahoo.com |rick_rothXckerj@yXhoo.com |
|harriet.mcleodd@gmail.com |hXrriet.mcleodd@gmXil.com |
|sinead.carews@gmail.com   |sineXd.cXrews@gmXil.com   |
|laurence.frosty@tntech.edu|lXurence.frosty@tntech.edu|
+--------------------------+--------------------------+
only showing top 5 rows

Following is the SAS Tranwrd equivalent, but with regular expression it becomes much more versatile.¶

In [91]:

df1 = df.select('email',regexp_replace(df.email, ('@\w+\.\w+'), '@some_domain').alias('regex_replaced'))
df1.show(5,truncate = False)

+--------------------------+---------------------------+
|email                     |regex_replaced             |
+--------------------------+---------------------------+
|aaron.maashoh@yahoo.com   |aaron.maashoh@some_domain  |
|rick_rothackerj@yahoo.com |rick_rothackerj@some_domain|
|harriet.mcleodd@gmail.com |harriet.mcleodd@some_domain|
|sinead.carews@gmail.com   |sinead.carews@some_domain  |
|laurence.frosty@tntech.edu|laurence.frosty@some_domain|
+--------------------------+---------------------------+
only showing top 5 rows

In [92]:

import pyspark.pandas as ps
# We can convert a pyspark data to Pandas dataframe, and vice versa as well
PYARROW_IGNORE_TIMEZONE = 0
result_pdf = df1.toPandas()

In [93]:

result_pdf.head(5)

Out[93]:

	email	regex_replaced
0	aaron.maashoh@yahoo.com	aaron.maashoh@some_domain
1	rick_rothackerj@yahoo.com	rick_rothackerj@some_domain
2	harriet.mcleodd@gmail.com	harriet.mcleodd@some_domain
3	sinead.carews@gmail.com	sinead.carews@some_domain
4	laurence.frosty@tntech.edu	laurence.frosty@some_domain

Module 7 : Regular Expressions (popular by name "regex")¶

Regular Expression is a sequence of characters that make up a pattern to find, replace, and extract textual data.¶

The built-in library in Python for handling regular expression is "re"¶

Below Code is used to read a word file line by line into pyspark data frame

In [94]:

text_1 = ['There was further spike in negative sentiments with the outbreak.']

In [95]:

print(text_1)

['There was further spike in negative sentiments with the outbreak.']

In [96]:

# Please install if not already installed
# !pip install python-docx
import docx

In [97]:

#Convert the Word file into a text file using a library like python-docx:

doc = docx.Document('input_datsets/word1.docx')
# extract the text from the Word file
text = '\n'.join([paragraph.text for paragraph in doc.paragraphs])

# write the text to a file
with open('my_text_file.txt', 'w') as f:
    f.write(text)

In [98]:

#Read the text file into a PySpark DataFrame using spark.read.text(). 

df = spark.read.text('my_text_file.txt')

# show the contents of the DataFrame
df.show(truncate=False)

+-----------------------------------------------------------------+
|value                                                            |
+-----------------------------------------------------------------+
|There was further spike in negative sentiments with the outbreak.|
+-----------------------------------------------------------------+

In [99]:

# We have already explained "Regex" in detail in our Panda's blog. 
# Regex can also be used in pyspark, basic remnaining same.
# We can use the functions such as regexp_extract, regex_replace (already explained in the Pandas blog)

In [100]:

import pandas as pd
data = {'Customer_id': ['cus_01', 'cus_02', 'cus_03', 'cus_04'], 
        'email': ['name.surname@gmail.com','anonymous123@yahoo.co.ukk','anonymous123@.com','a..@gmail.com']
        } 
pd_df = pd.DataFrame(data)
pd_df

Out[100]:

	Customer_id	email
0	cus_01	name.surname@gmail.com
1	cus_02	anonymous123@yahoo.co.ukk
2	cus_03	anonymous123@.com
3	cus_04	a..@gmail.com

In [101]:

# let's convert the Pandas data to Pyspark

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_py = spark.createDataFrame(pd_df)

In [102]:

df_py.show(truncate= False)

+-----------+-------------------------+
|Customer_id|email                    |
+-----------+-------------------------+
|cus_01     |name.surname@gmail.com   |
|cus_02     |anonymous123@yahoo.co.ukk|
|cus_03     |anonymous123@.com        |
|cus_04     |a..@gmail.com            |
+-----------+-------------------------+

In [103]:

df_py.select('Customer_id', 'email'
              , regexp_extract(df_py.email, '(\w+[._-]*\w*)', 1).alias('first_portion')).show(truncate= False)

+-----------+-------------------------+-------------+
|Customer_id|email                    |first_portion|
+-----------+-------------------------+-------------+
|cus_01     |name.surname@gmail.com   |name.surname |
|cus_02     |anonymous123@yahoo.co.ukk|anonymous123 |
|cus_03     |anonymous123@.com        |anonymous123 |
|cus_04     |a..@gmail.com            |a..          |
+-----------+-------------------------+-------------+

In [104]:

df_py.select('Customer_id', 'email'
             , regexp_extract(df_py.email, '@(\w+)',1).alias('second_portion')).show(truncate= False)

+-----------+-------------------------+--------------+
|Customer_id|email                    |second_portion|
+-----------+-------------------------+--------------+
|cus_01     |name.surname@gmail.com   |gmail         |
|cus_02     |anonymous123@yahoo.co.ukk|yahoo         |
|cus_03     |anonymous123@.com        |              |
|cus_04     |a..@gmail.com            |gmail         |
+-----------+-------------------------+--------------+

Module 8 : Handling the Date Time¶

In [105]:

# from pyspark.sql.functions import current_date, current_timestamp 
df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
df_dt = df.select(  'signup_time'
                   , 'purchase_time'
                  , current_date().alias("todays_date")
                  , current_timestamp().alias("whats_time_now")
                 )

In [106]:

df_dt.show(4,truncate= False)

+---------------+--------------+-----------+-----------------------+
|signup_time    |purchase_time |todays_date|whats_time_now         |
+---------------+--------------+-----------+-----------------------+
|2/24/2015 22:55|4/18/2015 2:47|2023-04-28 |2023-04-28 08:45:39.276|
|6/7/2015 20:39 |6/8/2015 1:38 |2023-04-28 |2023-04-28 08:45:39.276|
|1/1/2015 18:52 |1/1/2015 18:52|2023-04-28 |2023-04-28 08:45:39.276|
|4/28/2015 21:13|5/4/2015 13:54|2023-04-28 |2023-04-28 08:45:39.276|
+---------------+--------------+-----------+-----------------------+
only showing top 4 rows

In [107]:

df_dt.dtypes
# Please notice the variable types below

Out[107]:

[('signup_time', 'string'),
 ('purchase_time', 'string'),
 ('todays_date', 'date'),
 ('whats_time_now', 'timestamp')]

Useful link for datetime, https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

get date, time, date elements, time elements from date¶

In [108]:

df_dt1 = df_dt.select('*', to_timestamp('signup_time','M/d/y H:m').alias('converted_date'))
df_dt1.show(2,truncate = False)

+---------------+--------------+-----------+-----------------------+-------------------+
|signup_time    |purchase_time |todays_date|whats_time_now         |converted_date     |
+---------------+--------------+-----------+-----------------------+-------------------+
|2/24/2015 22:55|4/18/2015 2:47|2023-04-28 |2023-04-28 08:45:39.408|2015-02-24 22:55:00|
|6/7/2015 20:39 |6/8/2015 1:38 |2023-04-28 |2023-04-28 08:45:39.408|2015-06-07 20:39:00|
+---------------+--------------+-----------+-----------------------+-------------------+
only showing top 2 rows

In [109]:

# from pyspark.sql.functions import *
df_dt1.select('converted_date'
              , to_date('converted_date').alias('datepart') 
              , year('converted_date').alias('year') 
              , month('converted_date').alias('month') 
              , dayofmonth('converted_date').alias('day') 
              , dayofweek('converted_date').alias("dayofweek")
              , hour('converted_date').alias('hour') 
              , minute('converted_date').alias('min') 
              , second('converted_date').alias('sec') 
             ).show(2)

+-------------------+----------+----+-----+---+---------+----+---+---+
|     converted_date|  datepart|year|month|day|dayofweek|hour|min|sec|
+-------------------+----------+----+-----+---+---------+----+---+---+
|2015-02-24 22:55:00|2015-02-24|2015|    2| 24|        3|  22| 55|  0|
|2015-06-07 20:39:00|2015-06-07|2015|    6|  7|        1|  20| 39|  0|
+-------------------+----------+----+-----+---+---------+----+---+---+
only showing top 2 rows

Start and end date of the time series data¶

In [110]:

conv_dt = df_dt1.agg(min("converted_date")
                   , max("converted_date")
                   , (max("converted_date") - min("converted_date")).alias("date_differece"))
conv_dt.show(truncate = False)

+-------------------+-------------------+-------------------------------------+
|min(converted_date)|max(converted_date)|date_differece                       |
+-------------------+-------------------+-------------------------------------+
|2015-01-01 04:43:00|2015-08-18 02:36:00|INTERVAL '228 21:53:00' DAY TO SECOND|
+-------------------+-------------------+-------------------------------------+

Check if year is leap year¶

In [111]:

def is_leap_year(year_col):
    return when((year(year_col) % 4 == 0) & ((year(year_col) % 100 != 0) | (year(year_col) % 400 == 0)), True).otherwise(False)

leap_year = df_dt1.withColumn("is_leap_year", is_leap_year(("converted_date")))
leap_year.show(5, truncate = False)

+---------------+--------------+-----------+-----------------------+-------------------+------------+
|signup_time    |purchase_time |todays_date|whats_time_now         |converted_date     |is_leap_year|
+---------------+--------------+-----------+-----------------------+-------------------+------------+
|2/24/2015 22:55|4/18/2015 2:47|2023-04-28 |2023-04-28 08:45:39.916|2015-02-24 22:55:00|false       |
|6/7/2015 20:39 |6/8/2015 1:38 |2023-04-28 |2023-04-28 08:45:39.916|2015-06-07 20:39:00|false       |
|1/1/2015 18:52 |1/1/2015 18:52|2023-04-28 |2023-04-28 08:45:39.916|2015-01-01 18:52:00|false       |
|4/28/2015 21:13|5/4/2015 13:54|2023-04-28 |2023-04-28 08:45:39.916|2015-04-28 21:13:00|false       |
|7/21/2015 7:09 |9/9/2015 18:40|2023-04-28 |2023-04-28 08:45:39.916|2015-07-21 07:09:00|false       |
+---------------+--------------+-----------+-----------------------+-------------------+------------+
only showing top 5 rows

In [112]:

# Let's calculate AGE 

df1 = df.withColumn("signup_date",to_date(to_timestamp("signup_time", "M/d/y H:m"), "yyyy-MM-dd"))
df1 = df1.withColumn("dob_date", to_date("dob", "M/d/y"))
df2 = df1.withColumn("age_at_signup_days", datediff("signup_date", "dob_date")/365.25)

df2.select("signup_time","signup_date", "dob", "dob_date", "age_at_signup_days").show(5, truncate = False)

+---------------+-----------+---------+----------+------------------+
|signup_time    |signup_date|dob      |dob_date  |age_at_signup_days|
+---------------+-----------+---------+----------+------------------+
|2/24/2015 22:55|2015-02-24 |2/22/1976|1976-02-22|39.006160164271044|
|6/7/2015 20:39 |2015-06-07 |1/2/1962 |1962-01-02|53.426420260095824|
|1/1/2015 18:52 |2015-01-01 |3/5/1962 |1962-03-05|52.82683093771389 |
|4/28/2015 21:13|2015-04-28 |7/3/1974 |1974-07-03|40.818617385352496|
|7/21/2015 7:09 |2015-07-21 |8/25/1970|1970-08-25|44.903490759753595|
+---------------+-----------+---------+----------+------------------+
only showing top 5 rows

In [113]:

df_dt1

Out[113]:

DataFrame[signup_time: string, purchase_time: string, todays_date: date, whats_time_now: timestamp, converted_date: timestamp]

Subset data between two dates¶

In [114]:

subset = df.filter((col('dob') >= '1985-02-15') & (col('dob') <= '2010-02-28'))
subset.show(2)

+-------+---------------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+---------------+--------------------+
|user_id|    signup_time|  purchase_time|purchase_value|    device_id|source|browser|sex| ip_address|class|       category|      dob|           name|               email|
+-------+---------------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+---------------+--------------------+
|  22058|2/24/2015 22:55| 4/18/2015 2:47|         65278|QVPSPJUOCKZAR|   SEO| Chrome|  M|732758368.8|    0|home_essentials|2/22/1976|Aaron P Maashoh|aaron.maashoh@yah...|
| 182338|1/25/2015 17:49|3/23/2015 23:05|         37926|NRFFPPHZYFUVC|   Ads|     IE|  M|341674739.6|    0|       apparels| 2/6/1984|    Ra J Alperx| ra&alperx@gmail.com|
+-------+---------------+---------------+--------------+-------------+------+-------+---+-----------+-----+---------------+---------+---------------+--------------------+
only showing top 2 rows

Start/End of previous quarter/month/year¶

In [115]:

import pandas as pd
from datetime import *

# create a list of dates
dates = [date(2022, 1, 1), date(2022, 1, 2), date(2022, 1, 3)]

# create a DataFrame with a date column
df = pd.DataFrame({'date_col': dates})

In [116]:

df

Out[116]:

	date_col
0	2022-01-01
1	2022-01-02
2	2022-01-03

In [117]:

df1 = spark.createDataFrame(df)
df1 = df1.withColumn('date', to_timestamp('date_col', 'yyyy-MM-dd'))

# print the PySpark DataFrame schema and contents
df1.printSchema()
df1.show()

root
 |-- date_col: date (nullable = true)
 |-- date: timestamp (nullable = true)

+----------+-------------------+
|  date_col|               date|
+----------+-------------------+
|2022-01-01|2022-01-01 00:00:00|
|2022-01-02|2022-01-02 00:00:00|
|2022-01-03|2022-01-03 00:00:00|
+----------+-------------------+

We can move the date - intnx funcationality in SAS or Trunc in Oracle SQL?¶

In [118]:

#  Let's derive:
#  CMB >>>> Current month beginning
#  CME >>>> Current month ending
#  PMB >>>> Previous month beginning
#  PME >>>> Previous month ending
#  PYB >>>> Previous year beginning
#  PYE >>>> Previous year ending
#  PQB >>>> Previous quarter beginning
#  PQE >>>> Previous quarter ending

#  It is very much Oracle SQL style.

In [119]:

df_dt1.show(2)

+---------------+--------------+-----------+--------------------+-------------------+
|    signup_time| purchase_time|todays_date|      whats_time_now|     converted_date|
+---------------+--------------+-----------+--------------------+-------------------+
|2/24/2015 22:55|4/18/2015 2:47| 2023-04-28|2023-04-28 08:45:...|2015-02-24 22:55:00|
| 6/7/2015 20:39| 6/8/2015 1:38| 2023-04-28|2023-04-28 08:45:...|2015-06-07 20:39:00|
+---------------+--------------+-----------+--------------------+-------------------+
only showing top 2 rows

In [120]:

df_dt1.select('converted_date'
              , trunc('converted_date','Month').alias('CMB')
              , last_day('converted_date').alias('CME')
              , trunc(add_months('converted_date',-1),'Month').alias('PMB')
              , (trunc('converted_date','Month')-1).alias('PME') 
              , trunc((trunc('converted_date','year')-1),'year').alias('PYB')
              , (trunc('converted_date','year')-1).alias('PYE')
              , trunc((trunc('converted_date','quarter')-1),'quarter').alias('PQB')
              , (trunc('converted_date','quarter')-1).alias('PQE')

             ).show(2)

+-------------------+----------+----------+----------+----------+----------+----------+----------+----------+
|     converted_date|       CMB|       CME|       PMB|       PME|       PYB|       PYE|       PQB|       PQE|
+-------------------+----------+----------+----------+----------+----------+----------+----------+----------+
|2015-02-24 22:55:00|2015-02-01|2015-02-28|2015-01-01|2015-01-31|2014-01-01|2014-12-31|2014-10-01|2014-12-31|
|2015-06-07 20:39:00|2015-06-01|2015-06-30|2015-05-01|2015-05-31|2014-01-01|2014-12-31|2015-01-01|2015-03-31|
+-------------------+----------+----------+----------+----------+----------+----------+----------+----------+
only showing top 2 rows

In [121]:

#Applying current date
df_dt1.select( trunc(current_date(),'Month').alias('CMB')
              , last_day(current_date()).alias('CME')
               ,  trunc(add_months(current_date(),-1),'Month').alias('PMB')
               , (trunc(current_date(),'Month')-1).alias('PME') 
               , trunc((trunc(current_date(),'year')-1),'year').alias('PYB')
               , (trunc(current_date(),'year')-1).alias('PYE')
               , trunc((trunc(current_date(),'quarter')-1),'quarter').alias('PQB')
               , (trunc(current_date(),'quarter')-1).alias('PQE')

             ).show(2)

+----------+----------+----------+----------+----------+----------+----------+----------+
|       CMB|       CME|       PMB|       PME|       PYB|       PYE|       PQB|       PQE|
+----------+----------+----------+----------+----------+----------+----------+----------+
|2023-04-01|2023-04-30|2023-03-01|2023-03-31|2022-01-01|2022-12-31|2023-01-01|2023-03-31|
|2023-04-01|2023-04-30|2023-03-01|2023-03-31|2022-01-01|2022-12-31|2023-01-01|2023-03-31|
+----------+----------+----------+----------+----------+----------+----------+----------+
only showing top 2 rows

Let's convert a date to charcter and back to date¶

In [122]:

df_dt2 = df_dt1.select('converted_date'
               , concat(year('converted_date') ,lit('-'),  month('converted_date') ,lit('-'), dayofmonth('converted_date')
             ).alias('char_date'))

In [123]:

df_dt2.show(2)

+-------------------+---------+
|     converted_date|char_date|
+-------------------+---------+
|2015-02-24 22:55:00|2015-2-24|
|2015-06-07 20:39:00| 2015-6-7|
+-------------------+---------+
only showing top 2 rows

In [124]:

df_dt2.dtypes

Out[124]:

[('converted_date', 'timestamp'), ('char_date', 'string')]

In [125]:

df_dt2 = df_dt2.select('*'
                        ,to_date('char_date').alias('back_to_date'))

In [126]:

df_dt2.show(2)

+-------------------+---------+------------+
|     converted_date|char_date|back_to_date|
+-------------------+---------+------------+
|2015-02-24 22:55:00|2015-2-24|  2015-02-24|
|2015-06-07 20:39:00| 2015-6-7|  2015-06-07|
+-------------------+---------+------------+
only showing top 2 rows

In [127]:

df_dt2.dtypes

Out[127]:

[('converted_date', 'timestamp'),
 ('char_date', 'string'),
 ('back_to_date', 'date')]

Module 9 : Arithmetic Operations on PySpark DataFrame¶

In [128]:

df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')

In [129]:

sample_1 = df.select('user_id','browser','sex','class')

sample_1 = sample_1.withColumn('class_plus_10',col('class') + 10)
sample_1 = sample_1.withColumn('class_inverse',1 - col('class') )
sample_1 = sample_1.withColumn('class_multiplied_2',col('class') * 10)
sample_1 = sample_1.withColumn('class_divided',col('class') / 2)
sample_1 = sample_1.withColumn('mod_remainder',(col('class')+3) % 2)
sample_1 = sample_1.select('*',pow('class_plus_10',2).alias('powered'))

sample_1.show(5)

+-------+-------+---+-----+-------------+-------------+------------------+-------------+-------------+-------+
|user_id|browser|sex|class|class_plus_10|class_inverse|class_multiplied_2|class_divided|mod_remainder|powered|
+-------+-------+---+-----+-------------+-------------+------------------+-------------+-------------+-------+
|  22058| Chrome|  M|    0|         10.0|          1.0|               0.0|          0.0|          1.0|  100.0|
| 333320| Chrome|  F|    0|         10.0|          1.0|               0.0|          0.0|          1.0|  100.0|
|   1359|  Opera|  M|    1|         11.0|          0.0|              10.0|          0.5|          0.0|  121.0|
| 150084| Safari|  M|    0|         10.0|          1.0|               0.0|          0.0|          1.0|  100.0|
| 221365| Safari|  M|    0|         10.0|          1.0|               0.0|          0.0|          1.0|  100.0|
+-------+-------+---+-----+-------------+-------------+------------------+-------------+-------------+-------+
only showing top 5 rows

Module 10 : GroupBy() Method for PySpark DataFrame¶

In [130]:

df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
df = df.withColumn("class", col("class").cast("integer"))

In [131]:

sample_1 = df.select('user_id','browser','sex','class')
sample_1.show(5)

+-------+-------+---+-----+
|user_id|browser|sex|class|
+-------+-------+---+-----+
|  22058| Chrome|  M|    0|
| 333320| Chrome|  F|    0|
|   1359|  Opera|  M|    1|
| 150084| Safari|  M|    0|
| 221365| Safari|  M|    0|
+-------+-------+---+-----+
only showing top 5 rows

In [132]:

sample_1.dtypes

Out[132]:

[('user_id', 'string'),
 ('browser', 'string'),
 ('sex', 'string'),
 ('class', 'int')]

In [133]:

summary_data= df.groupBy('browser','sex').sum("class")
summary_data.show()

+-------+----+----------+
|browser| sex|sum(class)|
+-------+----+----------+
|  Opera|   M|         2|
| Safari|   F|         4|
|FireFox|null|         1|
|  Opera|   F|         2|
| Chrome|   M|        23|
|FireFox|   M|        11|
| Chrome|   F|        11|
|     IE|null|         1|
| Chrome|null|         1|
|     IE|   F|        12|
|     IE|   M|        17|
|FireFox|   F|         3|
| Safari|   M|        13|
|  Opera|null|         0|
+-------+----+----------+

In [134]:

# here is a better way:
from pyspark.sql.functions import sum,avg,max,min,mean,count
summary_data  = sample_1.groupBy('browser','sex').agg(sum('class').alias('sum_class') \
                                                      , mean('class').alias('avg_class') \
                                                      , min('class').alias('min_class') \
                                                      , max('class').alias('max_class') \
                                                      , count('class').alias('count_class')
                                                      , countDistinct('class').alias('count_distinct')
                                                     )
summary_data.show()

+-------+----+---------+--------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|           avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+--------------------+---------+---------+-----------+--------------+
|  Opera|   M|        2| 0.18181818181818182|        0|        1|         11|             2|
| Safari|   F|        4|0.058823529411764705|        0|        1|         68|             2|
|FireFox|null|        1|                 0.2|        0|        1|          5|             2|
|  Opera|   F|        2|                0.25|        0|        1|          8|             2|
| Chrome|   M|       23| 0.10550458715596331|        0|        1|        218|             2|
|FireFox|   M|       11| 0.12359550561797752|        0|        1|         89|             2|
| Chrome|   F|       11|  0.0658682634730539|        0|        1|        167|             2|
|     IE|null|        1| 0.16666666666666666|        0|        1|          6|             2|
| Chrome|null|        1| 0.08333333333333333|        0|        1|         12|             2|
|     IE|   F|       12| 0.10909090909090909|        0|        1|        110|             2|
|     IE|   M|       17| 0.11805555555555555|        0|        1|        144|             2|
|FireFox|   F|        3| 0.04918032786885246|        0|        1|         61|             2|
| Safari|   M|       13|  0.1326530612244898|        0|        1|         98|             2|
|  Opera|null|        0|                 0.0|        0|        0|          2|             1|
+-------+----+---------+--------------------+---------+---------+-----------+--------------+

Module 11: Another way to use where condition, first method was explained in Module 5¶

In [135]:

# Here we are learning filtering based on:
# > one criterion
# > AND/OR and NOT condition and filter based on multiple criteria
# > NULL, NOT NULL
# > in (list)
# > startwith and endwith
# > contains
# > like
# > rlike ( for regular expression)

# We can use both keywords >> filter() and where()

In [136]:

summary_data.filter("sex == 'M'").show()

+-------+---+---------+-------------------+---------+---------+-----------+--------------+
|browser|sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+---+---------+-------------------+---------+---------+-----------+--------------+
|  Opera|  M|        2|0.18181818181818182|        0|        1|         11|             2|
| Chrome|  M|       23|0.10550458715596331|        0|        1|        218|             2|
|FireFox|  M|       11|0.12359550561797752|        0|        1|         89|             2|
|     IE|  M|       17|0.11805555555555555|        0|        1|        144|             2|
| Safari|  M|       13| 0.1326530612244898|        0|        1|         98|             2|
+-------+---+---------+-------------------+---------+---------+-----------+--------------+

In [137]:

summary_data.filter((summary_data.sex == "M") & (summary_data.browser == "Chrome") ).show()

+-------+---+---------+-------------------+---------+---------+-----------+--------------+
|browser|sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+---+---------+-------------------+---------+---------+-----------+--------------+
| Chrome|  M|       23|0.10550458715596331|        0|        1|        218|             2|
+-------+---+---------+-------------------+---------+---------+-----------+--------------+

In [138]:

summary_data.filter((summary_data.sex == "M") | (summary_data.browser == "Chrome") ).show()

+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|  Opera|   M|        2|0.18181818181818182|        0|        1|         11|             2|
| Chrome|   M|       23|0.10550458715596331|        0|        1|        218|             2|
|FireFox|   M|       11|0.12359550561797752|        0|        1|         89|             2|
| Chrome|   F|       11| 0.0658682634730539|        0|        1|        167|             2|
| Chrome|null|        1|0.08333333333333333|        0|        1|         12|             2|
|     IE|   M|       17|0.11805555555555555|        0|        1|        144|             2|
| Safari|   M|       13| 0.1326530612244898|        0|        1|         98|             2|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+

In [139]:

summary_data.filter((summary_data.sex != "M")).show()

+-------+---+---------+--------------------+---------+---------+-----------+--------------+
|browser|sex|sum_class|           avg_class|min_class|max_class|count_class|count_distinct|
+-------+---+---------+--------------------+---------+---------+-----------+--------------+
| Safari|  F|        4|0.058823529411764705|        0|        1|         68|             2|
|  Opera|  F|        2|                0.25|        0|        1|          8|             2|
| Chrome|  F|       11|  0.0658682634730539|        0|        1|        167|             2|
|     IE|  F|       12| 0.10909090909090909|        0|        1|        110|             2|
|FireFox|  F|        3| 0.04918032786885246|        0|        1|         61|             2|
+-------+---+---------+--------------------+---------+---------+-----------+--------------+

In [140]:

summary_data.filter( "sex is NULL AND browser is not NULL").show()

+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|FireFox|null|        1|                0.2|        0|        1|          5|             2|
|     IE|null|        1|0.16666666666666666|        0|        1|          6|             2|
| Chrome|null|        1|0.08333333333333333|        0|        1|         12|             2|
|  Opera|null|        0|                0.0|        0|        0|          2|             1|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+

In [141]:

summary_data.filter( "sex is NULL OR browser is not NULL").show()

+-------+----+---------+--------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|           avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+--------------------+---------+---------+-----------+--------------+
|  Opera|   M|        2| 0.18181818181818182|        0|        1|         11|             2|
| Safari|   F|        4|0.058823529411764705|        0|        1|         68|             2|
|FireFox|null|        1|                 0.2|        0|        1|          5|             2|
|  Opera|   F|        2|                0.25|        0|        1|          8|             2|
| Chrome|   M|       23| 0.10550458715596331|        0|        1|        218|             2|
|FireFox|   M|       11| 0.12359550561797752|        0|        1|         89|             2|
| Chrome|   F|       11|  0.0658682634730539|        0|        1|        167|             2|
|     IE|null|        1| 0.16666666666666666|        0|        1|          6|             2|
| Chrome|null|        1| 0.08333333333333333|        0|        1|         12|             2|
|     IE|   F|       12| 0.10909090909090909|        0|        1|        110|             2|
|     IE|   M|       17| 0.11805555555555555|        0|        1|        144|             2|
|FireFox|   F|        3| 0.04918032786885246|        0|        1|         61|             2|
| Safari|   M|       13|  0.1326530612244898|        0|        1|         98|             2|
|  Opera|null|        0|                 0.0|        0|        0|          2|             1|
+-------+----+---------+--------------------+---------+---------+-----------+--------------+

In [142]:

list_1 = ["Chrome","FireFox"]
summary_data.filter(summary_data.browser.isin(list_1)).show()

+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|FireFox|null|        1|                0.2|        0|        1|          5|             2|
| Chrome|   M|       23|0.10550458715596331|        0|        1|        218|             2|
|FireFox|   M|       11|0.12359550561797752|        0|        1|         89|             2|
| Chrome|   F|       11| 0.0658682634730539|        0|        1|        167|             2|
| Chrome|null|        1|0.08333333333333333|        0|        1|         12|             2|
|FireFox|   F|        3|0.04918032786885246|        0|        1|         61|             2|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+

In [143]:

list_1 = ["Chrome","FireFox"]
summary_data.filter(summary_data.browser.isin(list_1) == False).show()

+-------+----+---------+--------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|           avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+--------------------+---------+---------+-----------+--------------+
|  Opera|   M|        2| 0.18181818181818182|        0|        1|         11|             2|
| Safari|   F|        4|0.058823529411764705|        0|        1|         68|             2|
|  Opera|   F|        2|                0.25|        0|        1|          8|             2|
|     IE|null|        1| 0.16666666666666666|        0|        1|          6|             2|
|     IE|   F|       12| 0.10909090909090909|        0|        1|        110|             2|
|     IE|   M|       17| 0.11805555555555555|        0|        1|        144|             2|
| Safari|   M|       13|  0.1326530612244898|        0|        1|         98|             2|
|  Opera|null|        0|                 0.0|        0|        0|          2|             1|
+-------+----+---------+--------------------+---------+---------+-----------+--------------+

In [144]:

list_1 = ["Chrome","FireFox"]
summary_data.filter(~summary_data.browser.isin(list_1)).show()

+-------+----+---------+--------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|           avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+--------------------+---------+---------+-----------+--------------+
|  Opera|   M|        2| 0.18181818181818182|        0|        1|         11|             2|
| Safari|   F|        4|0.058823529411764705|        0|        1|         68|             2|
|  Opera|   F|        2|                0.25|        0|        1|          8|             2|
|     IE|null|        1| 0.16666666666666666|        0|        1|          6|             2|
|     IE|   F|       12| 0.10909090909090909|        0|        1|        110|             2|
|     IE|   M|       17| 0.11805555555555555|        0|        1|        144|             2|
| Safari|   M|       13|  0.1326530612244898|        0|        1|         98|             2|
|  Opera|null|        0|                 0.0|        0|        0|          2|             1|
+-------+----+---------+--------------------+---------+---------+-----------+--------------+

In [145]:

summary_data.filter(summary_data.browser.startswith("O")).show()

+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|  Opera|   M|        2|0.18181818181818182|        0|        1|         11|             2|
|  Opera|   F|        2|               0.25|        0|        1|          8|             2|
|  Opera|null|        0|                0.0|        0|        0|          2|             1|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+

In [146]:

summary_data.filter(summary_data.browser.endswith("e")).show()

+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+
| Chrome|   M|       23|0.10550458715596331|        0|        1|        218|             2|
| Chrome|   F|       11| 0.0658682634730539|        0|        1|        167|             2|
| Chrome|null|        1|0.08333333333333333|        0|        1|         12|             2|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+

In [147]:

summary_data.filter(summary_data.browser.contains("hr")).show()

+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+
| Chrome|   M|       23|0.10550458715596331|        0|        1|        218|             2|
| Chrome|   F|       11| 0.0658682634730539|        0|        1|        167|             2|
| Chrome|null|        1|0.08333333333333333|        0|        1|         12|             2|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+

In [148]:

summary_data.filter(summary_data.browser.like("%hr%")).show()

+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+
| Chrome|   M|       23|0.10550458715596331|        0|        1|        218|             2|
| Chrome|   F|       11| 0.0658682634730539|        0|        1|        167|             2|
| Chrome|null|        1|0.08333333333333333|        0|        1|         12|             2|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+

In [149]:

summary_data.filter(summary_data.browser.rlike("\w+Fox")).show()

+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|FireFox|null|        1|                0.2|        0|        1|          5|             2|
|FireFox|   M|       11|0.12359550561797752|        0|        1|         89|             2|
|FireFox|   F|        3|0.04918032786885246|        0|        1|         61|             2|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+

Module 12 : WHEN function in PySpark¶

In [150]:

df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')

In [151]:

from pyspark.sql.functions import sum,avg,max,min,mean,count
summary_data  = df.groupBy('browser','sex').agg(sum('class').alias('sum_class') \
                                                      , mean('class').alias('avg_class') \
                                                      , min('class').alias('min_class') \
                                                      , max('class').alias('max_class') \
                                                      , count('class').alias('count_class')
                                                      , countDistinct('class').alias('count_distinct')
                                                     )
summary_data.show(5)

+-------+----+---------+-------------------+---------+---------+-----------+--------------+
|browser| sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+
| Chrome|null|      1.0|0.08333333333333333|        0|        1|         12|             2|
| Chrome|   F|     11.0| 0.0658682634730539|        0|        1|        167|             2|
| Chrome|   M|     23.0|0.10550458715596331|        0|        1|        218|             2|
|FireFox|null|      1.0|                0.2|        0|        1|          5|             2|
|FireFox|   F|      3.0|0.04918032786885246|        0|        1|         61|             2|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+
only showing top 5 rows

In [152]:

summary_data = summary_data.withColumn('gender', when(summary_data.sex == "M", "Male")
                               .when(summary_data.sex == "F", "Female")
                               .when(summary_data.sex.isNull(), "Other")
                               .otherwise("check") )
summary_data.show(5)

+-------+----+---------+-------------------+---------+---------+-----------+--------------+------+
|browser| sex|sum_class|          avg_class|min_class|max_class|count_class|count_distinct|gender|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+------+
| Chrome|null|      1.0|0.08333333333333333|        0|        1|         12|             2| Other|
| Chrome|   F|     11.0| 0.0658682634730539|        0|        1|        167|             2|Female|
| Chrome|   M|     23.0|0.10550458715596331|        0|        1|        218|             2|  Male|
|FireFox|null|      1.0|                0.2|        0|        1|          5|             2| Other|
|FireFox|   F|      3.0|0.04918032786885246|        0|        1|         61|             2|Female|
+-------+----+---------+-------------------+---------+---------+-----------+--------------+------+
only showing top 5 rows

Module 13 : rank(), dense_rank(), row_number(), percent_rank() in PySpark¶

In [153]:

# These three are quite pouplar functions in SQL, which are used to create rank/counter variables

In [154]:

data_rank= spark.read.option('header','true').csv('input_datsets/Rank_data.csv')
data_rank.show()

+------+---------+-----+
|  Name|  Subject|Marks|
+------+---------+-----+
|   Sam|  Physics|   76|
|   Sam|Chemistry|   90|
|   Sam|    Maths|   90|
| Robby|  Physics|   89|
| Robby|Chemistry|   90|
| Robby|    Maths|   87|
|Merlin|  Physics|   84|
|Merlin|Chemistry|   95|
|Merlin|    Maths|   90|
+------+---------+-----+

In [155]:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec  = Window.partitionBy("Subject").orderBy(col("Marks").desc())
data_rank = data_rank.withColumn('Row_Number',row_number().over(windowSpec))
data_rank.show()

+------+---------+-----+----------+
|  Name|  Subject|Marks|Row_Number|
+------+---------+-----+----------+
|Merlin|Chemistry|   95|         1|
|   Sam|Chemistry|   90|         2|
| Robby|Chemistry|   90|         3|
|   Sam|    Maths|   90|         1|
|Merlin|    Maths|   90|         2|
| Robby|    Maths|   87|         3|
| Robby|  Physics|   89|         1|
|Merlin|  Physics|   84|         2|
|   Sam|  Physics|   76|         3|
+------+---------+-----+----------+

In [156]:

data_rank = data_rank.withColumn('RANK',rank().over(windowSpec))
data_rank.show()

+------+---------+-----+----------+----+
|  Name|  Subject|Marks|Row_Number|RANK|
+------+---------+-----+----------+----+
|Merlin|Chemistry|   95|         1|   1|
|   Sam|Chemistry|   90|         2|   2|
| Robby|Chemistry|   90|         3|   2|
|   Sam|    Maths|   90|         1|   1|
|Merlin|    Maths|   90|         2|   1|
| Robby|    Maths|   87|         3|   3|
| Robby|  Physics|   89|         1|   1|
|Merlin|  Physics|   84|         2|   2|
|   Sam|  Physics|   76|         3|   3|
+------+---------+-----+----------+----+

In [157]:

data_rank = data_rank.withColumn('DENSE_RANK',dense_rank().over(windowSpec))
data_rank.show()

+------+---------+-----+----------+----+----------+
|  Name|  Subject|Marks|Row_Number|RANK|DENSE_RANK|
+------+---------+-----+----------+----+----------+
|Merlin|Chemistry|   95|         1|   1|         1|
|   Sam|Chemistry|   90|         2|   2|         2|
| Robby|Chemistry|   90|         3|   2|         2|
|   Sam|    Maths|   90|         1|   1|         1|
|Merlin|    Maths|   90|         2|   1|         1|
| Robby|    Maths|   87|         3|   3|         2|
| Robby|  Physics|   89|         1|   1|         1|
|Merlin|  Physics|   84|         2|   2|         2|
|   Sam|  Physics|   76|         3|   3|         3|
+------+---------+-----+----------+----+----------+

In [158]:

data_rank = data_rank.withColumn('PERCENT_RANK',percent_rank().over(windowSpec))
data_rank.show()

+------+---------+-----+----------+----+----------+------------+
|  Name|  Subject|Marks|Row_Number|RANK|DENSE_RANK|PERCENT_RANK|
+------+---------+-----+----------+----+----------+------------+
|Merlin|Chemistry|   95|         1|   1|         1|         0.0|
|   Sam|Chemistry|   90|         2|   2|         2|         0.5|
| Robby|Chemistry|   90|         3|   2|         2|         0.5|
|   Sam|    Maths|   90|         1|   1|         1|         0.0|
|Merlin|    Maths|   90|         2|   1|         1|         0.0|
| Robby|    Maths|   87|         3|   3|         2|         1.0|
| Robby|  Physics|   89|         1|   1|         1|         0.0|
|Merlin|  Physics|   84|         2|   2|         2|         0.5|
|   Sam|  Physics|   76|         3|   3|         3|         1.0|
+------+---------+-----+----------+----+----------+------------+

In [159]:

data_rank = data_rank.withColumn('NTILE',ntile(2).over(windowSpec))
data_rank.show()

+------+---------+-----+----------+----+----------+------------+-----+
|  Name|  Subject|Marks|Row_Number|RANK|DENSE_RANK|PERCENT_RANK|NTILE|
+------+---------+-----+----------+----+----------+------------+-----+
|Merlin|Chemistry|   95|         1|   1|         1|         0.0|    1|
|   Sam|Chemistry|   90|         2|   2|         2|         0.5|    1|
| Robby|Chemistry|   90|         3|   2|         2|         0.5|    2|
|   Sam|    Maths|   90|         1|   1|         1|         0.0|    1|
|Merlin|    Maths|   90|         2|   1|         1|         0.0|    1|
| Robby|    Maths|   87|         3|   3|         2|         1.0|    2|
| Robby|  Physics|   89|         1|   1|         1|         0.0|    1|
|Merlin|  Physics|   84|         2|   2|         2|         0.5|    1|
|   Sam|  Physics|   76|         3|   3|         3|         1.0|    2|
+------+---------+-----+----------+----+----------+------------+-----+

In [160]:

data_rank = data_rank.withColumn('CUME_DIST',cume_dist().over(windowSpec))
data_rank.show()

+------+---------+-----+----------+----+----------+------------+-----+------------------+
|  Name|  Subject|Marks|Row_Number|RANK|DENSE_RANK|PERCENT_RANK|NTILE|         CUME_DIST|
+------+---------+-----+----------+----+----------+------------+-----+------------------+
|Merlin|Chemistry|   95|         1|   1|         1|         0.0|    1|0.3333333333333333|
|   Sam|Chemistry|   90|         2|   2|         2|         0.5|    1|               1.0|
| Robby|Chemistry|   90|         3|   2|         2|         0.5|    2|               1.0|
|   Sam|    Maths|   90|         1|   1|         1|         0.0|    1|0.6666666666666666|
|Merlin|    Maths|   90|         2|   1|         1|         0.0|    1|0.6666666666666666|
| Robby|    Maths|   87|         3|   3|         2|         1.0|    2|               1.0|
| Robby|  Physics|   89|         1|   1|         1|         0.0|    1|0.3333333333333333|
|Merlin|  Physics|   84|         2|   2|         2|         0.5|    1|0.6666666666666666|
|   Sam|  Physics|   76|         3|   3|         3|         1.0|    2|               1.0|
+------+---------+-----+----------+----+----------+------------+-----+------------------+

Module 14 : Data management using PySpark¶

1. Merging Data¶

Merging data involves combining two or more data sets into a single data set based on common columns or indexes.¶

In [161]:

import pandas as pd
# create two data frames
df1 = pd.DataFrame({'id': ['1001', '1002', '1003'], 'name':   ['abc', 'mno', 'opq']})
df2 = pd.DataFrame({'id': ['1001', '1002', '1004'], 'salary': [1100.25, 2200.50, 3300.20]})

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_py1 = spark.createDataFrame(df1)
df_py2 = spark.createDataFrame(df2)

In [162]:

df_py1.show()

+----+----+
|  id|name|
+----+----+
|1001| abc|
|1002| mno|
|1003| opq|
+----+----+

In [163]:

df_py2.show()

+----+-------+
|  id| salary|
+----+-------+
|1001|1100.25|
|1002| 2200.5|
|1004| 3300.2|
+----+-------+

In [164]:

inner_data = df_py1.join(df_py2,df_py1.id ==  df_py2.id,"inner")
inner_data.show()

+----+----+----+-------+
|  id|name|  id| salary|
+----+----+----+-------+
|1001| abc|1001|1100.25|
|1002| mno|1002| 2200.5|
+----+----+----+-------+

In [165]:

left_data = df_py1.join(df_py2,df_py1.id ==  df_py2.id,"left")
left_data.show()

+----+----+----+-------+
|  id|name|  id| salary|
+----+----+----+-------+
|1003| opq|null|   null|
|1002| mno|1002| 2200.5|
|1001| abc|1001|1100.25|
+----+----+----+-------+

In [166]:

left_data = df_py1.join(df_py2,df_py1.id ==  df_py2.id,"full")
left_data.show()

+----+----+----+-------+
|  id|name|  id| salary|
+----+----+----+-------+
|1001| abc|1001|1100.25|
|1002| mno|1002| 2200.5|
|1003| opq|null|   null|
|null|null|1004| 3300.2|
+----+----+----+-------+

Simpler way is to use SQL Join to write complex joins¶

It has been my experience that joining is most vulnerable data operation, I have seen that most of the data goes wrong while joining multiple data tables.¶

In [167]:

#First dataframes should be registed to SQL
df_py1.createOrReplaceTempView("SQL_1")
df_py2.createOrReplaceTempView("SQL_2")

#Now data tables can be joined usng classical SQL queries:
joined_data1 = spark.sql("select coalesce(a.id, b.id) as ID \
        , a.name \
        , b.salary \
        from SQL_1 a \
        full join SQL_2 b \
        on a.id == b.id" \
         )

In [168]:

joined_data1.show()

+----+----+-------+
|  ID|name| salary|
+----+----+-------+
|1001| abc|1100.25|
|1002| mno| 2200.5|
|1003| opq|   null|
|1004|null| 3300.2|
+----+----+-------+

2. Appending Data¶

Appending means vertical stacking of data below each other¶

In [169]:

import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['1', '2', '3', '4']})
df2 = pd.DataFrame({'col1': [5, 6, 7], 'col3': [1005, 1006, 1007]})

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_py1 = spark.createDataFrame(df1)
df_py2 = spark.createDataFrame(df2)

In [170]:

df_py1.show()

+----+----+
|col1|col2|
+----+----+
|   1|   1|
|   2|   2|
|   3|   3|
|   4|   4|
+----+----+

In [171]:

df_py2.show()

+----+----+
|col1|col3|
+----+----+
|   5|1005|
|   6|1006|
|   7|1007|
+----+----+

In [172]:

union_DF = df_py2.union(df_py1)
union_DF.show()

+----+----+
|col1|col3|
+----+----+
|   5|1005|
|   6|1006|
|   7|1007|
|   1|   1|
|   2|   2|
|   3|   3|
|   4|   4|
+----+----+

In [173]:

union_DF.dtypes

Out[173]:

[('col1', 'bigint'), ('col3', 'string')]

In [174]:

union_DF = df_py1.union(df_py2)
union_DF.show()

+----+----+
|col1|col2|
+----+----+
|   1|   1|
|   2|   2|
|   3|   3|
|   4|   4|
|   5|1005|
|   6|1006|
|   7|1007|
+----+----+

In [175]:

union_DF.dtypes

Out[175]:

[('col1', 'bigint'), ('col2', 'string')]

Module 15 : Let's revise - how to write SQL Qery in pyspark¶

In [176]:

df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')

# First we need to register DataFrame as a SQL temporary view
sample_1 = df.select('user_id','browser','sex','class')
sample_1.createOrReplaceTempView("sample_sql")

# You can write query in multiple lines, using a \, which defines End of the Line
new_data = spark.sql(
          "SELECT sex \
          , class \
          , class+10 as class_plus_10 \
          , class * 2 as class_multiplied_2 \
          , 1- class  as class_inversed \
          , class/2  as class_divided \
          , pow(class+2,  2)  as class_power \
          ,  (class+3) % 2 as mod_class \
          FROM sample_sql"
          
         )

In [177]:

new_data.show(5)

+---+-----+-------------+------------------+--------------+-------------+-----------+---------+
|sex|class|class_plus_10|class_multiplied_2|class_inversed|class_divided|class_power|mod_class|
+---+-----+-------------+------------------+--------------+-------------+-----------+---------+
|  M|    0|         10.0|               0.0|           1.0|          0.0|        4.0|      1.0|
|  F|    0|         10.0|               0.0|           1.0|          0.0|        4.0|      1.0|
|  M|    1|         11.0|               2.0|           0.0|          0.5|        9.0|      0.0|
|  M|    0|         10.0|               0.0|           1.0|          0.0|        4.0|      1.0|
|  M|    0|         10.0|               0.0|           1.0|          0.0|        4.0|      1.0|
+---+-----+-------------+------------------+--------------+-------------+-----------+---------+
only showing top 5 rows

In [178]:

new_data = spark.sql(
          "SELECT sex \
          , browser \
          , sum(class) as class_sum \
          FROM sample_sql \
          where sex = 'M' \
          group by sex , browser \
          having class_sum > 13 \
          "
          )
new_data.show()

+---+-------+---------+
|sex|browser|class_sum|
+---+-------+---------+
|  M|     IE|     17.0|
|  M| Chrome|     23.0|
+---+-------+---------+

Module 16 : Miscellaneous¶

In [179]:

# Summarize data in SAS Proc Transpose or Pivot Table of Microsoft Excel

In [180]:

df = spark.read.option('header','true').csv('input_datsets/Sample_Fraud_Detection.csv')
sample_1 = df.select('user_id','browser','sex','class')
sample_1.show(5)

+-------+-------+---+-----+
|user_id|browser|sex|class|
+-------+-------+---+-----+
|  22058| Chrome|  M|    0|
| 333320| Chrome|  F|    0|
|   1359|  Opera|  M|    1|
| 150084| Safari|  M|    0|
| 221365| Safari|  M|    0|
+-------+-------+---+-----+
only showing top 5 rows

In [181]:

sample_1 = sample_1.withColumn("class", col("class").cast("integer"))

In [182]:

transpose = sample_1.groupBy("browser").pivot("sex").sum("class")
transpose.show()

+-------+----+---+---+
|browser|null|  F|  M|
+-------+----+---+---+
|FireFox|   1|  3| 11|
| Safari|null|  4| 13|
|     IE|   1| 12| 17|
| Chrome|   1| 11| 23|
|  Opera|   0|  2|  2|
+-------+----+---+---+

In [183]:

transpose = sample_1.groupBy("browser").pivot("sex",["F","M"]).sum("class")
transpose.show()

+-------+---+---+
|browser|  F|  M|
+-------+---+---+
|FireFox|  3| 11|
| Safari|  4| 13|
|     IE| 12| 17|
| Chrome| 11| 23|
|  Opera|  2|  2|
+-------+---+---+

In [184]:

transpose = sample_1[(df.browser == 'Chrome') | (df.browser == "Safari")].groupBy("browser").pivot("sex",["F","M"]).sum("class")
transpose.show()

+-------+---+---+
|browser|  F|  M|
+-------+---+---+
| Safari|  4| 13|
| Chrome| 11| 23|
+-------+---+---+

In [185]:

transpose = sample_1.groupBy("browser").pivot("sex",["F","M"]).sum("class")
transpose.show()
#Sorting data in descending order of crowser and asceding order of F,
# in this case second one is redundant, but it is to demonstrated that we can use multiple columns at once
transpose = transpose.orderBy(transpose.browser.desc(), transpose.F)
transpose.show()

+-------+---+---+
|browser|  F|  M|
+-------+---+---+
|FireFox|  3| 11|
| Safari|  4| 13|
|     IE| 12| 17|
| Chrome| 11| 23|
|  Opera|  2|  2|
+-------+---+---+

+-------+---+---+
|browser|  F|  M|
+-------+---+---+
| Safari|  4| 13|
|  Opera|  2|  2|
|     IE| 12| 17|
|FireFox|  3| 11|
| Chrome| 11| 23|
+-------+---+---+

Pages

PySpark