Data exploration and Sub-setting in Python

Python Tutorial 1.0


It took me a long time to write this article, and I apologize for that. I was busy with routine tasks, but from now on I will try to blog more regularly.

In the previous article on Python, we covered a few basics of Python. In this article, we will learn how to explore and subset data in Python.


# Let's first create a dataset with 5 columns and 8 observations

import pandas as pd   # pd is the conventional alias for pandas

Namelist = ["A","B","C","D","E","F","G","H"]
Agelist = [22,23,24,20,19,24,22,23]
Genderlist = ["M","F","M","F","M","F","M","F"]
Earninglist = [800,700,500,1000,1100,800,700,600]
Expenselist = [100,110,120,110,130,90,100,80]

# Dictionary that maps each column name to its list of values
Data_1 = {'Name': Namelist,
          'Age': Agelist,
          'Gender' : Genderlist,
          'Earning': Earninglist,
          'Expense': Expenselist}

# Build the DataFrame; columns= fixes the order of the columns
Data_2 = pd.DataFrame(Data_1,columns=['Name','Age','Gender','Earning','Expense'])



Now I want to check how many rows and columns the data has.

Code:     Data_2.shape

Here is the answer in console :

 (8, 5)

It means 8 rows and 5 columns.
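Since shape is a plain tuple, it can also be unpacked into two variables. Here is a small sketch of that; it rebuilds the same data so that it runs on its own, and the names n_rows and n_cols are only for illustration:

```python
import pandas as pd

# Same data as above, rebuilt so that this snippet runs on its own
Data_2 = pd.DataFrame({
    'Name': ["A", "B", "C", "D", "E", "F", "G", "H"],
    'Age': [22, 23, 24, 20, 19, 24, 22, 23],
    'Gender': ["M", "F", "M", "F", "M", "F", "M", "F"],
    'Earning': [800, 700, 500, 1000, 1100, 800, 700, 600],
    'Expense': [100, 110, 120, 110, 130, 90, 100, 80],
})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = Data_2.shape
print(n_rows, "rows and", n_cols, "columns")

# len() on a DataFrame gives the number of rows only
print(len(Data_2))
```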



I want to check a sample of the data.



Code :   
Data_2.head()     # will show top 5 observations of data
or
Data_2.tail()       # will show bottom 5 observations of data

and you can see the sample.

You can also specify the number of observations.


For example: Data_2.head(10) will show the top 10 observations of the data. Our data has only 8 rows, so in this case all 8 are shown.
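To see how the number passed to head() and tail() behaves on our small data, here is a short sketch. It rebuilds the same data so that it runs on its own; the variable names top3, bottom2 and all_rows are only for illustration:

```python
import pandas as pd

# Same data as above, rebuilt so that this snippet runs on its own
Data_2 = pd.DataFrame({
    'Name': ["A", "B", "C", "D", "E", "F", "G", "H"],
    'Age': [22, 23, 24, 20, 19, 24, 22, 23],
    'Gender': ["M", "F", "M", "F", "M", "F", "M", "F"],
    'Earning': [800, 700, 500, 1000, 1100, 800, 700, 600],
    'Expense': [100, 110, 120, 110, 130, 90, 100, 80],
})

top3 = Data_2.head(3)        # first 3 rows: members A, B, C
bottom2 = Data_2.tail(2)     # last 2 rows: members G, H
all_rows = Data_2.head(10)   # asking for more rows than exist returns all 8

print(top3)
print(bottom2)
print(len(all_rows))
```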



I just want the column names of the data.




Code :     Data_2.columns

Here is the answer in console :

Out[1]: Index(['Name', 'Age', 'Gender', 'Earning', 'Expense'], dtype='object')
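If you need the names as an ordinary Python list rather than an Index object, something like the following should work. It is only a sketch on the same data, and column_names is an illustrative name:

```python
import pandas as pd

# Same data as above, rebuilt so that this snippet runs on its own
Data_2 = pd.DataFrame({
    'Name': ["A", "B", "C", "D", "E", "F", "G", "H"],
    'Age': [22, 23, 24, 20, 19, 24, 22, 23],
    'Gender': ["M", "F", "M", "F", "M", "F", "M", "F"],
    'Earning': [800, 700, 500, 1000, 1100, 800, 700, 600],
    'Expense': [100, 110, 120, 110, 130, 90, 100, 80],
})

# Convert the Index object into a plain Python list
column_names = Data_2.columns.tolist()
print(column_names)

# Check whether a particular column exists
has_age = 'Age' in Data_2.columns
print(has_age)
```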




I want to see specific rows/columns of the data.

Remember the indexing rule: in Python, indexing starts from 0. So, the first row or column is referred to by 0.



We can select a range of data using either label-based or integer-based indexing:

loc : label based selection (with the default index, the row labels are simply the integers 0, 1, 2, ...)
iloc : integer position based selection


Code :     Data_2.iloc[1:4, :]

Will show the 2nd, 3rd and 4th observations and all columns.

Code :     Data_2.iloc[:, [0,2]]

Will show all rows of the 1st and 3rd columns.


Code :     Data_2.iloc[[1,2,3], 0:3]
Will show the 2nd, 3rd and 4th observations of the 1st to 3rd columns.


To select all rows and the two columns 'Name' and 'Gender' :
     
Data_2.loc[:,['Name','Gender']]                   
 # We need to use the loc indexer to specify label-based indexing
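One detail that is easy to miss: a slice in iloc leaves out its end position, while a slice in loc includes its end label. The sketch below selects the same block of data both ways on our data; by_position and by_label are illustrative names:

```python
import pandas as pd

# Same data as above, rebuilt so that this snippet runs on its own
Data_2 = pd.DataFrame({
    'Name': ["A", "B", "C", "D", "E", "F", "G", "H"],
    'Age': [22, 23, 24, 20, 19, 24, 22, 23],
    'Gender': ["M", "F", "M", "F", "M", "F", "M", "F"],
    'Earning': [800, 700, 500, 1000, 1100, 800, 700, 600],
    'Expense': [100, 110, 120, 110, 130, 90, 100, 80],
})

# iloc: positions 1, 2, 3 (4 is excluded) and columns 0, 1, 2 (3 is excluded)
by_position = Data_2.iloc[1:4, 0:3]

# loc: row labels 1 to 3 and columns 'Name' to 'Gender', both ends included
by_label = Data_2.loc[1:3, 'Name':'Gender']

print(by_position)
print(by_position.equals(by_label))   # True: both select the same rows and columns
```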





Keeping variables :


Suppose you want to keep the 1st and 3rd variables 
Data_3 = Data_2.iloc[: ,[0,2]]

We can also specify the names of the columns

Data_2.loc[:,['Name','Gender']]                    
# We need to use the loc indexer to specify label-based indexing


Simplest way is :

Data_2[['Age','Gender']]
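The three styles above give the same result when they point at the same columns. The sketch below checks that on our data, and also shows that single brackets return one column as a Series while double brackets return a DataFrame. All variable names here are only for illustration:

```python
import pandas as pd

# Same data as above, rebuilt so that this snippet runs on its own
Data_2 = pd.DataFrame({
    'Name': ["A", "B", "C", "D", "E", "F", "G", "H"],
    'Age': [22, 23, 24, 20, 19, 24, 22, 23],
    'Gender': ["M", "F", "M", "F", "M", "F", "M", "F"],
    'Earning': [800, 700, 500, 1000, 1100, 800, 700, 600],
    'Expense': [100, 110, 120, 110, 130, 90, 100, 80],
})

# Three ways of keeping 'Name' and 'Gender'
keep_by_position = Data_2.iloc[:, [0, 2]]
keep_by_label = Data_2.loc[:, ['Name', 'Gender']]
keep_by_brackets = Data_2[['Name', 'Gender']]

print(keep_by_position.equals(keep_by_label))
print(keep_by_label.equals(keep_by_brackets))

# Single brackets give a Series, double brackets give a DataFrame
age_series = Data_2['Age']
age_frame = Data_2[['Age']]
print(type(age_series).__name__, type(age_frame).__name__)
```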



Dropping variables :

Suppose you want to drop the 1st and 3rd variables ('Name' and 'Gender')
Data_4 = Data_2.drop(['Name','Gender'],axis=1)
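Note that drop returns a new DataFrame and leaves the original untouched. Below is a small sketch that shows this, using the columns= keyword, which is an alternative to passing axis=1:

```python
import pandas as pd

# Same data as above, rebuilt so that this snippet runs on its own
Data_2 = pd.DataFrame({
    'Name': ["A", "B", "C", "D", "E", "F", "G", "H"],
    'Age': [22, 23, 24, 20, 19, 24, 22, 23],
    'Gender': ["M", "F", "M", "F", "M", "F", "M", "F"],
    'Earning': [800, 700, 500, 1000, 1100, 800, 700, 600],
    'Expense': [100, 110, 120, 110, 130, 90, 100, 80],
})

# columns=[...] does the same job as passing the list together with axis=1
Data_4 = Data_2.drop(columns=['Name', 'Gender'])

print(Data_4.columns.tolist())   # the remaining columns
print(Data_2.columns.tolist())   # the original still has all 5 columns
```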







Adding variable :
Suppose you want to add a new variable or column
Data_2['Saving'] = [700, 590, 380, 890, 970, 710, 600, 520]



Creating new variable using existing ones :


Data_2['New']=Data_2['Earning']-Data_2['Expense']
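In our data, the 'Saving' values typed by hand are exactly Earning minus Expense, so the computed column 'New' should match them. The sketch below checks this. It also uses assign(), which returns a copy with an extra column instead of changing Data_2; the column name 'ExpenseShare' is made up for this example:

```python
import pandas as pd

# Same data as above, rebuilt so that this snippet runs on its own
Data_2 = pd.DataFrame({
    'Name': ["A", "B", "C", "D", "E", "F", "G", "H"],
    'Age': [22, 23, 24, 20, 19, 24, 22, 23],
    'Gender': ["M", "F", "M", "F", "M", "F", "M", "F"],
    'Earning': [800, 700, 500, 1000, 1100, 800, 700, 600],
    'Expense': [100, 110, 120, 110, 130, 90, 100, 80],
})

# Column typed by hand, and the same column computed from existing ones
Data_2['Saving'] = [700, 590, 380, 890, 970, 710, 600, 520]
Data_2['New'] = Data_2['Earning'] - Data_2['Expense']

# Both columns hold the same values, row by row
print((Data_2['New'] == Data_2['Saving']).all())   # True

# assign() returns a new DataFrame; 'ExpenseShare' is a hypothetical column name
with_share = Data_2.assign(ExpenseShare=Data_2['Expense'] / Data_2['Earning'])
print(with_share.columns.tolist())
```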









All right ... Let's practice logical sub-setting now.

& is used for AND logical operations.
|   is used for OR logical operations.
~ is used for NOT logical operations.
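Each condition should sit in its own parentheses, because in Python & and | bind more tightly than comparisons like >= or ==. The plain Python keywords and / or do not work on whole columns. Here is a sketch on our data; the threshold 800 and the names high_earner and female are chosen only for illustration:

```python
import pandas as pd

# Same data as above, rebuilt so that this snippet runs on its own
Data_2 = pd.DataFrame({
    'Name': ["A", "B", "C", "D", "E", "F", "G", "H"],
    'Age': [22, 23, 24, 20, 19, 24, 22, 23],
    'Gender': ["M", "F", "M", "F", "M", "F", "M", "F"],
    'Earning': [800, 700, 500, 1000, 1100, 800, 700, 600],
    'Expense': [100, 110, 120, 110, 130, 90, 100, 80],
})

# Each comparison gives a column of True/False values, one per row
high_earner = (Data_2['Earning'] >= 800)
female = (Data_2['Gender'] == 'F')

both = Data_2[high_earner & female]     # AND: rows where both are True
either = Data_2[high_earner | female]   # OR: rows where at least one is True

print(both['Name'].tolist())     # ['D', 'F']
print(either['Name'].tolist())   # ['A', 'B', 'D', 'E', 'F', 'H']

# The keyword "and" cannot combine whole columns and raises a ValueError
try:
    Data_2[high_earner and female]
    used_keyword_ok = True
except ValueError as err:
    used_keyword_ok = False
    print("ValueError:", err)
```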


Let's try to answer a few questions :


1. Select the members whose earning is 500.


Data_2[(Data_2.Earning==500)]



2. Select the members whose earning is greater than or equal to 1000 and age >= 20.



Data_2[(Data_2.Earning>=1000)&(Data_2.Age>=20)]


3. Select the data for members A, B and C.


Data_2[Data_2['Name'].isin(['A','B','C'])]


4. Select the data for all the members except A, B and C.

Data_2[~Data_2['Name'].isin(['A','B','C'])]


5. Select the data of all the rows in which gender is not equal to 'M'.


Data_2[~Data_2['Gender'].isin(['M'])]
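To check the answers, it helps to look at which members each filter returns. The sketch below does this for questions 2 and 5 on our data, and shows that for a single value the != operator gives the same rows as ~isin(). The variable names are only for illustration:

```python
import pandas as pd

# Same data as above, rebuilt so that this snippet runs on its own
Data_2 = pd.DataFrame({
    'Name': ["A", "B", "C", "D", "E", "F", "G", "H"],
    'Age': [22, 23, 24, 20, 19, 24, 22, 23],
    'Gender': ["M", "F", "M", "F", "M", "F", "M", "F"],
    'Earning': [800, 700, 500, 1000, 1100, 800, 700, 600],
    'Expense': [100, 110, 120, 110, 130, 90, 100, 80],
})

# Question 2: earning >= 1000 and age >= 20
# E earns 1100 but is 19 years old, so only D remains
answer_2 = Data_2[(Data_2.Earning >= 1000) & (Data_2.Age >= 20)]
print(answer_2['Name'].tolist())   # ['D']

# Question 5: two ways of writing "gender is not 'M'"
not_m_isin = Data_2[~Data_2['Gender'].isin(['M'])]
not_m_ne = Data_2[Data_2['Gender'] != 'M']
print(not_m_ne['Name'].tolist())      # ['B', 'D', 'F', 'H']
print(not_m_isin.equals(not_m_ne))    # True
```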




I am finishing the article here. In the next article we will cover "How to modify data with IF-Else conditions in Python". Till then ...

Enjoy reading our other articles and stay tuned with us.


Kindly do provide your feedback in the 'Comments' Section and share as much as possible.