Data Exploration and Sub-setting in R

R Tutorial 3.0

In this series of R tutorials, we have so far covered the basics of R and the importing and exporting of data. Now that the data resides in the R environment, let's do something with it. Shall we explore a few functions for exploring and sub-setting a dataset?

Let's use one of R's built-in datasets for this tutorial.

Create a new data frame "Data_1" as a copy of the "iris" data, using this simple R syntax:

Data_1 = iris

Voila! The data is created.

I want to check how many rows and columns are there in the data.

Code:     dim(Data_1)

Here is the answer in console :

[1] 150   5

It means 150 rows and 5 columns.
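If you need the two numbers separately, base R also provides nrow() and ncol(), and str() gives a compact overview of the column types. A small sketch, re-creating Data_1 so that it runs on its own:

```r
Data_1 = iris

n_rows = nrow(Data_1)   # number of rows (observations)
n_cols = ncol(Data_1)   # number of columns (variables)

n_rows                  # 150
n_cols                  # 5

str(Data_1)             # prints each column with its type and first few values
```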

I want to look at a sample of the data.

Code :  

head(Data_1)     # shows the first 6 observations of the data
tail(Data_1)     # shows the last 6 observations of the data

and you can see the sample.
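Six rows is only the default. Both functions accept an n argument if you want a different number of observations. A short sketch (the names first_three and last_three are just illustrative):

```r
Data_1 = iris

first_three = head(Data_1, n = 3)   # first 3 observations
last_three  = tail(Data_1, n = 3)   # last 3 observations

first_three
last_three
```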

I just want the column names of the data.

Code :     names(Data_1)

Here is the answer in console :

[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"


I want to see specific rows/columns of the data.

Code :     Data_1[c(2,3,4),]
Will show the 2nd, 3rd and 4th observations with all columns.

Code :     Data_1[,c(1,3)]
Will show all rows of the 1st and 3rd columns.

Code :     Data_1[c(2,3,4), 1:3]
Will show the 2nd, 3rd and 4th observations of the 1st to 3rd columns.

Using the above syntax, we can perform vertical sub-setting on a dataset, i.e. keeping or dropping variables.
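One thing to watch with this syntax: when you select a single column, R simplifies the result to a plain vector by default. Adding drop = FALSE keeps it as a data frame. A minimal sketch (variable names are illustrative):

```r
Data_1 = iris

one_col_vector = Data_1[, 1]                 # simplified to a numeric vector
one_col_frame  = Data_1[, 1, drop = FALSE]   # stays a data frame with one column

is.data.frame(one_col_vector)   # FALSE
is.data.frame(one_col_frame)    # TRUE
```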

Keeping variables :

Suppose you want to keep the 2nd to 4th variables:
Data_2 = Data_1[  ,   2:4]

You can also keep columns by specifying their names:
Data_3 = Data_1[  , c("Sepal.Width", "Petal.Length","Petal.Width")]

The same can be written indirectly, by first storing the names in a vector:
to_keep = c("Sepal.Width", "Petal.Length","Petal.Width")
Data_3 = Data_1[  , to_keep]

Doesn't it look like a macro variable in SAS?
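Because the columns to keep are just a character vector, that vector can also be built programmatically. As one possible sketch, base R's grep() with value = TRUE picks out every column name starting with "Petal" (the names petal_cols and Data_petal are made up for this example):

```r
Data_1 = iris

# Collect the column names that start with "Petal"
petal_cols = grep("^Petal", names(Data_1), value = TRUE)
petal_cols   # "Petal.Length" "Petal.Width"

# Keep only those columns
Data_petal = Data_1[, petal_cols]
```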

Dropping variables :

Suppose you want to drop the 1st to 3rd variables:
Data_4 = Data_1[  ,  -( 1:3)]

You can also drop columns by specifying their names:
to_drop = names(Data_1)%in%c("Sepal.Width","Petal.Width")
Data_5 = Data_1[!to_drop]
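An equivalent way to drop by name, sketched here with base R's setdiff(), is to compute the names you want to keep and select those. The names drop_names, keep_names and Data_5b are only illustrative:

```r
Data_1 = iris

drop_names = c("Sepal.Width", "Petal.Width")

# All column names except the ones to drop, in their original order
keep_names = setdiff(names(Data_1), drop_names)

Data_5b = Data_1[, keep_names]
names(Data_5b)   # "Sepal.Length" "Petal.Length" "Species"
```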

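Clear everything except Data_1 with the line below. (Plain rm(list = ls()) would remove Data_1 as well, and we still need it.)
rm(list = setdiff(ls(), "Data_1"))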

All right ... Let's practice logical sub-setting now.

Create a dataset from Data_1 with the observations pertaining to the species setosa only:

setosa = Data_1[ Species == "setosa" , ]

OMG! It gives an error.

Error in `[.data.frame`(Data_1, Species == "setosa", ) : 
object 'Species' not found

Basically, R cannot find the vector Species, because it is a column inside Data_1 and not an object in the workspace. There are two ways to let R know where to look:

1. We specifically tell R that "Species" is within the data "Data_1" using $.

setosa = Data_1[ Data_1$Species == "setosa" , ]

2. We first attach the data "Data_1" to R's search path using attach(), so that its columns can be referred to directly by name.

attach(Data_1)
setosa = Data_1[Species == "setosa" , ]

Now it would work fine.
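A third option, shown here only as a sketch, is base R's with(). It evaluates one expression with the columns of Data_1 visible and leaves the rest of the session untouched (setosa_w is an illustrative name):

```r
Data_1 = iris

# Inside with(), Species refers to Data_1$Species
setosa_w = with(Data_1, Data_1[Species == "setosa", ])

nrow(setosa_w)   # 50
```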

Create a data from Data_1 for Species = setosa and petal length >= 1.4

setosa_1 = Data_1[which(Species == "setosa" & Petal.Length >= 1.4),]

& is used for AND logical operations.
|   is used for OR logical operations.

Create a dataset from Data_1 for Species = setosa or virginica

setosa_virginica = Data_1[which(Species == "setosa" | Species == "virginica") ,]

** Using which() is not mandatory here; the logical condition on its own also works as a row index.
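Where which() does make a difference is with missing values. The iris data has none, so the sketch below uses a tiny made-up data frame x (hypothetical, for illustration only) to show that a plain logical index returns an all-NA row for each NA in the condition, while which() silently skips it:

```r
# Hypothetical toy data with one missing score
x = data.frame(id = 1:4, score = c(10, NA, 30, 5))

with_na    = x[x$score > 8, ]          # condition is TRUE, NA, TRUE, FALSE
without_na = x[which(x$score > 8), ]   # which() keeps only the TRUE positions

nrow(with_na)      # 3 (includes a row of NAs)
nrow(without_na)   # 2
```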

The logical sub-setting can also be done using "subset" function.

To get only the setosa data where petal length is greater than or equal to 1.4:

setosa_only = subset(Data_1,(Species == "setosa" & Petal.Length >= 1.4))

To get the data for the setosa and virginica species, but only two columns:

setosa_virginica = subset(Data_1,(Species == "setosa" | Species == "virginica"), select =c("Species","Petal.Length"))
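When matching several values of the same column, %in% can replace a chain of | conditions, and select in subset() also accepts unquoted column names. A sketch under those assumptions (sv is an illustrative name). Note that Species is a factor, so the unused level "versicolor" remains until droplevels() is applied:

```r
Data_1 = iris

# Rows where Species is one of the listed values, keeping two columns
sv = subset(Data_1,
            Species %in% c("setosa", "virginica"),
            select = c(Species, Petal.Length))

dim(sv)   # 100 2

# Remove the factor level that no longer occurs in the data
sv$Species = droplevels(sv$Species)
levels(sv$Species)   # "setosa" "virginica"
```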

That's enough about sub-setting for now. Enjoy reading our other articles and stay tuned.

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.