Ask Analytics: Data exploration and Sub-setting in R

R Tutorial 3.0

In this series of R tutorials, we have so far covered the basics of R and the importing and exporting of data. Now that the data resides in the R environment, let's do something with it. Shall we explore a few functions for exploring and sub-setting a dataset?

Let's use one of R's built-in datasets for this tutorial.

Create a new dataset "Data_1" as a copy of the "iris" data, using this simple R syntax:

Data_1 = iris

Voila! The data is created.

I want to check how many rows and columns there are in the data.

Code: dim(Data_1)

Here is the answer in the console:

[1] 150 5

It means 150 rows and 5 columns.
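If you need the two numbers separately, base R also has nrow() and ncol(), and str() gives a compact overview of every column. A small sketch, re-creating Data_1 so that it runs on its own:

```r
# Re-create the working copy so this snippet is self-contained
Data_1 = iris

n_rows = nrow(Data_1)   # number of rows only
n_cols = ncol(Data_1)   # number of columns only
n_rows
n_cols

# Compact summary: type and first few values of each column
str(Data_1)
```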

I want to check a sample of the data.

Code :

head(Data_1) # will show top 6 observations of data
or
tail(Data_1) # will show bottom 6 observations of data

and you can see the sample.
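Six rows is only the default. Both functions accept an n argument, so a sketch like the following (the names top_3 and bottom_3 are just for illustration) shows a smaller sample:

```r
# Re-create the working copy so this snippet is self-contained
Data_1 = iris

top_3 = head(Data_1, n = 3)      # first 3 observations
bottom_3 = tail(Data_1, n = 3)   # last 3 observations

top_3
bottom_3
```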

I just want the column names of the data.

Code : names(Data_1)

Here is the answer in the console:

[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

I want to see specific rows and columns of the data.

Code : Data_1[c(2,3,4),]
Will show 2nd, 3rd and 4th observation and all columns

Code : Data_1[,c(1,3)]
Will show all rows of 1st and 3rd column

Code : Data_1[c(2,3,4), 1:3]
Will show 2nd, 3rd and 4th observation of 1st to 3rd columns.
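One thing to watch with this syntax: when you pick a single column, R returns a plain vector rather than a data frame, unless you add drop = FALSE. A minimal sketch (the object names are made up for illustration):

```r
# Re-create the working copy so this snippet is self-contained
Data_1 = iris

# Selecting one column simplifies the result to a vector
one_col_vector = Data_1[ , 1]

# drop = FALSE keeps the result as a one-column data frame
one_col_df = Data_1[ , 1, drop = FALSE]

class(one_col_vector)
class(one_col_df)
```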

Using the above syntax, we can perform vertical sub-setting on a dataset, i.e. keeping or dropping variables.

Keeping variables :

Suppose you want to keep 2nd to 4th variables
Data_2 = Data_1[ , 2:4]

You can also keep columns by specifying their names:
Data_3 = Data_1[ , c("Sepal.Width", "Petal.Length","Petal.Width")]

The same can be written indirectly, by first storing the names in a vector:
to_keep = c("Sepal.Width", "Petal.Length","Petal.Width")
Data_3 = Data_1[ , to_keep]

Doesn't it look like a SAS macro?

Dropping variables :

Suppose you want to drop 1st to 3rd variables
Data_4 = Data_1[ , -( 1:3)]

You can also drop columns by specifying their names:
to_drop = names(Data_1)%in%c("Sepal.Width","Petal.Width")
Data_5 = Data_1[!to_drop]
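Negative indices only work with column numbers, not with names, which is why the logical vector is needed above. An alternative sketch of the same drop uses setdiff() to build the list of names to keep:

```r
# Re-create the working copy so this snippet is self-contained
Data_1 = iris

to_drop = c("Sepal.Width", "Petal.Width")

# Keep every column whose name is not in to_drop
Data_5 = Data_1[ , setdiff(names(Data_1), to_drop)]

names(Data_5)
```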

Clear everything with
rm(list = ls())

This removes Data_1 as well, so re-create it before continuing:
Data_1 = iris

All right ... Let's practice logical sub-setting now.

Create a dataset from Data_1 with the observations pertaining to species setosa only

setosa = Data_1[ Species == "setosa" , ]

OMG! It is giving an error.

Error in `[.data.frame`(Data_1, Species == "setosa", ) :
object 'Species' not found

Basically, R is not able to find the vector Species, because it is a column inside Data_1 and not an object in the workspace. To let R know where it is, there are two ways:

1. We specifically tell R that "Species" is within the data "Data_1" using $.

setosa = Data_1[ Data_1$Species == "setosa" , ]

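2. We first attach the data "Data_1" to R's search path, so that its columns can be referred to directly by name. Call detach(Data_1) when you are done, because attached columns can mask other objects with the same name.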

attach(Data_1)

setosa = Data_1[Species == "setosa" , ]

Now it would work fine.
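A third option, not used in the rest of this tutorial, is with(), which evaluates an expression inside the data frame without attaching anything. A minimal sketch:

```r
# Re-create the working copy so this snippet is self-contained
Data_1 = iris

# Species is looked up inside Data_1 only for this one expression
setosa = with(Data_1, Data_1[Species == "setosa", ])

nrow(setosa)
```

Because nothing is attached, there is nothing to detach afterwards.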

Create a dataset from Data_1 for Species = setosa and petal length >= 1.4

Data_1 is already attached, so there is no need to call attach() again.

setosa_1 = Data_1[which(Species == "setosa" & Petal.Length >= 1.4),]

& is used for AND logical operations.

| is used for OR logical operations.

Create a dataset from Data_1 for Species = setosa or virginica

setosa_virginica = Data_1[which(Species == "setosa" | Species == "virginica") ,]

** Using which() is not mandatory here; the logical condition can be used directly as the row index.
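To see this for yourself, the sketch below builds the same subset with and without which() and compares the two. It also shows %in%, a shorter way to write several OR conditions on one column. The object names are only for illustration, and $ is used so that the snippet does not depend on attach():

```r
# Re-create the working copy so this snippet is self-contained
Data_1 = iris

keep = Data_1$Species == "setosa" | Data_1$Species == "virginica"

with_which = Data_1[which(keep), ]   # row positions where keep is TRUE
without_which = Data_1[keep, ]       # logical vector used directly

# Compare the two results
all.equal(with_which, without_which)

# Same selection written with %in%
using_in = Data_1[Data_1$Species %in% c("setosa", "virginica"), ]
nrow(using_in)
```

The two forms differ only when the condition contains NA values: which() silently skips them, while a bare logical index returns rows filled with NA.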

Logical sub-setting can also be done using the "subset" function. It looks up column names inside the data itself, so neither $ nor attach() is needed.

To get only Setosa data where Petal length is more than or equal to 1.4.

setosa_only = subset(Data_1,(Species == "setosa" & Petal.Length >= 1.4))

To get data for the setosa and virginica species, but only two columns:

setosa_virginica = subset(Data_1,(Species == "setosa" | Species == "virginica"), select =c("Species","Petal.Length"))
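One detail worth knowing: Species is a factor, and sub-setting keeps all of its original levels, so "versicolor" still shows up as an empty level. If that gets in the way, droplevels() removes unused levels. A small sketch (setosa_virginica_clean is a made-up name):

```r
# Re-create the working copy so this snippet is self-contained
Data_1 = iris

setosa_virginica = subset(Data_1,
                          Species == "setosa" | Species == "virginica",
                          select = c("Species", "Petal.Length"))

# The unused level "versicolor" is still listed
levels(setosa_virginica$Species)

# droplevels() removes factor levels that no longer occur in the data
setosa_virginica_clean = droplevels(setosa_virginica)
levels(setosa_virginica_clean$Species)
```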

That's enough to know about sub-setting for now. Enjoy reading our other articles and stay tuned with us.

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.
