A to Z about Merging in R

R Tutorial 5.0

Taking up one of my favorite topics : merging in R. In my experience in data science, merging is where analysts are most prone to making mistakes. So one needs to be quite watchful while merging two datasets; otherwise things might go haywire.

So practice a lot, because only practice can make a man (and a woman too) perfect.

# Please note, I will be using the words "join" and "merge" interchangeably in this article, as they are one and the same thing in the context of R (unlike SAS).

Starting with a very rudimentary example :


Suppose we have two datasets :
1. Data_Age, having each student's name and Age
2. Data_Class, having each student's name and Class

students = c("Rajat","Vinod","Aarya","Vertika","Shobhit")
Age = c(25,28,22,23,30)
Data_Age = data.frame(students,Age)

students = c("Aarya","Vertika","Shobhit","Rajat","Vinod")
Class = c(11,12,9,10,12)
Data_Class = data.frame(students,Class)

Let's now merge the two datasets on the matching key : students

Data_full= merge(Data_Age,Data_Class, by = "students")


Voila! It's done.
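Since merging is where mistakes creep in, it is worth checking the result before moving on. Below is a minimal sketch of such a check on the same two datasets; the particular checks (unique keys, unchanged row count, one spot-checked record) are my own suggestion and not the only way to do it.

```r
# Rebuild the two example datasets
students = c("Rajat","Vinod","Aarya","Vertika","Shobhit")
Age = c(25,28,22,23,30)
Data_Age = data.frame(students,Age)

students = c("Aarya","Vertika","Shobhit","Rajat","Vinod")
Class = c(11,12,9,10,12)
Data_Class = data.frame(students,Class)

Data_full = merge(Data_Age,Data_Class, by = "students")

# The key should be unique in each input; a duplicated key multiplies rows
stopifnot(anyDuplicated(Data_Age$students) == 0)
stopifnot(anyDuplicated(Data_Class$students) == 0)

# Every student is present in both inputs here, so no row should be lost or added
stopifnot(nrow(Data_full) == nrow(Data_Age))

# Spot-check one record: Rajat is 25 years old and in class 10
print(Data_full[Data_full$students == "Rajat", ])
```

If any stopifnot() line throws an error, look at the keys before trusting the merged data.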

Two key points :

1.  Unlike SAS datastep merging, sorting of the datasets is not required before merging in R.
2.  The resulting dataset is automatically sorted on the "by" variable in ascending order (this is the default behaviour, sort = TRUE).
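The two points above can be seen in a small sketch. The two inputs list the students in different orders, yet merge() matches them by name and returns the rows sorted on "students". Passing sort = FALSE switches the sorting off; the name Data_unsorted is only an illustrative one I have picked here.

```r
# The two inputs are deliberately in different row orders
students = c("Rajat","Vinod","Aarya","Vertika","Shobhit")
Age = c(25,28,22,23,30)
Data_Age = data.frame(students,Age)

students = c("Aarya","Vertika","Shobhit","Rajat","Vinod")
Class = c(11,12,9,10,12)
Data_Class = data.frame(students,Class)

# Default: sort = TRUE, so the result is ordered on the "by" variable
Data_full = merge(Data_Age,Data_Class, by = "students")
print(Data_full$students)

# sort = FALSE: same rows, but their order is not guaranteed
Data_unsorted = merge(Data_Age,Data_Class, by = "students", sort = FALSE)
print(Data_unsorted$students)
```

With the default setting the students come out as Aarya, Rajat, Shobhit, Vertika, Vinod. With sort = FALSE, do not rely on any particular row order.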

Also try the following code and see the result :

Data_full= merge(Data_Age,Data_Class,  by = NULL)

It results in the Cartesian product of the two datasets, i.e. every row of Data_Age paired with every row of Data_Class (5 x 5 = 25 rows). Such a join is also called a cross join.
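To see what the cross join actually looks like, here is a short sketch that builds it and inspects its shape. Data_cross is just an illustrative name of mine. Note that both inputs have a column called "students", so merge() tells the two apart with its default suffixes, ".x" and ".y".

```r
students = c("Rajat","Vinod","Aarya","Vertika","Shobhit")
Age = c(25,28,22,23,30)
Data_Age = data.frame(students,Age)

students = c("Aarya","Vertika","Shobhit","Rajat","Vinod")
Class = c(11,12,9,10,12)
Data_Class = data.frame(students,Class)

# by = NULL means there is no matching key: every row pairs with every row
Data_cross = merge(Data_Age,Data_Class, by = NULL)

# 5 rows x 5 rows = 25 rows
print(dim(Data_cross))

# The common column name is made unique with the .x and .y suffixes
print(names(Data_cross))
```

The result has 25 rows and 4 columns: students.x, Age, students.y and Class. On large datasets a cross join grows very quickly, so use it with care.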

Let's now see various types of joins >>>

You should also read : Few more things about merging in R