Proc SQL vs. Data Step in SAS - Part 2


Join (Proc SQL) vs. Merge (Data step)

Your wait is over. Let's deep dive a little into the background working algorithm of Join and Merge. This one is quite interesting, here efficiency as well as the result may vary ...

Please go through the first part of this else it would be like start watching a movie at interval.



In Data step merge, data is merged point to point.

Suppose we merge following two datasets, without even a common field, check how it merges:

Data ABC;
input name_1 $ marks_1 ;
cards;
john 10
jack 20
jim  30
;
Run;

Data PQR;
input name_2 $ marks_2 ;                       The result would be :
cards;
walker 100
daniel 200
beam  300
;
Run;

Data final;
Merge ABC PQR;
proc print;
Run;

You can see that it merged the data point to point, based on position of each observation, however if we give matching field, SAS would ask you to first sort data sets based on the matching field and then again would merge data point to point position basis. Now we know how the data step merge works.

Let's see now, how SQL join would behave in the first example.

Data ABC;
input name_1 $ marks_1 ;
cards;
john 10
jack 20                                                       The result would be :
jim  30
;
Run;

Data PQR;
input name_2 $ marks_2 ;              
cards;
walker 100
daniel 200
beam  300
;
Run;

Proc SQL;
select * from ABC , PQR;
Quit;

What's that ... yes.... that's basic difference between Data step and a SQL join.

In SQL join, SAS first makes a Cartesian product (all possible combinations) as it made in above case. it then subsets this data based on the matching field condition given ... which was not present in the above code.

As in Proc SQL join, first SAS makes the Cartesian product and then subsets, suppose your datasets have 1 million (1000000) observations each. It would first make a data having :

1000000 X 1000000 = 1000000000000 observations

and then it would subset the data on the basis of matching field condition given. It is very time consuming and very much more memory consuming, which sometimes becomes a showstopper. This is the point number 1.

Let's see one more case by example.
*_____________________________________________________________________;
Data name_class;
input name : $10. Class $;
cards;
Ram 10
Shyam 12
Sita 10
Geeta 12
;
Run;


Data name_Age;
input name : $10.  Age $;
cards;
Ram 20
Shyam 22
Balram 25
Sita 18
Geeta 18
;
Run;

proc sort data = name_class;  by name; run;
proc sort data = name_age ;  by name; run;
Data Merged;
merge name_class name_age;
by name;
proc print;
Run;

Proc SQL;
create table joined as select * from name_class as a full join name_age as b
on a.name = b.name;
quit;
proc print data = joined;run;
*_____________________________________________________________________;


Could you notice the contrast, the extra element in the second data appeared in the data, but the name was missing. So we have to be extra careful while using Proc SQL join. This is point number two.

Better Join code should be written this way :
*_____________________________________________________________________;
Proc SQL;
create table joined as select coalesce(a.name, b.name) as name, class, age from name_class as a full join name_age as b
on a.name = b.name;
quit;
proc print data = joined;run;
*_____________________________________________________________________;

Enjoy reading our other articles and stay tuned with ...

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.