Proc SQL vs. Data Step in SAS - Part 1

How Data Step in SAS and Proc SQL differ and which one should I use and where ?

Is that even a big question that one should bother about ? Leaving syntax, is there even a difference between two? Probably, you must be thinking about these points, Right ?

Of course, you need to ask these question ... so let me explain the key differences between the two and suggest you the best practice.
In my career, I have met several SAS coders. If I start classifying them into categories, I can classify them in many ways :

1. Mr. A : Simple coder    &   Mr. B :  Complicated coder ( I hate B types )
2. Mr. X : Data step freak &   Mr. Y  : Proc SQL freak   ( I am an X type)

Talking of first classification, for any task Mr. A writes code in pieces, however Mr. B writes it in a single piece  ... Will show you guys example some other time. For now let's focus on second classification.

Talking of the second classification, i.e. Mr. X and Mr. Y, these guys do the same things but Mr. X uses Data step for almost everything, unless he is rolling up the data. Mr. Y, on the on the other hand, writes almost everything using Proc SQL.

... per say ...

Mr. X write uses Keep a b  in data step while Mr.Y uses select a, b statement in Proc SQL.

For a derived variable or flag creation, Mr. X writes uses If ... Else if statement, Mr. Y would do the same thing with case...when statements in Proc SQL.

Mr. X write uses merge  in data step while Mr.Y uses join statement in Proc SQL.

so on ...

So what ? Principally there is no issue as such, both are Okay .... it is a matter of coding practice. There is no problem in being Mr. X or Mr. Y, mostly difference comes in terms of efficiency, which is very sometimes, especially in BFSI industry where they deal with large datasets. Don't believe me ... believe your own eyes, check the following examples :


1.  If ... else if vs. case when statements

*_____________________________________________________________________;
Data check ;
do i = 1 to 10000000;
output;
end;
Run;

Data check_1;
set check ;
if mod(i,2) = 0 then flag = "Even";
else flag = "Odd";
Run;


Proc SQL;
create table check_2 as 
select i , (case when mod(i,2) = 0 then "Even" else "odd" end) as Flag
from check ;
Quit;

*_____________________________________________________________________;

Try the above code and check the real and CPU time in log, and also try to time it yourself. Check how much time it took SAS to process the codes. There is definitely a slight difference, Data step has proved itself to be more efficient.


Increase one more 0 in the do loop and now see the difference.This gap widens for sure.


2.  Join (Proc SQL) vs. Merge (Data step)

This one is quite interesting, here efficiency as well as the result may vary ...

How ?  .............. wait don't get impatient ... for this first we need to deep dive a little into the background working algorithm of Join and Merge.

---------------------------------- Click here for the next part---------------------------------------

No comments:

Post a Comment

Do provide us your feedback, it would help us serve your better.