Ask Analytics: Difference between Nodupkey and Nodup in Proc Sort?

What is the difference between the Nodupkey and Nodup options in Proc Sort?

SAS interviewers have been asking this question for ages ... and of all the people I have interviewed, none has given me a satisfactory answer.

What exactly should one reply ... not only to answer the question, but also to impress?

Well,

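Nodup (an alias for Noduprecs) removes contiguous duplicate records after the data has been sorted on the variable(s) listed in the by statement of Proc Sort. A record is dropped only when all of its variables match the record immediately before it in the sorted data.

Nodupkey, on the other hand, removes observations that are duplicates on the variable(s) listed in the by statement alone, whatever the values of the other variables. Basically the variable(s) in the by statement are treated as the "key", and Nodupkey keeps only the first observation for each key value.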

Confused? No worries ... let's understand this with an example:

Suppose we have the following dataset:

*--------------------------------------------------------------------;
Data Sample;
input name $ X Y Z;
cards;
A 1 2 3
A 4 5 6
B 1 2 3
A 1 2 3
;
Run;
*--------------------------------------------------------------------;

Let's use the Nodup option ...

*--------------------------------------------------------------------;
proc sort data = Sample nodup; by name; run;
*--------------------------------------------------------------------;

The first and last records in the data are complete duplicates of each other. When we run the above code, the data is simply sorted by name. The resultant data looks like:

name X Y Z
A    1 2 3
A    4 5 6
A    1 2 3
B    1 2 3

As you can see, the duplicate record (initially the 4th one) is now in the 3rd position and has not been removed from the dataset.

Why ... when we have already used "nodup"?

Because the duplicate records were not next to each other in the sorted data (the record A 4 5 6 sits between them), the duplicate was not removed. As stated above, "Nodup" removes only contiguous duplicate records.

Now try this one:
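One way to check this for yourself without overwriting Sample is to send the sorted result to a separate dataset and print it. This is only a minimal sketch: the output dataset name Sample_nodup is made up for illustration, and it assumes Proc Sort's default EQUALS behaviour, which keeps observations with the same by value in their original relative order.

```sas
* Hypothetical output dataset Sample_nodup, so that Sample itself stays untouched;
proc sort data = Sample out = Sample_nodup nodup;
    by name;
run;

* Print the result: all 4 observations are still there;
proc print data = Sample_nodup;
run;
```

Writing to a separate dataset with out= also makes it easy to compare the row counts before and after the sort.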

*--------------------------------------------------------------------;
proc sort data = Sample nodup; by name X; run;
*--------------------------------------------------------------------;

This time the duplicate record is removed, because sorting by name and X places the two duplicate records next to each other.

I think we are now clear about "Nodup" ...
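The same idea can be pushed one step further: if the data is sorted on every variable, exact duplicates always end up adjacent, so Nodup catches all of them. The following is a sketch under that assumption, using the Sample dataset from above; the output dataset name Sample_unique is just an illustrative choice, and _all_ is the SAS shorthand for all variables in the dataset.

```sas
* Sort on every variable so that identical records always sit together;
proc sort data = Sample out = Sample_unique nodup;
    by _all_;
run;

* Sample_unique should hold 3 observations: A 1 2 3, A 4 5 6 and B 1 2 3;
proc print data = Sample_unique;
run;
```

On a wide or large dataset, sorting on all variables can be expensive, so this is best treated as a convenience for small tables rather than a general recommendation.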