Use the correct "Bullet" while sorting a very large dataset

Options available with Proc Sort in SAS

Case 1 : Assume that we need to sort a very large dataset, say 100 GB, and during sorting SAS gives the errors "not enough disk space" and "I/O error". These errors mean that your disk space is not sufficient for the sort.

Case 2 : You are sorting a very large dataset and you have sufficient disk space, but you want to expedite the processing.

SAS provides you bullets (options) for all your needs...


Case 1 : 

When we sort a dataset in SAS, SAS first creates an intermediate temporary file and populates it with the data in sorted order. Once this intermediate file is complete, SAS writes the data to the final output dataset.

Imagine the dataset is 100 GB (easily possible in the BFSI domain, even with the compress = Yes option). The sort would then consume roughly 100 GB for the intermediate file and 100 GB for the final file.
So at least 200 GB should be free in your working folder location.
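
Before running the sort, it can help to check how large the dataset actually is. The following is a minimal sketch, assuming the dataset sits in the WORK library under the placeholder name dataset_name and that your SAS release exposes the filesize column in dictionary.tables. The factor of 2 is simply the rule of thumb described above, not an exact requirement.

```sas
/* Sketch: report the dataset size and the free space suggested by the  */
/* 2x rule of thumb. WORK and DATASET_NAME are placeholders; libname    */
/* and memname values must be given in upper case.                      */
proc sql;
  select memname,
         filesize / (1024**3)  as size_gb            format = 10.2,
         2 * calculated size_gb as free_gb_suggested format = 10.2
  from dictionary.tables
  where libname = 'WORK'
    and memname = 'DATASET_NAME';
quit;
```

Compare free_gb_suggested with the free space on the disk that holds your working folder before starting the sort.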

When we use the tagsort option in Proc sort, SAS stores only the by variables and the observation ID (the observation number) in the intermediate file, so that file stays small when the by variables are short relative to the full record. Once the intermediate file is complete, the data flows to the final output dataset. During this finalization step, the records of the initial dataset are looked up with the help of the observation ID and written to the final dataset in sorted order.



How to code:

*_____________________________________________________________________;

Proc sort data = dataset_name out = final_output_data tagsort; by x y ; Run;

*_____________________________________________________________________;


Disclaimer : The tagsort option reduces disk space consumption only, not the time taken. In fact, the sort usually takes longer with tagsort, sometimes considerably longer.


Case 2: 

If you have enough disk space but want to expedite the processing, you can use the Threads option.

The Threads option makes SAS use multiple CPUs simultaneously by splitting the sort into parts and executing them in parallel threads. It takes less real (elapsed) time, which is what you care about, but more total CPU time, which is usually not your concern.

*_____________________________________________________________________;

Proc sort data = dataset_name out = final_output_data Threads; by x y ; Run;

*_____________________________________________________________________;
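
To see the difference between real time and CPU time in your own log, you can switch on detailed timing before the sort. This is a sketch rather than a recommended configuration: the fullstimer, threads and cpucount= system options are standard SAS options, but the value 4 is only an assumption and should be adjusted to the CPUs available on your machine.

```sas
/* Sketch: detailed timing plus an explicit CPU count for threaded sorts. */
/* cpucount = 4 is an assumed value, not a recommendation.                */
options fullstimer threads cpucount = 4;

proc sort data = dataset_name out = final_output_data threads;
  by x y;
run;
```

With fullstimer on, the log for the step lists real time and CPU time separately, so you can compare a run with Threads against a run without it.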

Important note :

The Threads and Tagsort options should not be used together, as the TAGSORT option prevents multi-threaded processing.
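
If you want the code itself to make this choice visible, one option is to state nothreads explicitly whenever you use tagsort. This is only a sketch using the same placeholder names as above. The nothreads keyword is a standard Proc sort option, and adding it here is a readability choice, not a requirement.

```sas
/* Sketch: disk-space-saving sort, explicitly single-threaded. */
proc sort data = dataset_name out = final_output_data tagsort nothreads;
  by x y;
run;
```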

