Big data not so big: Parallel thread processing in SAS

Data is often presumed to be big when it is not. The big data that traditional software cannot handle effectively is usually images, weblogs, Facebook feeds, tweets and other internet data in vast volumes. In real-world scenarios, servers with very large configurations are run in parallel to provide redundancy and to handle multiple requests against structured and semi-structured data. In this article I will talk briefly about using parallel thread processing in Base SAS to process datasets on the order of a billion rows. Along with parallel thread processing, hash joins, inner joins and views are also used where applicable to improve processing time. Also check out 10 ways to optimize SAS code.

The example configuration contains 5 servers connected in a parallel architecture, where SAS creates a node for each session initiated from either Base SAS or SAS EG. The data layer discussed below resides on the Scalable Performance Data Server (SPDS) architecture. For example, if I am running 5 data-intensive queries and data cleansing steps, I start with the signon command to initiate sessions (5 sessions opened on 5 individual server nodes). One caveat: SAS Grid is usually set up so that each session is posted to a receiving server node, and sessions submitted at the same time are distributed across the available nodes, which may or may not mean 5 different servers.

%macro start_sessions(count); /* count can be calculated automatically or fixed; here count=5 */
      %local i;
      %do i = 1 %to &count;
            /* asynchronous signon; the signon status of session i is stored in scaproc_signon_&i */
            signon sess&i signonwait=no connectwait=no cmacvar=scaproc_signon_&i;
      %end;
%mend start_sessions;

Because each signon starts a new session, the macro variables, data step variables and other temporary data of the previous session are not available in it. A sasrc file can be used to automatically assign system credentials and paths to some common macro variables, but we should use %syslput to explicitly pass a macro variable from the local session into the new session. The next step is to process a session as below.

rsubmit sess1 wait=no;
      %include "&app_path./xxx.sas"; /* this will contain the macro code */
      %call_specific_macro_here (x=value);
endrsubmit;
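The %syslput step mentioned above can be sketched roughly as follows. This is only an illustration: it assumes session sess1 is already signed on, and the value given to app_path is a hypothetical placeholder, not a path from the original setup.

```sas
/* Sketch only: the path below is a hypothetical placeholder. */
%let app_path = /path/to/app;   /* defined in the local (client) session */

/* Copy the local value into the remote session sess1. */
%syslput app_path=&app_path / remote=sess1;

rsubmit sess1 wait=yes;
   /* In open code this statement is executed by the remote session, */
   /* so &app_path resolves to the value copied by %syslput.         */
   %put NOTE: app_path on the remote session is &app_path;
endrsubmit;
```

Using wait=yes here simply makes the check synchronous so that the %put output appears in the log before the next statement runs.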

If there are ‘n’ sessions, we can generalize the code using the macro facility as below.

%macro submit(count);
      %local i;
      %do i = 1 %to &count;
            /* %syslput runs locally: copy the loop index into macro variable k on session i */
            %syslput k=&i / remote=sess&i;
            rsubmit sess&i wait=no;
                  /* inside a macro, & references are resolved locally before the text is sent */
                  %include "&app_path./xxxx.sas";
                  /* tt is passed to the macros in xxxx.sas, which split a|b|c|d|e into 5 values with specific functionality */
                  %call_specific_macro_here (tt=a|b|c|d|e, j=&i);
            endrsubmit;
      %end;
%mend submit;

waitfor _all_ sess1 sess2 sess3 sess4 sess5;

 

The ‘waitfor’ statement can be included in a macro as well if you are coding for ‘n’ sessions. Afterwards, all the sessions are closed.
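A minimal sketch of the macro form of waitfor, assuming the sessions are named sess1 to sess&count as in the macros above. The macro name wait_sessions is hypothetical and is not part of the original code.

```sas
/* Sketch only: wait_sessions is a hypothetical helper name. */
%macro wait_sessions(count);
   %local i tasks;
   /* Build the list "sess1 sess2 ... sess<count>" */
   %do i = 1 %to &count;
      %let tasks = &tasks sess&i;
   %end;
   /* Block until every listed asynchronous task has finished */
   waitfor _all_ &tasks;
%mend wait_sessions;

%wait_sessions(5);
```

With count=5 this generates the same statement as the hard-coded waitfor shown earlier.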

 

%macro scagrid_waitfors(count);
      %local i;
      %do i = 1 %to &count;
            /* close session i */
            signoff sess&i;
      %end;
%mend scagrid_waitfors;

When processing a large dataset, the number of rows can be determined and divided into chunks of, say, 1 million rows that are spread across the parallel nodes. The macro facility can be used to define the analysis process in advance and feed it the segmented data. For example, rows 1 to 1,000,000 run on node1, rows 1,000,001 to 2,000,000 run on node2, and so on. Note that with wait=no each rsubmit block returns control immediately, so it is the waitfor statement that waits until all the submitted work is processed. Later the analyst can decide whether to move all the data into one dataset using PROC APPEND (usually expensive) or to create a cluster on the processed datasets residing on the SPDS data layer. Notes on clustering and un-clustering can be found here
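The row-range split described above could be computed along these lines. This is a sketch under stated assumptions: the macro name split_ranges and its parameters nobs (total row count) and chunk (rows per node) are hypothetical, and the row count is assumed to be known already.

```sas
/* Sketch only: split_ranges, nobs and chunk are hypothetical names. */
%macro split_ranges(nobs, chunk);
   %local i first last nchunks;
   /* %eval uses integer division, so this rounds the chunk count up */
   %let nchunks = %eval((&nobs + &chunk - 1) / &chunk);
   %do i = 1 %to &nchunks;
      %let first = %eval((&i - 1) * &chunk + 1);
      %let last  = %eval(&i * &chunk);
      /* the final chunk may be shorter than the others */
      %if &last > &nobs %then %let last = &nobs;
      %put NOTE: chunk &i covers rows &first to &last;
      /* For a Base SAS dataset, each range could then be read with     */
      /* the data set options (firstobs=&first obs=&last), where obs=   */
      /* is the number of the last row to read, not a row count.        */
   %end;
%mend split_ranges;

%split_ranges(2500000, 1000000);
```

For 2,500,000 rows and a chunk size of 1,000,000, the log notes report three ranges: 1 to 1000000, 1000001 to 2000000, and 2000001 to 2500000. Whether firstobs= and obs= behave the same way on SPDS tables is not confirmed here and should be checked for the engine in use.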

By combining technical expertise in SAS, data mining and Hadoop technologies such as Pig, Hive, HBase and MapReduce, we can bring out analytical insights that include more determinants from claims, adjudication data, diagnosis codes, health sensor data, census data, weblogs, customer reviews and so on, to solve a specific predictive analytics problem in health care.

© 2019 Data Science Central ®