Ward's method treats cluster analysis as an analysis of variance problem. It is an agglomerative clustering algorithm: it starts with n clusters of size 1 and keeps merging clusters until all the observations are in one cluster. It is most appropriate for quantitative variables, not binary variables.
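To make the analysis of variance view concrete: at each step, the method merges the pair of clusters whose union gives the smallest increase in the within-cluster sum of squares. In the standard textbook formulation (the symbols are introduced here for illustration and do not appear in the original post), for clusters A and B with sizes n_A and n_B and centroids x̄_A and x̄_B, the merge cost is:

```latex
\Delta(A,B)
  = \sum_{i \in A \cup B} \lVert x_i - \bar{x}_{A \cup B} \rVert^2
  - \sum_{i \in A} \lVert x_i - \bar{x}_A \rVert^2
  - \sum_{i \in B} \lVert x_i - \bar{x}_B \rVert^2
  = \frac{n_A \, n_B}{n_A + n_B} \, \lVert \bar{x}_A - \bar{x}_B \rVert^2
```

Because the cost depends on squared Euclidean distances between centroids, variables on larger scales dominate it, which is why the data are standardized first.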

1. Standardize the data. Because the method is based on Euclidean distance, all the risk factors need to be on the same scale.

proc standard data=mydata mean=0 std=1 out=mydata1;

var x1 x2 ... xn;

run;
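With mean=0 and std=1, the step above amounts to the usual z-score transformation. As a sketch in notation that is not part of the original post, where x̄_j and s_j are the sample mean and standard deviation of variable j:

```latex
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}
```

After this transformation every variable has mean 0 and standard deviation 1, so each contributes on the same scale to the Euclidean distance.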

2. Determine the number of clusters from the CCC plot.

ods graphics on;

proc cluster data=mydata1 outtree=determineK method=ward ccc pseudo

plots=den;

var x1 x2 ... xn;

run;

3. **For big data:** pick the turning point on the cubic clustering criterion (CCC) plot to determine K, and then pass K to the FASTCLUS procedure. **For small data:** you can determine the number of clusters by using sqrt(n/2), where n is the sample size.

proc fastclus data=mydata1 out=temp1 radius=0 replace=full maxclusters=K maxiter=60 mean=temp2 list distance;

var x1 x2 ... xn;

id personID;

run;

proc cluster method=ward outtree=tree plots=den (height=rsq);

run;
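The sqrt(n/2) rule of thumb can be computed instead of worked out by hand. A minimal sketch, assuming the standardized data set is `mydata1` as in step 1; the macro variable name `K` is a hypothetical choice, and the `trimmed` keyword assumes SAS 9.3 or later:

```sas
/* Count the rows of mydata1 and store round(sqrt(n/2)) in macro variable K */
proc sql noprint;
  select round(sqrt(count(*) / 2)) into :K trimmed
  from mydata1;
quit;

/* Write the value to the log as a sanity check; for example, n=5000 gives 50 */
%put NOTE: rule-of-thumb K = &K;
```

You could then write `maxclusters=&K` in the proc fastclus call rather than typing the number in.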

4. By-product: if you want not only to classify the observations into several groups but also to identify outlier clusters, you can set a threshold for the outlier clusters, for example a cluster size smaller than n*0.1%.
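One way to apply such a threshold is sketched below. It assumes the `temp1` data set written by proc fastclus in step 3, which contains a CLUSTER variable; the data set names `cluster_sizes` and `outlier_clusters` are hypothetical, and 0.1% is just the example threshold from the text:

```sas
/* Count the members of each cluster; the OUT= data set holds COUNT and PERCENT */
proc freq data=temp1 noprint;
  tables cluster / out=cluster_sizes;
run;

/* Keep clusters holding less than 0.1% of the observations.
   PERCENT is on a 0-100 scale, so 0.1 here means 0.1%. */
data outlier_clusters;
  set cluster_sizes;
  if percent < 0.1;
run;

proc print data=outlier_clusters;
run;
```

The threshold is a judgment call; with a small sample, 0.1% of n can be less than one observation, in which case no cluster would be flagged.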
