1.Objective
First of all we will see what is R Clustering, then we will see the Applications of Clustering, Clustering by Similarity Aggregation, use of R amap Package, Implementation of Hierarchical Clustering in R and examples of R clustering in various fields.
2. Introduction to Clustering in R
Clustering is a data segmentation technique that divides huge datasets into different groups on the basis of similarity in the data. It is a statistical operation of grouping objects. The resulting groups are clusters. Clusters have the following properties:
Clustering is one of the most widespread descriptive methods of data analysis and data mining. We use it when data volume is large to find homogeneous subsets that we can process and analyze in different ways.
For example, a food product manufacturing company can categorize its customers on the basis of purchased items and cost of those items.
Following are the main Clustering applications:
In different fields, clustering has different names, such as:
To define correct criteria for clustering and making use of efficient algorithms, the general formula is as follows:
Bn(number of partitions for n objects)>exp(n)
You can determine the complexity of clustering by the number of possible combinations of objects. The complexity of the cluster depends on the number of possible combinations of objects.
The basis for joining or separating objects is the distance between them. These distances are dissimilarity (when objects are far from each other) or similarity (when objects are close by).
To calculate the distance between the objects, we apply certain types of methods. Some of them are:
In general, for an ndimensional space, the distance is
The total sum of squares or inertia is the weighted mean of squares of the distances of each point from the center of gravity of that cluster. Thus we calculate it by adding the withincluster sum of squares with the betweencluster sum of squares.
e calculate it by adding the withincluster sum of squares with the betweencluster sum of squares.
We calculate Sum of Squares of Clusters on its center of gravity as below:
Total Sum of Squares (I) = BetweenCluster Sum of Squares (IR) + WithinCluster Sum of Squares (IA)
This formula is Huygen’s Formula and Dutch mathematician and a philosopher Christiaan Huygens gave it.
We calculate BetweenCluster Sum of Squares by finding the square of difference from the center of gravity for each cluster and then adding them. As it increases, the separation between clusters also increases indicating satisfactory clustering.
We calculate WithinCluster Sum of Squares by finding the square of difference from the center of gravity for each cluster and then adding them within in a single cluster. As it diminishes, clustering of the population becomes better.
R2 (RSQ) is the proportion of the sum of squares explained by the clusters (betweencluster sum of squares/total sum of squares). The nearer it is to 1, the better the clustering will be, but we should not aim to maximize it at all costs because this would result in the largest number of clusters: there would be one cluster per individual. So we need an R2 that is close to 1 but without too many clusters. A good rule is that, if the last significant rise in R2 occurs when we move from k to k + 1 clusters, the partition into k+1 clusters is correct.
Below are few of the properties of Efficient Clustering:
Note: In the calculation of the sum of squares, either IR is large or IA is small for correct clustering.
To distinguish true clusters in data, we often have to interpret the data before transforming, adding or excluding variables, and then restart the clustering. Excluding a variable does not necessarily mean deleting it from the analysis base. Instead, we stop to take it into account in the clustering operation, while retaining it as an inactive variable to observe the distribution of its categories in the various clusters. It is no longer an ‘active’ variable but becomes an ‘illustrative’ variable (also called ‘supplementary’ variable).
The Agglomerative hierarchical clustering (AHC) is a form of clustering that produces a sequence of nested partitions between partitions into n clusters. These nested partitions are of increasing heterogeneity. In this technique, we isolate each object and partition into 1 cluster which includes all the objects. We can use AHC if there is a distance between partitions which can be in either an individual space or a variable space. To categorize data, you must define the distance of 2 objects or clusters.
For this, the general form of the algorithm is as follows:
The tree generated by AHC is also known as dendogram. We can cut it to get clusters.
Usage of Hierarchical clustering is for identification of patterns in digital images, predicting stock exchange and in text mining and ontology. It also finds the use for research on protein sequence classification.
The working of AHC involves searching for the closest points at each step and merging them. We can define the distance in many ways but most usual definitions are as below:
i) Main Distances –
The minimum distance between the points of different clusters should be greater than the maximum distance between the points of the same cluster. These distances are the main distances.
ii) Density Estimation –
These are among the most suitable for detecting structure of complex clusters. Below are three typical methods for density estimation in clustering:
The 3 methods are effective for detecting all types of clusters irregularly shaped ones, which are of unequal sizes and have variances. The distance between 2 clusters is inversely proportional to the density in the middle of these 2 clusters.
Note: The density estimation methods specify a smoothing parameter which can be a number of clusters of kmeans, number k of neighbors of each point x or radius r of a sphere surrounding x.
This is also known as relational clustering or voting method or the Condorcet method.
Clustering by Similarity Aggregation finds use to compare all the individuals in pairs at each step to build a global clustering. It follows the principle of an equivalence relation, where iRj, if i and j are in the same cluster.
An equivalence relation has three properties, namely reflexivity, symmetry, and transitivity.
Reflexivity => Mii = 1
Symmetry => Mij = Mij
Transitivity => Mij + Mjk – Mik <=1
The clustering by similarity aggregation follows an intuitive approach. Here we assign a pair of individual values (A, B) to two vectors m(A, B) and d(A, B). m(A, B)contains the same values for both A and B, while d(A,B) contains different values for A and B.
The Condorcet criterion for two individuals A and B is as follows:
c(A, B) = m(A, B)d(A, B)
The Condorcet criterion of an individual A and a cluster S is as follows:
c(A,S) = Σic(A,Bi)
The summation is being over all the Bi ∈ S.
Given the preceding conditions, you start constructing clusters by placing each individual A in the cluster S for which c(A, S) is largest and at least 0.
Calculate Global Condorcet criterion by summing all individuals in A and clusters SA to which they have been assigned. In practice, 2 iterations will be enough to provide good results.
Note: Several iterations follow until we reach the specified largest number of iterations or the global Condorcet criterion no more improves.
For clustering by similarity aggregation, R provides the amap package. First, we load the amap package from the R library, after that, we use it for clustering.
Loading the amap Package
1

>library(amap) 
Performing Similarity Aggregation
1

> pop(matrix) 
Note: Only after transforming the data into factors and converting the values into whole numbers, we can apply similarity aggregation.
The kmeans is the most widely used method for customer segmentation of numerical data. This technique partitions n units into k ≤ n distinct clusters, S = {S1, S2, . . . , Sk }, to reduce the withincluster sum of squares. You can use kmeans function in R package stats. This algorithm is fast and reliable. But there is no guarantee that it converges to the global optimum. Its final result may depend on how is the starting of the algorithm.
For kmeans clustering, two common initialization approaches are as follows:
The kmeans clustering is the most common R clustering technique. Some of the applications of this technique are as follows:
Consider 25 European countries (n=25 units) and their protein intakes (in percent) from nine major food sources (p= 9).
Let us start with clustering on the first 2 features, protein intake from red and white meat, to cluster the 25 countries into 3 groups.
The following code applies the R clustering technique on the dataset:
a) Application of Clustering on European Protein Consumption Data
1
2
3

food < read.csv( "C:/DataMining/Data/protein.csv" ) //Reads data available in protein.csv file set.seed(1) // Fixes random starting for clusters from 1. grpMeat < kmeans(food[,c( "WhiteMeat" , "RedMeat" )], centers=3, + nstart=10) //Specifies clustering of data of red and white meat with 3 clusters. 
b) Application of Clustering on European Protein Consumption Data
1
2

o=order(grpMeat $cluster ) // Sorts the cluster on basis of number clustering data.frame(food $Country [o],grpMeat $cluster [o]) // Lists the clusters along with country name. 
Now you can use the following command to plot the scatter plot:
c) Plotting the Scatterplot for the Clusters
1

plot(food $Red , food $White , type= "n" , xlim=c(3,19), xlab = "Red Meat" ,+ ylab= "White Meat" ) 
In R, you can easily generate different numbers of clusters for the same data by assigning different values to the centers attribute in the kmeans() command.
This example analyzes monthly seasonally adjusted unemployment rates covering the period January 1976 through August 2010 for the 50 US states (n = 50).
Here you will use summary statistics for each state, such as average and standard deviation of employment rates and then use these 2 calculated features of monthly unemployment rates as attributes of clustering.
You can read unemp.csv and set seed for the cluster as done earlier.
Apply kmeans Clustering in R by below command:
1

grpunemp < kmeans(unemp[,c( "mean" , "stddev" )], centers=3, + nstart=10) 
Sort the clusters by below way:
1

o=order(grpunemp $cluster ) 
You can now plot the clusters on the graph by using below command:
1
2
3

plot(unemp $mean ,unemp $stddev ,type= "n" ,xlab= "mean" , + ylab= "stddev" ) text(x=unemp $mean ,y=unemp $stddev ,labels=unemp $state ,+ col=grpunemp $cluster +1) 
Hierarchical clustering is an approach of clustering n units wherefore each described by p features into a smaller number of groups. Furthermore, It creates a hierarchy of clusters that we can represent in a treelike diagram, called a dendrogram.
AHC uses a bottomup approach where each unit starts in its own cluster and merge pairs of clusters as you move up in the hierarchy. However, For conglomerative clustering, you can use agnes() function in R package or hclust() function in stats package.
Let us consider the example of European protein consumption in the context of hierarchical clustering.
Applying Hierarchical Clustering
1

foodagg=agnes(food,diss=FALSE,metric= "euclidian" ) 
Here argument dist=FALSE indicates that you have used dissimilarity matrix that is being calculated from raw data.
Argument metric=”euclidian” indicates that Euclidian distance is being used.
You can generate dendrogram as below:
1

plot(foodagg) 
A distance dP between two clusters is inversely proportional to the density in the middle of these two clusters. Here the assumption is that dP = ∞ if the two clusters are not adjacent.
© 2020 Data Science Central ® Powered by
Badges  Report an Issue  Privacy Policy  Terms of Service
Upcoming DSC Webinar
Most Popular Content on DSC
To not miss this type of content in the future, subscribe to our newsletter.
Other popular resources
Archives: 20082014  20152016  20172019  Book 1  Book 2  More
Upcoming DSC Webinar
Most popular articles
You need to be a member of Data Science Central to add comments!
Join Data Science Central