Guest blog post by Gaurav Agrawal, COO at Soothsayer Analytics.
In today’s world of big data and the internet of things, it is common for a business to find itself sitting atop a mountain of data. Possessing it is one thing, but leveraging it for data driven decision making is a much different ball game. Gut-feelings and institutionalized heuristics have traditionally been used to guide development of protocol and decision making, but the world of artificial intelligence and big disparate data is changing that.
Everyone is trying to make sense of, and extract value from, their data. Those that are not will be left behind. This challenge (and opportunity) is not limited to certain industries. For instance, most companies are exploring how they can use data to make better marketing decisions, most retailers are using data to optimize their supply chains, and most manufacturers are using data for quality control of final products.
Almost all business problems (with surrounding data) can be broken down into two categories: supervised and unsupervised learning. Take for example facial recognition software. One method of recognizing faces is to train a program based off of a data set of pictures and associated tags. Tags may include “Face”, “Face, male”, or anything else. These tags allow algorithms to identify and learn what a face looks like and differentiate between male and female faces, or more granular subtleties if desired. This task can be reformulated as an unsupervised learning problem. The difference being tags in the supervised learning example are no longer present. Rather, the algorithm has to learn how to identify faces on its own. Technically speaking, the algorithm will not be able to identify faces as faces, but rather as sets of objects/images distinct from other objects/images. It is up to the user to tell the computer that those are faces it has identified. Google is an interesting example of how unsupervised learning was used to identify cats in YouTube videos (look here, or for a more technical treatment here).
Clustering (or segmentation) is a commonly encountered form of unsupervised learning in business. This involves grouping different data points (customers, products, movies, etc.) into clusters. Ideally each element contained in a cluster is similar to every other element in that cluster while being as different as possible from elements in other clusters. The goal of clustering is to minimize the difference between items in a cluster and maximize the difference between separate clusters.
Why is the ability to cluster well so important?
Clustering provides businesses the ability to achieve better results for initiatives and understand customers and processes at a much deeper level than a human can achieve alone. If you are a marketer, you may be interested in developing target marketing strategies. Before you do this, you must know who to market to. This can be accomplished through the grouping of customers based off of similar attributes of existing customers. This is a problem where clusters are determined by attributes used to define a customer: age, payment history, purchase history, etc.
Suppose you are a publishing firm and want to decide how to sell new books or determine how to reprice or market old books. Books can be grouped together using a clustering scheme based around the attributes of the books. These may include length, subject matter, reoccurring groups of words, etc. Clustering even pops up in insurance, city-planning, and identifying land usage. These can express themselves in identifying groups of insurance policy holders that have a higher than average claim cost, identifying groups of houses based around location, type and value, or identification of parcels of land around usage.
It is important that the original purpose of clustering is met in all of these examples: minimize differences between elements in a cluster while maximizing differences between clusters. Data complexity and algorithms in use today can make this a nontrivial problem. Basic algorithms often do not achieve desired results, so something more is needed. Below we will walk you through some common methods used for clustering and articulate the power that Soothsayer brings to the table.
Hierarchical clustering and k-means clustering are the two most basic and widely used methods of clustering.
Hierarchical clustering is based around organizing data points into a set of similar clusters, then recursively grouping clusters together until you are left with a single cluster. In essence, this algorithm assigns a hierarchy to data points. The benefit of this method is that it allows users to select the number of clusters they want and see the relationship between each cluster.
One major drawback of this method is the time it takes to run. Because the algorithm has to run through every data point and compare groups of data points to other groups of data points, the run time increases dramatically.
Example of Hierarchical Clustering
In k-means clustering, the number of desired clusters (k), are selected by the user. This algorithm starts by randomly selecting k points in the data space as the means of the k clusters. Each data point is assigned to a cluster with the closest mean, and then means are recalculated as the center of each cluster. Once this is done, these steps are repeated until there are no more changes in the means.
K-means clustering algorithm for k=2. Left: Beginning of algorithm, Right: End of algorithm
A similar method, k-medoids, uses data points themselves as cluster centers. Usually the algorithm progresses by randomly assigning data points as centroids, followed by assigning data points to the appropriate clusters. From there, centroids are randomly moved to a new point and kept if the sum of the distances to points in that cluster decreases.
Both of these methods allow the user to assign the number of clusters, k, to be found. Deciding upon the optimal number of clusters is done either by intuition or by analyzing variance between clusters and within clusters. While these methods work better for large numbers of data points than hierarchical clustering, they may fail when data is not separated well geometrically or if the data isn’t globular (making nice visual globules when plotted)
These methods often involve significant tinkering and patience, and they are often not enough for complex real world data sets. But there are other options.
Spectral Clustering – A More Advanced Approach
The aforementioned approaches are commonly used but limited in the complexity of data that they can cluster. Consider the data points below.
Spiral data sets are often used as benchmarks for clustering techniques, so one will be used here to illustrate differences. Clearly we have globular data in the set, however k-means is not able to discriminate between the three different arms of the spiral.
There exists a solution to this problem in the form of Spectral Clustering.
Spectral clustering has the advantage of functioning well on data that has high connectivity or similarity defined by the user. The basics of how it functions revolve around making what are called, similarity and degree matrices – a matrix describing how similar the data is to one another (geometric, nearest neighbor, etc.) and a matrix that is diagonal that measures the degree at each point. Think of this as the sum of the entries in the similarity matrix for each point, excluding that point. These two matrices define what is called a Laplacian matrix from which eigenvalues and eigenvectors are extracted. Think of these as values and vectors that describe the dimensionality of the matrix. The power of this algorithm comes from using the two lowest or highest (depending on how you organize things) eigenvectors to define a new space to describe your data in. This allows you to perform clustering, such as k-means, in this new space and separate out points into more refined groups to cluster. Think of this as redefining the way data is represented so that similar data points stick out more easily for more basic algorithms to detect. Let us reexamine our 2-d spiral data set using this algorithm.
Since the data points in each arm are closer to each other, they are nicely separated into the same clusters.
In the real world, data is often not easy to separate, and patterns are not usually obvious. Spectral clustering provides the needed firepower to find well-concealed patterns and cleanly segment data.
Traditional methods, such as k-means, use distances between data points to group items into sets. Spectral clustering uses features extracted from data to increase ease and accuracy of clustering. This allows for more accurate grouping and better accounting for outliers and hard to classify data.
In modern business, any improvement in accuracy or understanding can mean the difference between success and failure. It is important for businesses to have the right ammunition to tackle ever increasing complexity and to get the most out of their data.
About the Author:
Gaurav Agrawal is COO at Soothsayer Analytics. He has been involved with Business Development & Project Management activities for over 20 years, and has experience in independently handling the management of the business. Gaurav has vast experience as a software consultant in analysis, design, development, implementation and deployment of enterprise business applications for Fortune 500 clients, having worked for clients in United States of America, United Kingdom, Germany & Asia-Pacific. My responsibilities include defining and executing the company’s long-term strategy focused on growth, profitability, customer leadership and operational excellence.