In this two-part series, we will explore text clustering and how to use it to get insights from unstructured data. The first part focuses on the motivation; the second covers the implementation.
This post is the first part of the series. We will build the solution in a modular way so that it can be applied to any dataset. We will also expose its functionality as an API, so that it can serve as a plug-and-play model without disrupting existing systems.
- Text Clustering: How to get quick insights from Unstructured Data – Part 1: The Motivation
- Text Clustering: How to get quick insights from Unstructured Data – Part 2: The Implementation
If you are in a hurry, you can find the full code for the project on my GitHub page.
Here is a sneak peek at what the final output will look like:
It is established beyond reasonable doubt that data is the new oil. Organizations across the globe are aggressively building in-house analytics capabilities to harness this untapped treasure trove. However, sustainable business benefits from analytics initiatives remain largely elusive, as organizations are yet to discover the secret recipe that makes it all work.
As per a recent study, the average ROI from analytics initiatives is still negative for most organizations. Most organizations are in one of the following stages of evolution towards becoming a data driven organization –
Dealing with Unstructured Data
Organizations today are sitting on vast heaps of data and unfortunately, most of it is unstructured in nature. There is an abundance of data in the form of free flow text residing in our data repositories.
While there are many analytical techniques in place that help process and analyze structured (i.e. numeric) data, fewer techniques exist that are targeted towards analyzing natural language data.
To overcome these problems, we will devise an unsupervised text clustering approach that enables businesses to programmatically bin this data. The bins themselves are generated programmatically, based on the algorithm's understanding of the data. This reduces the volume of data to look at and makes the broader picture easier to grasp: instead of trying to understand millions of rows, one only needs to understand the top keywords in about 50 clusters.
Based on this, a world of opportunities opens up:
- In a customer support module, these clusters help identify the show-stopper issues, which can then become subjects of increased focus or automation.
- Customer reviews on a particular product or brand can be summarized, which can help shape the road map for the organization.
- Survey data can be easily segmented.
- Resumes and other unstructured data in the HR world can be reviewed with far less effort.
The list goes on, but the point of focus is a generic machine learning algorithm that can derive insights, in an easily digestible form, from large volumes of unstructured text.
Text Clustering: Some Theory
The algorithm first performs a series of transformations on the free flow text data (elaborated in subsequent sections) and then performs a k-means clustering on the vectorized form of the transformed data. Subsequently, the algorithm creates cluster-wise tags, also known as cluster-centers, that are representative of the data contained in these clusters.
The solution is automated end to end and is generic enough to operate on any dataset.
The text clustering algorithm works in five stages enumerated below:-
- Transformations on raw stream of free flow text
- Creation of Term Document Matrix
- TF-IDF (Term Frequency – Inverse Document Frequency) Normalization
- K-Means Clustering using Euclidean Distances
- Auto-Tagging based on Cluster Centers
These are elaborated below along with illustrations:-
The free flow text data is first curated in the following stages:-
- Stage 1
- Removing punctuations
- Transforming to lower case
- Grammatically tagging sentences and removing pre-identified stop phrases (Chunking)
- Removing numbers from the document
- Stripping any excess white spaces
- Stage 2
- Removing generic words of the English language (stop words), viz. determiners, articles, conjunctions and similar function words.
- Stage 3
- Document Stemming which reduces each word to its root using Porter’s stemming algorithm.
These steps are best explained through the illustration below:-
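In code, the three stages can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the implementation from Part 2. The stop-word list is a tiny made-up sample, `naive_stem` is a crude suffix stripper standing in for Porter's algorithm, and the chunking step (grammatical tagging and stop-phrase removal) is left out entirely.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "and", "or", "but",
              "to", "of", "in", "on", "for", "with", "my", "it", "this"}

# Crude suffix list used as a stand-in for Porter's stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix. This is NOT a full Porter stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply Stages 1 to 3 (without chunking) to a single document."""
    # Stage 1: remove punctuation, lower-case, remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    tokens = text.split()  # split() also collapses excess white space
    # Stage 2: drop stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stage 3: reduce each remaining word to a (rough) root
    return [naive_stem(t) for t in tokens]


print(preprocess("The 2 printers are   NOT printing, and my order was delayed!"))
# → ['printer', 'not', 'print', 'order', 'delay']
```

A production pipeline would more likely rely on an established NLP library for stop words and stemming; the hand-rolled versions here only keep the sketch self-contained.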
Once all the documents in the corpus are transformed as explained above, a term document matrix is created and the documents are mapped into this vector space model using a 1-gram (unigram) vectorizer (see below). Other, more sophisticated implementations use n-grams (where n is a reasonably small integer).
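As a rough sketch of this step, the snippet below builds a unigram count matrix from already-tokenized documents. The function name `build_term_document_matrix` and the sample documents are made up for illustration, and the matrix is laid out with one row per document and one column per term, which is an assumption about orientation rather than something the description above fixes.

```python
from collections import Counter


def build_term_document_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, matrix).

    matrix[i][j] is the number of times vocabulary[j] occurs in docs[i].
    """
    vocabulary = sorted({token for doc in docs for token in doc})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix


docs = [
    ["printer", "not", "print"],
    ["order", "delay"],
    ["printer", "delay", "delay"],
]
vocabulary, matrix = build_term_document_matrix(docs)
print(vocabulary)  # → ['delay', 'not', 'order', 'print', 'printer']
print(matrix)      # → [[0, 1, 0, 1, 1], [1, 0, 1, 0, 0], [2, 0, 0, 0, 1]]
```

For a real corpus the matrix is mostly zeros, so a sparse representation would be the more practical choice.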
TF-IDF (Term Frequency – Inverse Document Frequency) Normalization
This is an optional step and can be performed in case there is high variability in the document corpus and the number of documents in the corpus is extremely large (of the order of several million). This normalization increases the importance of terms that appear multiple times in the same document while decreasing the importance of terms that appear in many documents (which would mostly be generic terms). The term weightages are computed as follows:-
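A common textbook formulation, which may differ in detail from the variant used in the implementation, is tfidf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t. The sketch below applies it to a small count matrix; the function name `tf_idf` and the input layout (one row per document) are assumptions.

```python
import math


def tf_idf(matrix):
    """Re-weight a document-by-term count matrix with tf * log(N / df)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # df[j]: number of documents in which term j appears at least once
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    # idf[j] is 0 for a term that appears in every document
    idf = [math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


counts = [
    [2, 1, 0],  # term 0 appears in every document, so its weight drops to 0
    [1, 0, 3],
]
for row in tf_idf(counts):
    print([round(w, 3) for w in row])
```

Library implementations often add smoothing terms to the logarithm and normalize each document vector, so their numbers will not match this plain version exactly.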
K-Means Clustering using Euclidean Distances
After the TF-IDF transformation, the document vectors are passed to a K-Means clustering algorithm, which computes the Euclidean distances between the documents and groups nearby documents together.
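To make the clustering step concrete, here is a minimal sketch of the standard K-Means iteration (Lloyd's algorithm) with Euclidean distance, written in plain Python. The function names, the random initialization from the data points, and the toy vectors are all assumptions for illustration; the actual implementation may use a library and a different initialization.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(vectors, k, max_iter=100, seed=0):
    """Return (labels, centers) after running Lloyd's algorithm."""
    rng = random.Random(seed)
    # Initialize the centers with k distinct data points (an assumption).
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = k_means(vectors, k=2)
print(labels)
print(centers)
```

On this toy input the first two points end up in one cluster and the last two in the other; which cluster receives label 0 depends on the initialization.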
Auto-Tagging based on Cluster Centers
The algorithm then generates cluster tags, known as cluster centers, which represent the documents contained in these clusters. The clustering and auto-generated tags are best depicted in the illustration below (principal components 1 and 2 are plotted along the x and y axes respectively):-
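One plausible way to turn cluster centers into readable tags is to pick, for each center, the terms with the largest weights. The sketch below does exactly that; it is an interpretation of the auto-tagging step rather than a confirmed description of it, and the function name `cluster_tags`, the vocabulary, and the center values are invented for illustration.

```python
def cluster_tags(centers, vocabulary, top_n=3):
    """For each cluster center, return its top_n highest-weighted terms."""
    tags = []
    for center in centers:
        # Term indices ordered from highest to lowest center weight
        ranked = sorted(range(len(vocabulary)),
                        key=lambda j: center[j], reverse=True)
        # Skip terms with zero weight: they say nothing about the cluster
        tags.append([vocabulary[j] for j in ranked[:top_n] if center[j] > 0])
    return tags


# Hypothetical vocabulary and centers, e.g. from K-Means on TF-IDF vectors
vocabulary = ["delay", "order", "print", "printer", "refund"]
centers = [
    [0.1, 0.0, 0.9, 1.2, 0.0],
    [1.1, 0.8, 0.0, 0.0, 0.4],
]
print(cluster_tags(centers, vocabulary, top_n=2))
# → [['printer', 'print'], ['delay', 'order']]
```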
In order for more and more users to benefit from this solution and analyze their unstructured text data, I have created a RESTful web service that users can access in two ways:-
- A web interface for this service, built on a Swagger API docs front end, which is a very popular solution for RESTful web services. The user navigates to the web interface URL, uploads the dataset, and specifies the column containing the natural language data to be analyzed and the desired number of clusters. Within a few minutes, the output appears as a downloadable link containing the results of the analysis.
- Since the web service works on the concept of Application Programming Interface (API), the computation engine that performs the analysis is a separate component which is scalable, portable and can be accessed from any other application through RESTful HTTP.
Since all computations are performed in memory, the results are returned quickly.
A mathematical approach to understanding and analyzing natural language data could prove instrumental in unlocking the enormous value and insights currently trapped within it, and could vastly improve our understanding of our organization and its ecosystem. The next post, covering the ground-level implementation details, will be up soon, so follow along if you are interested. The code is available on my GitHub page.