AI for Organizing – Categorizing and Managing Data at Scale

So far in this series on Artificial Intelligence, in “Artificial Intelligence: Monster or Mentor?” and “AI for Summarization – Enabling Human-Consumable Information”, we have looked at several key ways in which advances in AI can improve human productivity in organizations. The last article dove into Distillation – automating the path to value. In this article, we’ll look at the next common approach: Categorization.

Categorization applies AI approaches to automate the labeling and organization of large data volumes, so that data can be routed, processed, and interpreted in the right way. Imagine an enormous coin sorter that takes dump truck loads of coins (a mixture of currencies from across the globe) and produces nicely sorted buckets of quarters, nickels, etc. for each currency. This is a poster-child example of categorization: the categories are well understood (we know up front all the possible “buckets” that coins can land in), and the sorter places each coin in the right bucket. In many real-life applications, and especially when we are categorizing large volumes of data, we aren’t this lucky. We (a) might not know what the “buckets” should be, and we (b) often make mistakes in categorization.

Topic Modelling is a great example of a machine learning approach to this challenge. 

Imagine you had every article ever written in the New York Times … but you didn’t know which section each article came from (Business? Sports? Opinions?). Topic Modelling methods, such as Latent Dirichlet Allocation (LDA), can learn and infer naturally occurring “buckets” of articles or, as we call them in this case, topics (see https://en.wikipedia.org/wiki/Topic_model).
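To make the idea concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling (one standard way to fit the model) in plain Python. The tiny corpus, the function name `lda_gibbs`, and the hyperparameter values are assumptions made purely for illustration; on real data a tested library implementation would be the more sensible choice.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, hypothetical helper).

    Returns (theta, topic_word, vocab), where theta[d][k] is the estimated
    share of topic k in document d and topic_word[k][v] counts how often
    vocabulary word v is currently assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # topic counts per document
    topic_word = [[0] * V for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                      # total words per topic
    assignments = []                                    # topic of every token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word, vocab

# Made-up "articles" with no section labels attached.
docs = [
    "team scores goal wins match".split(),
    "coach praises team after match win".split(),
    "stocks rally as market gains".split(),
    "investors watch market and stocks".split(),
]

theta, topic_word, vocab = lda_gibbs(docs, num_topics=2)

for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda v: counts[v], reverse=True)[:3]
    print(f"topic {k}:", [vocab[v] for v in top])
for doc, row in zip(docs, theta):
    print(" ".join(doc), "->", [round(p, 2) for p in row])
```

Note that the topics come back as unnamed numbers. The algorithm only groups articles that use similar words; a person still has to read the top words and decide that a given topic is, say, “Sports”.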

This is a standard “hello world” example, but the approach is immensely useful in business. For example, these techniques can enable large-scale, minimally supervised categorization of inbound customer emails. Many organizations receive more email than their sorting/routing teams can handle, and in today’s instant social media world, it’s crucially important not to drop the ball on important customer communication. Applied to this problem, these AI techniques can learn the inherent structure in the correspondence and, with minimal human intervention, help route large volumes of email to the correct teams.

Optimizing outbound customer communication is another good example of how categorization can have a large impact on business. Customer outreach and marketing campaigns are frequently plagued by low conversion rates: too few customers click on or respond to the emails, ads, etc. they are sent. There are many factors at play, but a significant one is how well the message is tailored to each customer’s interests and needs. A generic listing of items on sale doesn’t create the same interest as a list customized to what the customer cares about.

This approach, often called market segmentation, enables organizations to identify groupings of their customers with shared interests and then customize email communications for each segment. Market segmentation leads to increased conversion rates and a better experience for the customer. 
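One common way to find such groupings is clustering. The sketch below uses k-means, which is our assumption, since no specific algorithm is named here. The customer features (share of past purchases in two product categories), the function name `kmeans`, and all the numbers are made up for illustration.

```python
import math
import random
from collections import Counter

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (labels, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points

    def nearest(p):
        # Index of the centroid closest to point p (Euclidean distance).
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iterations):
        labels = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))

    labels = [nearest(p) for p in points]  # final assignment to the final centroids
    return labels, centroids

# Hypothetical customers: (share of purchases in electronics, share in apparel).
customers = [
    (0.90, 0.10), (0.80, 0.20), (0.85, 0.15),   # mostly electronics buyers
    (0.10, 0.90), (0.20, 0.80), (0.15, 0.85),   # mostly apparel buyers
]

labels, centroids = kmeans(customers, k=2)

print("segment sizes:", dict(Counter(labels)))
for segment, center in enumerate(centroids):
    print(f"segment {segment} center:", tuple(round(x, 2) for x in center))
```

Each segment’s center summarizes what its customers tend to buy, which is the information needed to decide what a tailored email for that segment should feature.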

Categorization has numerous other applications, but its impact is frequently greatest when a human is in the loop, as they gain most from having some inherent structure or organization (categories, if you will) applied to the enormous data volumes they’re trying to understand. Typically, this doesn’t solve the whole problem (e.g., in the newspaper article case, someone still needs to say these articles are “Sports”) but it makes the human drastically more efficient and productive.

The next article will wrap up this series, and we’ll take a look at how AI techniques for Prediction can enable humans to more efficiently find the figurative “needles in the haystack”.


Roy Wilds is the Chief Data Scientist at PHEMI Systems, a big data warehouse solutions company.