The question often comes up from folks starting to explore data science: just what is Machine Learning? When I started out it was easy to explain. Machine Learning (ML) was the category of mathematical algorithms like regression, clustering, decision trees, and neural nets used to extract signals from data, aka predictive models. Then came NoSQL and all that changed.
A short history review. The first commercial NoSQL implementation (not counting Google's first-mover efforts) is credited to Yahoo's implementation of Hadoop in 2008 to improve their search indexing. The first Hadoop developers conference was mid-2008, and early implementations by Facebook, Twitter, and eBay occurred in 2009. This entire explosion in capability is now barely six or seven years old.
So what exactly has ML become? What’s in the box? How do they relate? In an effort to explain this recently, I put the components on this grid.
On the vertical I’ve put the major categories of data along with the NoSQL DB types that most commonly correspond, and on the horizontal an indication of whether the insights gained from this ML type are specific or simply directional.
Predictive Models (including descriptive or unsupervised models like clustering): These have always required numeric data (OK, decision trees can take some text categoricals), and historically they drew on our traditional structured data from transactional systems or a data warehouse. The insight gained was intended to be quite accurate, and if you’ll allow me a little artistic leeway, I’d say we expected these to be, say, 70% to 95% accurate in forecasting specific future values or human behaviors.
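The kind of numeric predictive model described above can be sketched in just a few lines. Here is a minimal one-feature ordinary least squares regression using only the standard library; the ad-spend data are invented purely for illustration.

```python
# Minimal one-feature linear regression (ordinary least squares),
# the simplest example of a predictive model on structured numeric data.

def fit_ols(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented example: monthly ad spend (thousands) vs. units sold.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [12.0, 19.0, 31.0, 42.0, 48.0]

slope, intercept = fit_ols(spend, units)

def predict(x):
    return slope * x + intercept
```

A real model would have many more features and a held-out test set to measure that 70%-to-95% accuracy against, but the pattern is the same: fit on historical structured data, then score new records.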
In a nod to NoSQL I let this category lap over into Semi-Structured data territory, where we can now extract specific features by parsing JSON or XML to feed our models.
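Extracting model-ready features from a semi-structured document is mostly a matter of flattening. A minimal sketch, using the standard library's `json` module; the record fields and feature names here are invented for illustration.

```python
import json

# A semi-structured record as it might arrive from a NoSQL store.
# Field names are invented for illustration.
record = json.loads("""
{
  "customer_id": "C1001",
  "plan": "gold",
  "logins_last_30d": 14,
  "support": {"tickets_open": 2, "last_contact_days": 5}
}
""")

def extract_features(rec):
    """Flatten a nested JSON document into a flat feature dict a model can consume."""
    return {
        "is_gold_plan": 1 if rec.get("plan") == "gold" else 0,
        "logins_last_30d": rec.get("logins_last_30d", 0),
        "tickets_open": rec.get("support", {}).get("tickets_open", 0),
    }

features = extract_features(record)
```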
And while I’m not sure that “data lakes” rise to the level of an ML tool, the ability of NoSQL to create blended databases of semi-structured data that can handle all the volume, variety, and velocity of Big Data sources needs to be acknowledged.
Recommenders are those wonderful tools that tell us who to date or what to watch or read. Originally they were the exclusive domain of NoSQL Graph DBs, but increasingly they’re being built with NoSQL Columnar DBs like HBase. I don’t think anyone would argue they are meant to be anything more than directionally correct, but their existence has undoubtedly added millions and millions of dollars in guided ecommerce buys and perhaps even some happy marriages.
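At its simplest, the "directionally correct" logic behind a recommender is co-occurrence counting: items bought together in the past get recommended together in the future. A toy sketch with invented baskets; in production those edges would live in a graph or columnar store, but the idea is the same.

```python
from collections import Counter
from itertools import combinations

# Invented purchase histories, purely for illustration.
baskets = [
    {"book_a", "book_b", "book_c"},
    {"book_a", "book_b"},
    {"book_b", "book_c"},
    {"book_a", "book_d"},
]

# Count how often each pair of items was bought together.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def recommend(item, top_n=2):
    """Items most often co-purchased with `item`, best first."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(top_n)]
```

Real systems layer on similarity weighting, graph traversal, and personalization, but co-occurrence is the directional signal underneath.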
Natural Language Processing (NLP):
NLP, also known as text or sentiment analysis, comprises techniques that take largely unstructured word BLOBs, for example from social media feeds or CSR logs, and try to interpret general trends and sentiment. At this level of analysis the insights are clearly meant to be directional at best. But an interesting thing has happened. Some clever users (Bank of America comes to mind) have found signals in NLP data that can be converted to binary structured features and used alongside other structured data to enhance predictive models. For example, some text analysis patterns may be strongly correlated with intent to buy or intent to defect, and when those patterns are observed they can be converted to a feature in your purchase or churn model to enhance model accuracy.
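The conversion from text pattern to binary structured feature can be sketched very simply. The phrases and feature names below are invented for illustration; a real deployment would use patterns an analyst has actually validated against churn outcomes.

```python
import re

# Hypothetical patterns an analyst found correlated with churn.
# Both the phrases and the feature names are invented for illustration.
CHURN_PATTERNS = {
    "mentions_cancel": re.compile(r"\b(cancel|close my account)\b", re.I),
    "mentions_competitor": re.compile(r"\b(switching to|competitor)\b", re.I),
}

def text_to_features(text):
    """Turn free text into 0/1 flags usable alongside structured features."""
    return {name: int(bool(pattern.search(text)))
            for name, pattern in CHURN_PATTERNS.items()}

note = "Customer asked how to cancel; says they are switching to another provider."
flags = text_to_features(note)
```

The resulting 0/1 flags slot directly into the same feature table as the structured data, which is exactly how the NLP signal ends up inside a churn model.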
Internet of Things (IOT):
Yes, it was possible to analyze IOT data before NoSQL came along, but not at the volume or velocity that NoSQL currently allows. In fact IOT data exemplifies the volume and velocity aspects of Big Data, from the 10,000 sensors on the wings of an A320, to the SCADA output of your local nuclear reactor, to the predictive time-to-failure analysis on any variety of machines and vehicles, big and small.
The thing to understand about IOT and ML is that IOT data streams (largely numeric) are the input to predictive models that produce (relatively) accurate algorithms that can then be embedded in much simpler programs and sensors to predict outcomes. So while IOT data is frequently NoSQL Big Data, the analytics are typically old-school, numerically driven predictive models.
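That embedding step is worth making concrete: once the model is fit offline on the big data stream, only its coefficients need to ship to the device. A minimal sketch; the coefficients, sensor names, and threshold here are invented for illustration, not from any real fitted model.

```python
# Coefficients a time-to-failure model might produce offline.
# All values below are invented for illustration.
VIBRATION_COEF = 0.8
TEMP_COEF = 0.5
INTERCEPT = -40.0
ALERT_THRESHOLD = 10.0  # flag the machine when the score exceeds this

def failure_risk_score(vibration_mm_s, temp_c):
    """Linear score simple enough for a low-power controller to evaluate."""
    return VIBRATION_COEF * vibration_mm_s + TEMP_COEF * temp_c + INTERCEPT

def should_alert(vibration_mm_s, temp_c):
    return failure_risk_score(vibration_mm_s, temp_c) > ALERT_THRESHOLD
```

The heavy lifting (fitting on billions of readings) happens in the NoSQL cluster; what runs on the sensor is just this handful of multiplications, which is why the deployed analytics stay "old school."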
Image Processing and Deep Learning:
I have lumped these two seemingly dissimilar ML techniques together because they are both quite new, and while their promise seems bright, their adoption is still in its infancy. They promise to take unstructured data and turn it into very specific results, either directly (this is a picture of a cat) or, in the style of IOT, as input to predictive models. Deep learning, which has been described as a sort of predictive modeling similar to ANN but unsupervised, may one day overtake or replace some traditional predictive modeling.
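The building block deep learning stacks into many layers is the single artificial neuron: a weighted sum squashed through a nonlinearity. A minimal sketch; the "image" and weights are hand-picked for illustration, not learned.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs, squashed through a sigmoid to a value in (0, 1)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# A 2x2 "image" flattened to a list: bright pixels on the diagonal.
pixels = [0.9, 0.1, 0.1, 0.9]
weights = [1.0, -1.0, -1.0, 1.0]   # hand-picked to respond to diagonal patterns
score = neuron(pixels, weights, bias=0.0)
```

In a real deep network there are thousands of such units, arranged in layers, with the weights learned from data rather than hand-picked; that learned feature extraction is what lets these models turn raw pixels into "this is a picture of a cat."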
So in less than seven years, NoSQL has exploded the meaning of Machine Learning to include data lakes, recommenders, NLP, IOT, image processing, deep learning, and probably a couple I missed. Some of these give quite specific insights into the future and others are more directional, but valuably so for insights we couldn’t achieve before. There’s much more to data science and machine learning than there used to be.
April 24, 2015
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at:
The original blog can be seen at: