How NoSQL Fundamentally Changed Machine Learning

Summary:  In the six or seven short years since the first commercial implementation of a Hadoop NoSQL database, Machine Learning has come to mean far more than it did before.

A question that often comes up from folks starting to explore data science is: just what is Machine Learning?  When I started out it was easy to explain.  Machine Learning (ML) was the category of mathematical algorithms, like regression, clustering, decision trees, and neural nets, used to extract signals from data, aka predictive models.  Then came NoSQL, and all that changed.

A short history review.  The first commercial NoSQL implementation (not counting Google's first-mover efforts) is credited to Yahoo's deployment of Hadoop in 2008 to improve its search indexing.  The first Hadoop developers conference was held in mid-2008, and early implementations by Facebook, Twitter, and eBay followed in 2009.  This entire explosion in capability is now barely six or seven years old.

So what exactly has ML become?  What’s in the box?  How do they relate?  In an effort to explain this recently, I put the components on this grid.

On the vertical axis I've put the major categories of data along with the NoSQL DB types that most commonly correspond to each, and on the horizontal axis an indication of whether the insights gained from each ML type are specific or simply directional.

Predictive Models (including descriptive or unsupervised models like clustering):  These have always required numeric data (OK, decision trees can take some text categoricals), and historically they drew on our traditional structured data from transactional systems or a data warehouse.  The insight gained was intended to be quite accurate; if you'll allow me a little artistic leeway, I'd say we expected these to be, say, 70% to 95% accurate in forecasting specific future values or human behaviors.
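To make the category concrete, here is a minimal sketch of the kind of predictive model this paragraph describes: an ordinary-least-squares regression fit on structured numeric data. The tenure-versus-spend numbers are purely hypothetical, chosen to illustrate the fit.

```python
# A minimal predictive-model sketch: simple linear regression fit by
# ordinary least squares on structured, numeric, tabular data.

def fit_linear(xs, ys):
    """Return (slope, intercept) minimizing squared prediction error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical structured data: customer tenure (months) vs. annual spend ($).
tenure = [3, 6, 12, 24, 36]
spend = [120, 180, 300, 540, 780]
slope, intercept = fit_linear(tenure, spend)
print(slope, intercept)
```

In practice you would use a library implementation with many features and holdout validation; the point is simply that the inputs are clean numeric columns and the output is a specific forecasted value.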

In a nod to NoSQL, I let this category lap into semi-structured data territory, where we can now extract specific features from JSON or XML encodings to feed our models.
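A quick sketch of that feature-extraction step, using a hypothetical JSON event record (the field names here are invented for illustration):

```python
import json

# Extract structured model features from one semi-structured JSON record.
record = json.loads("""
{"user": "u123",
 "events": [{"type": "click", "page": "pricing"},
            {"type": "click", "page": "support"},
            {"type": "purchase", "amount": 49.99}]}
""")

# Flatten the nested events into numeric columns a predictive model can use.
features = {
    "n_events": len(record["events"]),
    "n_purchases": sum(1 for e in record["events"] if e["type"] == "purchase"),
    "total_spend": sum(e.get("amount", 0) for e in record["events"]),
}
print(features)
```

The output is an ordinary structured row, which is exactly why this category can "lap into" semi-structured territory.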

And while I’m not sure that “data lakes” rise to the level of an ML tool, the ability of NoSQL to create blended databases of semi-structured data that can handle all the volume, variety, and velocity of Big Data sources needs to be acknowledged.


Recommenders are those wonderful tools that tell us whom to date or what to watch or read.  Originally they were the exclusive domain of NoSQL graph DBs, but increasingly they're being built with NoSQL columnar DBs like HBase.  I don't think anyone would argue they are meant to be anything more than directionally correct, but their existence has undoubtedly driven millions and millions of dollars in guided ecommerce purchases, and perhaps even some happy marriages.
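For a feel of how a recommender produces its "directionally correct" suggestions, here is a toy item-to-item collaborative filtering sketch. The users, items, and ratings are entirely hypothetical, and a real system would run at NoSQL scale rather than over an in-memory dict.

```python
from math import sqrt

# Toy item-to-item recommender: score unseen items for a user by cosine
# similarity between item rating vectors.
ratings = {                    # user -> {item: rating}
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 4, "B": 5, "C": 2},
    "cal": {"B": 3, "C": 5},
}

def cosine(item1, item2):
    """Cosine similarity over users who rated both items."""
    common = [u for u in ratings if item1 in ratings[u] and item2 in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][item1] * ratings[u][item2] for u in common)
    n1 = sqrt(sum(ratings[u][item1] ** 2 for u in common))
    n2 = sqrt(sum(ratings[u][item2] ** 2 for u in common))
    return dot / (n1 * n2)

def recommend(user):
    """Rank items the user has not rated by similarity-weighted score."""
    seen = ratings[user]
    items = {i for r in ratings.values() for i in r} - set(seen)
    scores = {i: sum(seen[j] * cosine(i, j) for j in seen) for i in items}
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("ann"))
```

Note there is no "accuracy" claim here, only a ranking, which is why the article places recommenders on the directional end of the grid.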

Natural Language Processing (NLP):

NLP, also called text or sentiment analysis, comprises techniques that take largely unstructured word BLOBs, for example from social media feeds or CSR logs, and try to interpret general trends and sentiment.  At this level of analysis the insights are clearly meant to be directional at best.  But an interesting thing has happened.  Some clever users (Bank of America comes to mind) have found signals in NLP data that can be converted to binary structured features and used alongside other structured data to enhance predictive models.  For example, some text analysis patterns may be strongly correlated with buying or with intent to defect; when those patterns are observed, they can be converted to a feature in your purchase or churn model to enhance model accuracy.
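That text-to-binary-feature trick can be sketched in a few lines. The churn-intent phrases and the feature row below are hypothetical stand-ins for whatever patterns a real analysis would surface:

```python
import re

# Convert an unstructured text signal into a binary structured feature
# that can sit alongside ordinary numeric columns in a churn model.
CHURN_PATTERNS = [r"\bcancel\b", r"\bswitch(ing)? to\b", r"\bclose my account\b"]

def churn_signal(text):
    """Return 1 if any churn-intent pattern appears in the text, else 0."""
    return int(any(re.search(p, text.lower()) for p in CHURN_PATTERNS))

log = "Customer asked how to close my account after the fee increase."
row = {"tenure_months": 18,          # ordinary structured features...
       "monthly_spend": 42.0,
       "churn_signal": churn_signal(log)}  # ...plus the NLP-derived flag
print(row["churn_signal"])
```

The directional NLP insight becomes a specific model input, which is exactly the hybrid the paragraph describes.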

Internet of Things (IOT):

Yes, it was possible to analyze IOT data before NoSQL came along, but not at the volume or velocity that NoSQL currently allows.  In fact, IOT data epitomizes the volume and velocity aspects of Big Data, from the 10,000 sensors on the wings of an A320, to the SCADA output of your local nuclear reactor, to predictive time-to-failure analysis on any variety of machines and vehicles, big and small.

The thing to understand about IOT and ML is that (largely) numeric IOT data streams are the input to predictive models that produce (relatively) accurate algorithms, which can then be embedded in much simpler programs and sensors to predict outcomes.  So while IOT data is frequently NoSQL Big Data, the analytics are typically old-school, numerically driven predictive models.
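A hedged illustration of that embedding step: a model trained offline on telemetry reduces to a few coefficients that a simple on-device program can evaluate. The coefficients, threshold, and vibration-based failure formula below are all hypothetical.

```python
# The "much simpler program" the article mentions: an offline-trained
# time-to-failure model reduced to two embedded coefficients.
SLOPE, INTERCEPT = 0.8, -12.0   # e.g., from an offline regression fit
FAILURE_THRESHOLD = 60.0        # maintenance cutoff, in predicted hours

def hours_to_failure(vibration_mm_s):
    """Evaluate the embedded model on one vibration sensor reading."""
    wear = SLOPE * vibration_mm_s ** 2 + INTERCEPT
    return max(0.0, 120.0 - wear)

def needs_maintenance(vibration_mm_s):
    """Flag the machine when predicted hours-to-failure drops too low."""
    return hours_to_failure(vibration_mm_s) < FAILURE_THRESHOLD

print(needs_maintenance(5.0), needs_maintenance(10.0))
```

The heavy lifting (fitting SLOPE and INTERCEPT over a NoSQL-scale stream) happens elsewhere; the sensor only evaluates the result, which is why the analytics remain old-school predictive modeling.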

Image Processing and Deep Learning:

I have lumped these two seemingly dissimilar ML techniques together because they are both quite new, and while their promise seems bright, their adoption is still in its infancy.  They promise to take unstructured data and turn it into very specific results, either directly (this is a picture of a cat) or, in the style of IOT, as input to predictive models.  Deep learning, which has been described as a sort of predictive modeling similar to ANNs but unsupervised, may one day overtake or replace some traditional predictive modeling.

So in less than seven years, NoSQL has exploded the meaning of Machine Learning to include data lakes, recommenders, NLP, IOT, image processing, deep learning, and probably a couple I missed.  Some of these give quite specific insights into the future, and others are more directional, but valuably so, delivering insights we couldn't achieve before.  There's much more to data science and machine learning than there used to be.


April 24, 2015

Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.


About the author:  Bill Vorhies is President & Chief Data Scientist of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]




Comment by ajit jaokar on July 30, 2015 at 9:10pm

Excellent resource. Thanks for posting it!

Comment by Robert Klein on June 2, 2015 at 11:57am

This is an extremely useful resource. As APIs come out and democratize these capabilities for more specific applications we’ll start to see ML take off in high gear. We just rolled out an API that correlates themes in streams of unstructured information over time in a way inspired by Lorenz’s work with closed systems. We’d appreciate feedback from devs using it for ML preprocessing: http://darwineco.com/builders. Hit us up with questions on github. Cheers!

Comment by Vasanth Gopal on May 21, 2015 at 3:46am

Very informative article. There is no doubt that NoSQL has stood the test of time. It has definitely proved itself as far as scalability and flexibility are concerned. It is less restricted and not handicapped by any particular data model, as SQL is.

Comment by Ralph Winters on May 10, 2015 at 6:56am

No doubt that NoSQL has added to our ability to capture and store the data. Unfortunately, I think the analytics or Machine Learning part is still behind in its ability to directly analyze that large store, and we still have to resort to extracting the data and processing it externally. The true analysis comes from probing the underlying data directly, which can be difficult.

Comment by Sione Palu on May 5, 2015 at 7:49am

While on Image Processing, the following BBC mentioned Lena:


Comment by Sione Palu on May 5, 2015 at 7:39am

Great & Informative article Bill.

I would like to add to the post. Image processing is a field that has existed on its own longer than machine learning (i.e., it predates machine learning by decades). It has been taught mainly as a branch of engineering (electrical & electronics) and, to a lesser degree, in computer science and physics courses.

It's only in the last decade or so that image processing has included machine learning topics for image recognition and understanding. The classic image processing textbook by Gonzalez et al., "Digital Image Processing" (http://www.amazon.com/Digital-Image-Processing-3rd-Edition/dp/01316...), was first published in 1977 (1st edition); the latest (3rd) edition is still a popular textbook today, and the book has accompanying Matlab software. The 3rd edition has an added chapter on "Object Recognition" that wasn't in the 1st or 2nd editions. The last time I passed through my local university bookstore (about a year ago), this textbook was in stock because it's still a prescribed text for final-year Electrical Engineering courses.

Anyway, the picture of Lena (taken in the early 1970s) is the most widely recognized image in the Image Processing community, since it is the first image students are given for their assignments and tests.


