Making Do With Small Data
The 2010s could arguably be described as the era of Big Data, when businesses suddenly seemed to be deluged by huge amounts of data that had to be processed immediately. Part of this was an amplification by the IT hype mills, since Big Data required Big Servers (or lots of little ones), faster processors, and more programmers to do the heavy lifting of creating the Data Lakes and Enterprise Warehouses so integral to the zeitgeist; part of it was the impact of mobile computing, which dramatically expanded the number of sensors in play.
Yet the reality on the ground was a bit different for most companies, even many in the IT space itself. Most of the really big data was coming from a few focused social media companies, not from businesses dramatically increasing their data streams elsewhere, and much (indeed most) of that was noise outside the context it came from. Social media is actually a poor place to pick up on covert terrorist activities (high noise, subtle signals), though it's great at identifying domestic terrorists who want to publicly high-five their buddies over their latest hijinks.
Most data is, at the end of the day, the trail that transactions leave over time. This information can be valuable, but from the perspective of a business, the metadata at the other end of the transaction is usually fragmentary and hard to quantify. This is one of the reasons that any comprehensive AI solution has to incorporate both algorithmic processes (machine learning) and annotational processes (semantics). Most analytics tools, even neural networks, tend to concentrate on data from the perspective of the transaction, while annotational processes are often far more useful to a company, as they are a critical source for what is colloquially called "labeling".
Labeling is often considered bothersome by analysts because it is time-consuming and requires the collection of metadata rather than the analysis of data. This metadata also requires developing a conceptual model and distilling relationships, which usually does require human intervention. It is possible to infer this structure using statistical techniques, but doing so requires a huge amount of data while providing at best only a hint of the underlying structure.
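To make the idea of annotational labeling concrete, here is a minimal, hypothetical sketch: raw transaction rows are enriched with a small hand-built conceptual model that maps each field to a concept and a unit. The field names, schema, and values are all illustrative assumptions, not a real system.

```python
# Hypothetical example: labeling transaction data with a small conceptual
# model (metadata), rather than inferring structure statistically.

transactions = [
    {"sku": "A-100", "qty": 2, "amount": 19.98},
    {"sku": "B-200", "qty": 1, "amount": 5.49},
]

# Hand-built label metadata: each field mapped to a concept (and a unit
# where one applies). This is the human-intervention step described above.
schema = {
    "sku":    {"concept": "Product",  "role": "identifier"},
    "qty":    {"concept": "Quantity", "unit": "count"},
    "amount": {"concept": "Money",    "unit": "USD"},
}

def annotate(row, schema):
    """Attach the conceptual labels to every field of a transaction."""
    return {k: {"value": v, **schema.get(k, {})} for k, v in row.items()}

labeled = [annotate(t, schema) for t in transactions]
print(labeled[0]["amount"])
# {'value': 19.98, 'concept': 'Money', 'unit': 'USD'}
```

The point of the sketch is that the schema is tiny relative to the data it labels: a handful of concept mappings can annotate millions of rows, which is exactly the leverage that statistical inference struggles to reproduce without enormous datasets.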
The next generation of neural networks is beginning to take this small data into account, in essence focusing not just on the statistics of the data but also on its shape. Known as labeled neural networks (LNN) or graph neural networks (GNN), these various convolutional neural nets replace brute-force analysis with what amount to Bayesian networks, using probabilistic models to identify the schema (or model) implicit in the data. With that information (especially when combined with the contextual streaming that provides the working memory for these processes), GNNs can then become self-labeling, determining not only the values but also the structure of the resulting function.
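The core mechanism these networks share is message passing over a graph: each node updates its features by aggregating those of its neighbors, so the structure of the data shapes the computation. Below is a minimal sketch of one such layer using the widely known GCN propagation rule; the tiny four-node graph and random weights are illustrative assumptions only, not taken from any system the article describes.

```python
import numpy as np

def gcn_layer(adj, features, weights):
    """One graph-convolution step: aggregate neighbor features, then transform.

    Implements the standard GCN rule: ReLU(D^-1/2 (A+I) D^-1/2 . H . W).
    """
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    deg = a_hat.sum(axis=1)                      # node degrees
    d_inv_sqrt = np.diag(deg ** -0.5)            # symmetric normalization
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
    return np.maximum(a_norm @ features @ weights, 0)  # ReLU

# Illustrative 4-node path graph: 0-1-2-3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))   # 3 input features per node
weights = rng.normal(size=(3, 2))    # learnable projection to 2 features

out = gcn_layer(adj, features, weights)
print(out.shape)  # (4, 2): each node now carries neighborhood-aware features
```

Because each node's output depends on its graph neighborhood rather than on raw volume alone, comparatively small, well-structured datasets can drive useful representations, which is the "small data" advantage the paragraph above describes.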
The biggest benefit of this technology is that it makes it possible to get the advantages of big data systems without requiring big data. Put another way, artificial intelligence is becoming more intuitive, able to parse out valid patterns with far less raw input. By making do with such small data, all users should benefit from this technology, not simply the ones with the deepest pockets.
In media res,
This email, and all related content, is published by Data Science Central, a division of TechTarget, Inc.
275 Grove Street, Newton, Massachusetts, 02466 US