This year has seen consolidation and engineering around improving the basic storage and data processing engines of NoSQL and Hadoop. That will doubtless continue, as we see the unruly menagerie of the Hadoop universe increasingly packaged into distributions, appliances and on-demand cloud services. Hopefully it won’t be long before that’s dull, yet necessary, infrastructure.
Looking up the stack, there’s already a line up of cool tools for data scientists and not to be left out, the Hadoop connectors for established analytical tools such as R and Tableau. Yet the idea is to go into next year making big data more powerful: frankly by to decreasing the cost of generating experiments.
Here are two ways in which big data can be made more powerful.
Hadoop’s batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn’t need to be up-to-the-minute. However, batch processing isn’t always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.
Over the next few years we’ll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.
For some applications, there just isn’t enough storage in the world to store every piece of data your business might receive: at some point you need to make a decision to throw things away. Having streaming computation abilities enables you to analyze data or make decisions about discarding it without having to go through the store-compute loop of map/reduce.
Emerging contenders in the real-time framework category include Storm, from Twitter, and S4, from Yahoo.
Rise of data marketplaces
Your own data can become that much more potent when mixed with other datasets. For instance, add in weather conditions to your customer data, and discover if there are weather related patterns to your customers’ purchasing patterns. Acquiring these datasets can be a pain, especially if you want to do it outside of the IT department, and with some exactness. The value of data marketplaces is in providing a directory to this data, as well as streamlined, standardized methods of delivering it. Microsoft’s direction of integrating its Azure marketplace right into analytical tools foreshadows the coming convenience of access to data.
Development of data science workflows and tools
As Data Scientist and their teams become a recognized part of companies, we’ll see a more regularized expectation of their roles and processes. One of the driving attributes of a successful data science team is its level of integration into a company’s business operations, as opposed to being a sidecar analysis team.
Software developers already have a wealth of infrastructure that is both logistical and social, including wikis and source control, along with tools that expose their process and requirements to business owners. Integrated data science teams will need their own versions of these tools to collaborate effectively. One example of this is EMC Greenplum’s Chorus, which provides a social software platform for data science. In turn, use of these tools will support the emergence of data science process within organizations.
Data science teams will start to evolve repeatable processes, hopefully agile ones. They could do worse than to look at the ground-breaking work newspaper data teams are doing at news organizations such as The Guardian and New York Times: given short timescales these teams take data from raw form to a finished product, working hand-in-hand with the journalist.
Increased understanding of and demand for visualization
Visualization fulfills two purposes in a data workflow: explanation and exploration. While business people might think of a visualization as the end result, data scientists also use visualization as a way of looking for questions to ask and discovering new features of a dataset.
If becoming a data-driven organization is about fostering a better feel for data among all employees, visualization plays a vital role in delivering data manipulation abilities to those without direct programming or statistical skills.