Big Data is probably one of the most misused terms of the last decade. It was widely promoted, discussed, and spread by business managers, technical experts, and experienced academics. Slogans like “Data is the new oil” were accepted as unquestionable truth.
These beliefs pushed technologies forward, Hadoop in particular. Its stack, originally developed at Yahoo! and now maintained by the Apache Software Foundation, was recognized as “the” Big Data solution.
Many companies started to offer commercial, enterprise-grade, supported versions of Hadoop, and it was soon being evaluated and adopted across a large number of industries, ranging from medium-sized companies to the Fortune 500.
The possibility of analyzing huge amounts of data generated by heterogeneous sources, in order to boost competitiveness and profitability, was the key reason for the investments made in Hadoop.
Another important driver was the idea of replacing expensive legacy data warehouse installations with Hadoop, improving both performance and data availability while reducing operational costs.
However, over the last few years, a growing number of analysts focused on the Big Data market have published articles declaring the decline of Hadoop. Their main motivations can be summarized as follows:
- The deployment model is moving from on-premises solutions to hybrid, full, and multi-cloud architectures. Hadoop was not designed to be fully cloud-ready. Furthermore, cloud vendors have been selling cheaper, easier-to-manage alternatives for years.
- Machine Learning technologies and platforms are quickly reaching production maturity. The Hadoop stack was not designed around Machine Learning concepts, even if support has been added over the years.
- The advanced and real-time analytics market is growing rapidly, and the Hadoop stack does not seem to be the best fit for implementing these innovative kinds of analytics.
In a few words, analysts started declaring that Hadoop was no longer an innovative technology and that solving future challenges would require something different.
From a more empirical point of view, our own past experience shows that solutions based on the Hadoop stack proved very hard and expensive to develop and maintain. Furthermore, professionals with the right skills and proven experience were not easy to recruit.
As a result, many adopters never reached maturity with the vertical solutions they developed on top of the technology: moving those systems from PoC and prototype status to real production seemed an almost unreachable finish line.
Those are not the only reasons for the recent disillusionment around Hadoop technologies and, more generally, the “Big Data” movement. Another main cause can be identified in the proposition used by many Hadoop vendors, which positioned concepts like the Data Lake as central to data management.
While creating a unique, denormalized data repository is still a need for big, complex organizations, if only to support data governance and data lineage practices, the projects meant to feed Data Lakes tend to last years in large enterprises before reaching maturity. Most of those initiatives ultimately proved very expensive, both economically and from a project governance point of view.
These complex repositories are filled with historical data that, in the lucky case, refers to a series of snapshots up to the last closing day. While that may be acceptable in many business scenarios, the enterprise world increasingly needs to react to events instantaneously. For this reason, companies are asking for more accurate and rapid insights, to immediately forecast the possible outcomes and scenarios generated by the available set of input actions.
Nowadays, one of the most effective ways to address these pressing requirements is to embrace an Event Stream Processing architecture. Coming back to the points raised by the analysts, it is clear that Event Stream Processing could become a perfect backbone for at least:
- The implementation of multi-cloud architectures (real-time or near-real-time integration of distributed data across different data centers and cloud vendors).
- The deployment and monitoring of Machine Learning models, enjoying the power of real-time predictions.
- Real-time data processing without losing accuracy while analyzing historical data.
For these reasons, streaming technologies are improving every day, taking market share from more traditional solutions based on batch processing.
Most Hadoop vendors decided to answer these pressing needs by incorporating into their Big Data distributions one of the streaming frameworks offered by the open-source landscape, usually Apache Storm or Apache Spark Streaming.
Unfortunately, the consequence was to add even more complexity to their stacks: the offered products eventually included a wide range of computational engines, making the choice of the right tool for the job painful for practitioners such as architects and developers.
Other vendors are instead exploring new ways of dealing with the combination of bounded (e.g. a file) and unbounded (e.g. an infinite incoming sequence of tweets) data sources, employing stream engines for batch processing as well.
What is the relationship between stream and batch processing? While it is almost impossible to run a stream processing job on top of a batch processing framework, the opposite is largely feasible. For instance, we can read a text file with a stream processing framework, translating each file row into a single event and processing it. On the other hand, a batch processing framework cannot work on individual events, only on sets of events: to approximate streaming, it would have to be scheduled continuously.
To summarize, stream processing can be seen as a superset of batch processing; batch processing, in turn, is a special case of stream processing.
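The file example above can be sketched in a few lines of plain Python (no specific streaming framework assumed; the function names are invented for illustration): a bounded source such as a text file is consumed as a stream by turning each row into a single event and processing events one at a time.

```python
# Minimal sketch: treating a bounded source (a file) as a stream of events.
# file_as_event_stream and process_event are illustrative names, not a real API.
from typing import Iterator


def file_as_event_stream(path: str) -> Iterator[str]:
    """Yield each row of a bounded file as an individual event."""
    with open(path) as f:
        for row in f:
            yield row.rstrip("\n")


def process_event(event: str) -> str:
    """A trivial per-event transformation (here: uppercasing)."""
    return event.upper()


if __name__ == "__main__":
    # Create a small bounded input for the demo.
    with open("input.txt", "w") as f:
        f.write("first row\nsecond row\n")

    # The pipeline handles the bounded source event by event,
    # exactly as it would handle an unbounded one.
    for event in file_as_event_stream("input.txt"):
        print(process_event(event))
```

The point of the sketch is that the per-event loop never needs to know whether the source will end: the same code would work unchanged on an endless source.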
In conclusion, an Event Stream Processing engine can:
- Work on both bounded data (data at rest) and unbounded data (data in motion).
- Process data with tunable low latency (ranging from milliseconds to seconds) while sustaining high throughput.
- Offer different processing semantics (at-most-once, at-least-once, or exactly-once).
- Process heterogeneous data in a distributed fashion, scaling the system out horizontally.
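The difference between the first two delivery semantics can be illustrated with a plain-Python sketch (no real broker involved; `FlakyChannel`, `at_most_once`, and `at_least_once` are invented names for this example): at-most-once sends each event once and never retries, so events can be lost; at-least-once retries until the send is acknowledged, so nothing is lost but duplicates can appear. Exactly-once additionally requires deduplication or transactional delivery, which is beyond this sketch.

```python
# Illustrative simulation of at-most-once vs at-least-once delivery
# over an unreliable in-memory channel. All names are invented.
import random


class FlakyChannel:
    """A channel that can lose a message or lose its acknowledgment."""

    def __init__(self, seed: int = 1):
        self.delivered = []
        self.rng = random.Random(seed)

    def send(self, event) -> bool:
        """Attempt delivery; return True only if the sender got an ack."""
        if self.rng.random() < 0.3:
            return False              # message lost in transit: not delivered
        self.delivered.append(event)  # message arrived at the receiver
        if self.rng.random() < 0.3:
            return False              # delivered, but the ack was lost
        return True


def at_most_once(events, channel):
    """Fire and forget: never retry, so some events may be lost."""
    for e in events:
        channel.send(e)


def at_least_once(events, channel):
    """Retry until acknowledged: nothing is lost, duplicates may appear
    when the event arrived but the acknowledgment was lost."""
    for e in events:
        while not channel.send(e):
            pass
```

Running `at_least_once` over a flaky channel always delivers every event (possibly more than once), while `at_most_once` never delivers duplicates but may drop events; this is exactly the trade-off the semantics names describe.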
However, there is a dark side to all of this: architecting and developing solutions based on stream processing is not as easy as it might seem. Although such technologies are lightweight and usually need a less complex stack, they are not straightforward to use correctly at first.
Given its importance and benefits, Event Stream Processing should instead be democratized: the impediments can be tackled with high-level, self-service tools that enforce best practices and patterns, leverage the Big Data stacks often already present in companies, and preserve the investments made in the past.