The Unix Philosophy, summarized by Doug McIlroy in 1994:
Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
This is considered by some to be the greatest revelation in the history of computer science, and there’s no debate that this philosophy has been instrumental in the success of Unix and its derivatives. Beyond Unix, it’s easy to see how this philosophy has been fundamental to the fantastic success of the Hadoop ecosystem.
Nevertheless, a recurring problem in Computer Science is latching on to a great idea and applying it indiscriminately, even when it doesn’t make sense. I’m going to argue that while the Unix philosophy works for batch workflows, it is a poor fit for stream processing.
In 2008, when I started work on what would become VoltDB, there was a tremendous amount of buzz around Hadoop. At the time Hadoop and “MapReduce” were almost synonymous. Seven years later, “MapReduce” is being subsumed by better technologies, but the Hadoop ecosystem is strong.
If MapReduce is waning, what has made the Hadoop ecosystem so successful? I’m probably not going out on a limb to suggest that HDFS is the key innovation of Hadoop. HDFS jump-started the Cambrian explosion of batch-processing we’re now experiencing by doing three things:
These three traits are why HDFS enables so many Hadoop ecosystem tools to do so many different things.
Between HDFS replication and data set immutability, no matter how flawed your software or hardware consuming the data is, the original data is safe. A successful batch job takes one immutable data set as input and produces another as output. If a job fails or corrupts due to hardware or software problems, the job can be killed, the output deleted, the problem addressed and then the job restarted. The input data is still pristine. HDFS and the batch nature of the work allows this robustness. Contrast this with a system that modifies a dataset in place, and you see that immutability mostly absolves developers from managing partial failure.
So developers building Hadoop ecosystem tools to run on HDFS are absolved from the most difficult problems of distributed systems. This enables new batch-oriented tools to be developed quickly. End-users are free to adopt new tools aggressively, as they can be run in addition to an already-working system without introducing new risk.
HDFS enables a vast ecosystem of tools that can be combined into billions of different combinations, and adjusted as business needs change.
Unsurprisingly, not all apps are a great fit for batch. Many data sets have value that must be extracted immediately or opportunities and insight are lost. We call these apps “Fast Data” apps, as opposed to “Big Data” apps.
Unfortunately, low-latency and HDFS don’t mix. HDFS isn’t magical and, like all things in engineering, comes with tradeoffs. Almost all of the tools built around HDFS as a repository for data accept that the latency between ingestion and the desired end response is measured in minutes or hours, not seconds or milliseconds.
This is typically because input datasets must be loaded before processing, and the output result must be fully written before it is accessed. This process can be chunked to reduce latency, but in order to maintain efficiency, these chunks are still at least seconds’ worth of work to process. The second latency issue is that most batch systems favor retry over redundancy when handling faults. This means, in the case of hardware or software failure, users will see significant spikes in latency.
Thus we’ve ruled out HDFS for latency-sensitive processing problems. So what do we use instead? Surely, without HDFS as the unifying foundation, the pattern of cobbling together many special-purpose systems to build a larger system using glue code and luck won’t translate to Fast Data, right?
You may have guessed this was a rhetorical question.
Let’s examine Storm, an exemplar tool for stream processing. Storm offers a framework for processing tuples. A set of processors, each with inputs and outputs, is connected in a topology. The user supplies the per-tuple processing code and the framework manages where that code is run, much like a MapReduce Hadoop job. Storm also manages the optional re-processing of tuples when software or hardware fails. Storm has a real “MapReduce” for streams feel to it, but there is a crucial difference: Storm is responsible for the data it is processing, where MapReduce is processing data kept safe by HDFS.
Keeping with the Unix and Hadoop philosophy of specialized tools, Storm focuses on distributing data and processing to a commodity cluster of Storm workers. It leaves other key functions to other systems.
A recent talk from a Twitter engineer described a problem his team solves with Storm. They count unique users of mobile apps for given time periods. The tricky part is the volume: reportedly 800,000 messages per second, making this a poor fit for more traditional systems. The stack they use involves Kafka, Storm, Cassandra, and of course, ZooKeeper.
We can assume ZooKeeper, Kafka, Storm and Cassandra all use at least three nodes each to run with reasonable safety. Three is a bit of a magic number in distributed systems; two node clusters have a harder time agreeing on the state of the cluster in failure scenarios. So now we’re operating and monitoring four systems, at least twelve nodes, and the interops/glue between all of the above. If a network between two Cassandra nodes fails, are the symptoms the same as if the network between Storm nodes failed, or ZooKeeper nodes failed? Each of these systems has different failure semantics, and can cause different symptoms or cascading failures in other parts of the stack.
While four systems wouldn’t be odd in an HDFS-based batch system, the crucial difference here is that user data ownership is passed between systems, rather than being wholly the responsibility of HDFS. Sometimes state only has meaning if data from multiple systems are combined. Development and testing stacks with four systems isn’t 33% harder than stacks with three systems either; it can be as much as four times as hard, depending on how the systems connect with each other.
Builders of such systems readily admit this, but some of them have found a “solution”.
So batch is too slow, and Fast Data offerings are less than robust and a bit too opaque. What to do?
Nathan Marz, formerly of Twitter and creator of the Apache Storm project, proposed the Lambda Architecture as the solution. Rather than address the flaws directly, you simply run both the batch and streaming systems in parallel. Lambda refers to the two systems as the “Speed Layer” and the “Batch Layer”. The Speed Layer can serve responses in seconds or milliseconds. The Batch Layer can be both a long-term record of historical data as well as a backup and consistency check for the speed layer. Proponents also argue that engineering work is easier to divide between teams when there are two discrete data paths. It’s easy to see the appeal.
But there’s no getting around the complexity of a Lambda solution. Running both layers in parallel, doing the same work, may add some redundancy, but it’s also adding more software, more hardware and more places where two systems need to be glued together. It’s this lack of natural integration that makes Lambda systems so difficult. The fact that the work can be divided amongst teams is itself a tacit acknowledgement that Lambda solutions may require “teams,” plural.
So why is the Lambda Architecture gaining acceptance? I believe there are two ways to answer. First, it’s because Lambda has some real advantages. For starters, it can scale. The number one rule of 21st century data management: If a problem can be solved with an instance of MySQL, it’s going to be. With Fast Data, we’re often talking about hundreds of thousands of events per second to process, something well beyond what MySQL can handle. There is a set of Lambda proponents who know that Lambda is complicated and operationally challenging, but are aware of few other options.
The alternative reason people turn to the Lambda Architecture is habit. People who work with Big Data are used to going to the Apache Software Foundation’s website, picking from a menu of software products, and connecting them up. Some people just want to wire things together from smaller cogs, whether it’s the easiest path or not.
Many voice similar arguments about the complexity and cost of typical Lambda solutions. The best way to back up such an argument is by pointing to an alternative. And the way to build a successful alternative to Lambda seems to be to improve the efficiency and robustness of the speed layer.
LinkedIn/Apache has Samza. Spark has a streaming flavor. Google has Millwheel. All of these make arguments that the speed layer can be less burdensome on a developer. Perhaps you’re not aware of VoltDB’s pitch as a Fast Data engine, but we feel that the design of the VoltDB system makes it one of the best possible approaches to this problem.
Read full article, discussing VoltDB in details.