
Lambda Architecture for Big Data Systems

Big data analytical ecosystem architecture is in the early stages of development. Unlike traditional data warehouse / business intelligence (DW/BI) architecture, which is designed for structured, internal data, big data systems work with raw unstructured and semi-structured data as well as with both internal and external data sources. Additionally, organizations may need both batch and (near) real-time data processing capabilities from big data systems.

Lambda architecture, developed by Nathan Marz, provides a clear set of architectural principles that allows batch and real-time (stream) data processing to work together while building immutability and recomputation into the system. In batch processing, a group of transactions is collected over a period of time: the data is collected, entered, and processed, and then the batch results are produced. Batch processing requires separate programs for input, processing and output; payroll and billing systems are classic examples. In contrast, real-time data processing involves a continual input, processing and output of data, and each item must be handled within a small time window (near real-time); customer service systems and bank ATMs are examples.
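
To make the contrast concrete, here is a minimal sketch in plain Python; the record fields and function names are illustrative, not taken from any particular framework:

    # Toy contrast between batch and (near) real-time processing.
    # All field and function names here are illustrative.

    def batch_process(records):
        """Batch: operate on a whole group of records collected over time."""
        total = sum(r["amount"] for r in records)
        return {"record_count": len(records), "total_amount": total}

    def stream_process(record, running_view):
        """Real-time: fold each arriving record into the output immediately."""
        running_view["record_count"] += 1
        running_view["total_amount"] += record["amount"]
        return running_view

    collected = [{"amount": 10}, {"amount": 25}, {"amount": 5}]
    print(batch_process(collected))        # one result for the whole batch

    view = {"record_count": 0, "total_amount": 0}
    for event in collected:                # events arriving one at a time
        view = stream_process(event, view)
        print(view)                        # output updated continually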

Lambda architecture has three layers:

  • Batch Layer
  • Serving Layer
  • Speed Layer

Batch Layer (Apache Hadoop)

Hadoop is an open source platform for storing and processing massive amounts of data. Because the master data set is immutable and views are recomputed from it, Lambda architecture provides "human fault-tolerance": a human error can be remedied simply by deleting the bad data and recomputing the views (immutability and recomputation).

The batch layer stores the master data set (in HDFS) and computes arbitrary views from it (with MapReduce). View computation is continuous: each MapReduce iteration recomputes the views over the entire data set, folding in newly arrived data. Because the batch layer recomputes rather than incrementally updates, views are refreshed infrequently, which introduces latency.
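
A minimal sketch of the batch layer's recompute-from-scratch model, written in plain Python rather than actual MapReduce; the map/reduce function names mirror the paradigm, and the pageview records are hypothetical:

    from collections import defaultdict

    # Immutable master data set: facts are only ever appended, never updated.
    MASTER_DATA_SET = [
        {"url": "/home", "user": "alice"},
        {"url": "/home", "user": "bob"},
        {"url": "/pricing", "user": "alice"},
    ]

    def map_phase(record):
        """Emit a (key, value) pair per record, MapReduce-style."""
        yield record["url"], 1

    def reduce_phase(key, values):
        """Aggregate all values for one key into the batch view entry."""
        return key, sum(values)

    def recompute_batch_view(master):
        """Recompute the view from the ENTIRE data set on every iteration."""
        grouped = defaultdict(list)
        for record in master:
            for key, value in map_phase(record):
                grouped[key].append(value)
        return dict(reduce_phase(k, vs) for k, vs in grouped.items())

    batch_view = recompute_batch_view(MASTER_DATA_SET)
    print(batch_view)  # {'/home': 2, '/pricing': 1}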

Serving Layer (Real-time Queries)

The serving layer indexes the precomputed views and exposes them so they can be queried ad hoc with low latency. Open source real-time Hadoop query engines like Cloudera Impala, Hortonworks Stinger, Apache Drill (modeled on Google's Dremel) and Spark Shark can query the views immediately. Hadoop can store and process large data sets, and these tools can query that data fast. At this time Spark Shark outperforms the others, given its in-memory processing, and offers greater flexibility for machine learning workloads.
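
The serving layer itself computes nothing; it simply indexes the latest batch view for fast random reads. A toy Python stand-in follows (a real deployment would sit behind an engine such as Impala or Shark):

    class ServingLayer:
        """Indexes a precomputed batch view for low-latency point queries.
        A toy stand-in for a real serving store; it never computes views."""

        def __init__(self):
            self._index = {}

        def load_batch_view(self, batch_view):
            # Swap in the freshly recomputed view wholesale.
            self._index = dict(batch_view)

        def query(self, key):
            # Indexed read instead of a scan over the master data set.
            return self._index.get(key, 0)

    serving = ServingLayer()
    serving.load_batch_view({"/home": 2, "/pricing": 1})
    print(serving.query("/home"))  # 2, answered without touching raw data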

Note that MapReduce has high latency, so a speed layer is needed for real-time results.

Speed Layer (Distributed Stream Processing)

The speed layer compensates for the batch layer's high latency by computing real-time views in open source distributed stream processing systems such as Storm and S4. These systems provide:

  • Stream processing
  • Distributed continuous computation
  • Fault tolerance
  • Modular design

In the speed layer, real-time views are incremented as new data arrives. Lambda architecture provides "complexity isolation": the most complex part of the system, incremental real-time computation, is pushed into a layer whose results are transient, since real-time views can be discarded once the batch layer has absorbed the same data.
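
Continuing the hypothetical pageview example, the sketch below shows a speed layer incrementing a transient real-time view and a query merging it with the batch view. Merging by addition works here only because counts are additive; a real system picks a merge function appropriate to each view:

    from collections import defaultdict

    class SpeedLayer:
        """Maintains transient real-time views over data the batch layer
        has not absorbed yet. Views are incremented, not recomputed."""

        def __init__(self):
            self.realtime_view = defaultdict(int)

        def on_new_data(self, record):
            self.realtime_view[record["url"]] += 1   # incremental update

        def discard(self):
            # Once a batch recompute has absorbed this data, throw it away:
            # "complexity isolation" - errors here never touch the master set.
            self.realtime_view.clear()

    def answer_query(url, batch_view, speed_layer):
        """Lambda query = merge(batch view, real-time view)."""
        return batch_view.get(url, 0) + speed_layer.realtime_view[url]

    speed = SpeedLayer()
    speed.on_new_data({"url": "/home", "user": "carol"})  # after last batch run
    print(answer_query("/home", {"/home": 2}, speed))     # 2 batch + 1 real-time = 3
    speed.discard()                                       # after the next batch run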

The decision to implement Lambda architecture depends on the need for real-time data processing and for human fault-tolerance. Immutability and human fault-tolerance, along with precomputation and recomputation, deliver significant benefits.

Lambda implementation issues include finding the talent to build a scalable batch processing layer. At this time there is a shortage of professionals with the expertise and experience to work with Hadoop, MapReduce, HDFS, HBase, Pig, Hive, Cascading, Scalding, Storm, Spark Shark and other new technologies.

Comments

Comment by Theodore Omtzigt on December 31, 2013 at 7:04am

I feel that we are just in the first phase of learning how to build distributed, scalable, big data architectures. Many of the core algorithms that create knowledge from raw data are based on constraint solvers, and the best known methods for these algorithms run 50-100x SLOWER on MapReduce or Storm/S4. From a programming-model standpoint, the MPMD (Multiple Program Multiple Data) form of MPI can absorb both, at the cost of requiring more skilled programmers and/or longer development cycles; these are the key pain points driving the reinvention of distributed system design around MapReduce and streaming models. However, the 50-100x performance hit implies that these solutions are 50-100x MORE expensive from an execution point of view, so they are very poor candidates for cloud computing, where execution efficiency has an immediate cost impact. Similarly, if you already have a 10,000-server farm, doubling your capacity would be more expensive than moving to a more efficient algorithm. All these constraints are slowly being felt by folks that have an economic incentive to solve them, and we already have a significant treasure trove of results in computer science that can point to 100x improvements; it is just a matter of finding the money to apply them.

The combination of MapReduce and streaming computation is this first experiment. I feel that a better architecture is provided by the data fusion model, in which computation (constraint solving) occurs in real time at the point where data size constraints are prohibitive. As there are already a handful of experiments applying these techniques to different big data problems, I predict significant change in the big data architecture space in the next couple of years.

Comment by Michael Walker on December 16, 2013 at 10:24am

Jefferson: Great points. Depends on what you mean by "enterprise's information provision architecture". The traditional DW/BI architecture is necessary at this time to accurately record and distribute structured transactional data. Big data infrastructure architecture requires innovation and evolution before it can replace the traditional design. Yet I predict a paradigm shift in architectures will happen in the future to allow better integration between different data sources and structures.

Comment by Jefferson Lynch on December 10, 2013 at 5:24am

Hi Michael, I have a question regarding the "Serving Layer" in the above architecture.  I'm really interested to hear your opinion. 

At a seminar on Hadoop by IBM in October, the presenter listed a comparison of Hadoop and RDBMS technologies which I found helpful. Attributes compared included "Data Updates" (only inserts and deletes, vs. updates too for RDBMS), "Data Integrity" (data loss can sometimes happen and may be permissible in some situations, vs. data loss is unacceptable for RDBMS), and "Data Access" (streaming access to files only, vs. indexed random access for RDBMS), among many more. Benefits were listed both ways; for the sake of argument I have highlighted a few where RDBMS has some advantages over Hadoop. There also seemed to be an acceptance that Hadoop was best suited to situations where long and often unpredictable latency was acceptable.

So my question is: do you think just having a Hadoop HDFS capability for your batch layer is sufficient as an enterprise's information provision architecture?  On re-reading I see your article is headed "... for Big Data systems", so maybe you have in mind that the architecture you describe is supplemented by something else? 

Thanks, Jefferson
