Lambda architecture

Lambda architecture refers to a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate precomputed views, while simultaneously using real-time stream processing to provide dynamic views. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.[1]

Lambda architecture depends on a data model with an append-only, immutable data source that serves as a system of record. It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
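
A minimal sketch of this data model, assuming a simple fact-based event shape of entity, attribute, value and timestamp; the schema and function names are illustrative rather than drawn from any particular implementation:

    import time

    # Hypothetical append-only event log: events are timestamped facts that
    # are never updated or deleted once written.
    event_log = []

    def append_event(entity_id, attribute, value, timestamp=None):
        """Record a new immutable fact; earlier facts are never overwritten."""
        event_log.append({
            "entity": entity_id,
            "attribute": attribute,
            "value": value,
            "ts": timestamp if timestamp is not None else time.time(),
        })

    def current_state(entity_id):
        """Derive an entity's latest state by replaying its facts in time order."""
        state = {}
        for event in sorted(event_log, key=lambda e: e["ts"]):
            if event["entity"] == entity_id:
                state[event["attribute"]] = event["value"]
        return state

    # A user's location changes over time; both facts are retained, and the
    # natural time ordering determines the current state.
    append_event("user-1", "location", "Berlin", timestamp=1)
    append_event("user-1", "location", "Paris", timestamp=2)
    print(current_state("user-1"))   # {'location': 'Paris'}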

Overview

Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries.[2] The batch and speed layers both ingest from an immutable master copy of the entire data set, which acts as the system of record.

Batch Layer

The batch layer precomputes results using a distributed processing system that can handle very large quantities of data. It aims for the most accurate views possible by processing all available data when generating them. Apache Hadoop is the de facto batch-processing system used in most high-throughput architectures,[3] including lambda. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views.
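
As an illustration of what the batch layer computes, the following sketch recomputes a pageviews-per-URL view over the entire master dataset in a single process; the event shape is an assumption, and a real deployment would run the equivalent logic as a distributed job (for example on Hadoop):

    from collections import Counter

    def recompute_batch_view(master_dataset):
        """Recompute the batch view from scratch over the whole master dataset.

        master_dataset is an iterable of hypothetical pageview events such as
        {"url": "/home", "ts": 1}. The resulting view is read-only and
        completely replaces the previous batch view when published.
        """
        view = Counter()
        for event in master_dataset:
            view[event["url"]] += 1
        return dict(view)

    batch_view = recompute_batch_view([
        {"url": "/home", "ts": 1},
        {"url": "/about", "ts": 2},
        {"url": "/home", "ts": 3},
    ])
    print(batch_view)   # {'/home': 2, '/about': 1}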

Speed Layer

The speed layer processes data streams in real time, without the completeness or corrections that batch processing later provides. It sacrifices throughput in order to minimize latency by providing real-time views into the most recent data. Stream-processing technologies typically used in this layer include Apache Storm and Apache Spark. Output is typically stored in fast NoSQL databases, allowing for dynamic and correctable views.
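
A sketch of the corresponding speed-layer logic, assuming the same hypothetical pageview events: the real-time view is updated incrementally as each event arrives and is discarded once a new batch view covers those events. In practice this logic would run inside a stream processor and write to a fast store rather than hold state in memory:

    class RealtimeView:
        """Incrementally maintained view over events not yet absorbed by a batch run."""

        def __init__(self):
            self.counts = {}

        def handle_event(self, event):
            # Update the view immediately, trading completeness for low latency.
            url = event["url"]
            self.counts[url] = self.counts.get(url, 0) + 1

        def expire(self):
            # Drop the real-time view once a newer batch view covers its events.
            self.counts = {}

    realtime_view = RealtimeView()
    realtime_view.handle_event({"url": "/home", "ts": 4})
    print(realtime_view.counts)   # {'/home': 1}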

Serving Layer

Output from the batch and speed layers is stored in the serving layer, which responds to ad-hoc queries by building views from the processed data. A single store such as Druid can handle output from both layers, while dedicated stores are also used: examples include Apache Cassandra or Apache HBase for the speed layer, and ElephantDB or Cloudera Impala for the batch layer.
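
A sketch of how a serving-layer query might merge the two views, continuing the hypothetical pageview example; the merge function is illustrative rather than the behaviour of any particular store:

    def query_pageviews(url, batch_view, realtime_counts):
        """Answer a query by combining the precomputed batch view with the
        incremental real-time view, covering both historical and recent data."""
        return batch_view.get(url, 0) + realtime_counts.get(url, 0)

    # Hypothetical views: the batch view covers everything up to the last
    # batch run; the real-time view covers events seen since then.
    batch_view = {"/home": 2, "/about": 1}
    realtime_counts = {"/home": 1}
    print(query_pageviews("/home", batch_view, realtime_counts))   # 3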


The architecture relies on a combination of computation techniques, such as partial recomputation (Marz & Warren, p. 287) and estimation (for example, HyperLogLog), as well as optimizations in resource usage (p. 293) and data transformations. Recomputation is required for fault tolerance, while incremental computation algorithms may be selectively added to increase efficiency.
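
As an example of the estimation techniques mentioned above, the following is a minimal HyperLogLog sketch for approximating distinct counts (for instance, unique visitors); it is illustrative only and omits the small- and large-range corrections that a production implementation would include:

    import hashlib

    class HyperLogLog:
        """Minimal HyperLogLog sketch for approximate distinct counts."""

        def __init__(self, b=10):
            self.b = b                    # number of register-index bits
            self.m = 1 << b               # number of registers
            self.registers = [0] * self.m
            self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction, m >= 128

        def add(self, item):
            # 64-bit hash: the low b bits pick a register, the rest feed the rank.
            x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            j = x & (self.m - 1)
            w = x >> self.b
            rank = (64 - self.b) - w.bit_length() + 1   # leading zeros in w, plus one
            self.registers[j] = max(self.registers[j], rank)

        def count(self):
            # Normalized harmonic mean of the per-register estimates.
            estimate = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
            return int(estimate)

    hll = HyperLogLog()
    for i in range(100000):
        hll.add("user-%d" % i)
    print(hll.count())   # roughly 100000, typically within a few percent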


Lambda Architecture in Use

In practice, each of the three layers can be built from any of a number of suitable components. However, certain combinations of tools recur across published implementations.

For the serving layer, some implementations have used Cassandra to store and serve views from the speed layer, and ElephantDB to store and serve views from the batch layer.[4] Other implementations have used Apache HBase for serving views from the speed layer, and Cloudera's Impala for serving views from the batch layer.[5]

Metamarkets, which provides analytics for players in the programmatic advertising space, employs a version of the lambda architecture that uses Druid, an open-source data store, for storing and serving both the streamed and batch-processed data.[6] For running analytics on its advertising data warehouse, Yahoo has taken a similar approach, also using Apache Storm, Hadoop, and Druid.[7]

The Netflix Suro project has separate processing paths for data, but does not strictly follow lambda architecture since the paths may be intended to serve different purposes and not to provide the same type of views.[8] Nevertheless, the overall idea is to make selected real-time event data available to queries with very low latency, while the entire data set is also processed via a batch pipeline. The latter is intended for applications that are less sensitive to latency and require a map-reduce type of processing.

Criticism

Criticism of lambda architecture has focused on its inherent complexity and its limiting influence. The batch and streaming sides each require a different code base that must be maintained and kept in sync so that processed data produces the same result from both paths, while attempting to abstract the code bases into a single framework puts many of the specialized tools in each side's ecosystems out of reach.[9]

In a technical discussion over the merits of employing a pure streaming approach, it was noted that using a flexible streaming framework such as Apache Samza could provide some of the same benefits as batch processing without the latency.[10] Such a streaming framework could allow for collecting and processing arbitrarily large windows of data, accommodate blocking, and handle state.
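
A minimal sketch of the kind of stateful, windowed computation such a streaming framework enables; the window size and event shape are assumptions for illustration, and no specific framework API is used:

    from collections import defaultdict

    class WindowedCounter:
        """Keeps per-window counts as state inside the stream processor, so both
        recent and historical events flow through the same code path."""

        def __init__(self, window_seconds=3600):
            self.window_seconds = window_seconds
            self.state = defaultdict(int)   # (window_start, url) -> count

        def process(self, event):
            window_start = event["ts"] - (event["ts"] % self.window_seconds)
            self.state[(window_start, event["url"])] += 1

        def counts_for_window(self, window_start):
            return {url: n for (w, url), n in self.state.items() if w == window_start}

    counter = WindowedCounter()
    counter.process({"url": "/home", "ts": 3600})
    counter.process({"url": "/home", "ts": 3601})
    print(counter.counts_for_window(3600))   # {'/home': 2}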

References

  1. ^ Schuster, Werner. "Nathan Marz on Storm, Immutability in the Lambda Architecture, Clojure". www.infoq.com. Interview with Nathan Marz, 6 April 2014
  2. ^ Marz, Nathan; Warren, James. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications, 2013, p. 13.
  3. ^ Kar, Saroj. "Hadoop Sector will Have Annual Growth of 58% for 2013-2020", 28 May 2014. Cloud Times.
  4. ^ Bijnens, Nathan. "A real-time architecture using Hadoop and Storm". 11 December 2013, slide 24
  5. ^ Kinley, James. "The Lambda architecture: principles for architecting realtime Big Data systems". jameskinley.tumblr.com. http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
  6. ^ Yang, Fangjin, and Merlino, Gian. "Real-time Analytics with Open Source Technologies". 30 July 2014, slide 42
  7. ^ Rao, Supreeth, and Gupta, Sunil. "Interactive Analytics in Human Time". 17 June 2014, slides 9 and 16
  8. ^ Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir. "Announcing Suro: Backbone of Netflix's Data Pipeline", Netflix, 9 December 2013
  9. ^ Kreps, Jay. "Questioning the Lambda Architecture". radar.oreilly.com. O'Reilly. Retrieved 15 August 2014.
  10. ^ Hacker News discussion. Retrieved 20 August 2014.