Lambda architecture

This is an old revision of this page, as edited by Textractor (talk | contribs) at 19:59, 18 August 2014 (Examples of Lambda Architecture in Use). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.



Lambda architecture is a data-processing architecture designed to handle massive quantities of data while supporting ad-hoc queries and keeping the latency of those queries low. It attempts to balance comprehensiveness (including all data), accuracy, and latency when querying big-data collections.

Lambda architecture describes a system consisting of three layers:[1]

  • Batch – Precomputes results using a distributed processing system, typically Hadoop. This layer stores a master copy of the entire data set and acts as the system of record.
  • Serving – Responds to ad-hoc queries by gathering data from the batch layer or, if batch results are unavailable, from the speed layer.
  • Speed – Processes data streams without regard to fix-ups or completeness.
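The division of labour between the three layers can be illustrated with a minimal in-memory sketch. It is not taken from the cited sources: the class names, the `ingest` helper, and the page-view counting workload are all hypothetical, and plain Python containers stand in for systems such as Hadoop. The serving layer follows the description above and falls back to the speed layer only when the batch view has no result.

```python
from collections import Counter, defaultdict


class BatchLayer:
    """Keeps the master data set and precomputes a view from all of it."""

    def __init__(self):
        self.master = []   # append-only system of record
        self.view = {}     # precomputed results, stale until the next run

    def append(self, event):
        self.master.append(event)

    def recompute(self):
        # Full recomputation over the entire master data set.
        self.view = dict(Counter(page for page, _ts in self.master))


class SpeedLayer:
    """Updates counts incrementally as each event arrives."""

    def __init__(self):
        self.view = defaultdict(int)

    def process(self, event):
        page, _ts = event
        self.view[page] += 1


class ServingLayer:
    """Answers queries from the batch view, else from the speed view."""

    def __init__(self, batch, speed):
        self.batch = batch
        self.speed = speed

    def query(self, page):
        if page in self.batch.view:
            return self.batch.view[page]
        return self.speed.view.get(page, 0)


def ingest(event, batch, speed):
    """Hypothetical entry point: every event is sent to both paths."""
    batch.append(event)
    speed.process(event)
```

In this sketch a page that first appears after the last call to `recompute()` is still answerable, because the speed layer has already counted it; the batch view catches up on the next run.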

The architecture relies on a combination of computation techniques, such as partial recomputation (p. 287) and estimation (for example, HyperLogLog), as well as optimizations in resource usage (p. 293) and data transformations.
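As an illustration of the estimation technique mentioned here, the following is a simplified HyperLogLog sketch for approximate distinct counts. It is an assumption-laden teaching version, not the implementation used by any particular system: the class and method names are hypothetical, SHA-1 truncated to 64 bits is an arbitrary choice of hash, and the bias constant used is the one for 128 or more registers (so `p` should be at least 7).

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        width = 64 - self.p
        index = h >> width                           # first p bits pick a register
        rest = h & ((1 << width) - 1)                # remaining bits
        rank = width - rest.bit_length() + 1         # position of leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Sketches with the same p combine by register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)
```

Because two sketches merge by taking the register-wise maximum, a sketch built over batch data and one built over recent stream data can be combined at query time without revisiting the raw events.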

Criticism of lambda architecture has focused on its inherent complexity and on the constraints it imposes. The batch and streaming sides each require a separate code base, and both must be maintained and kept in sync so that the same data produces the same result on either path. Attempting to abstract the two code bases into a single framework, on the other hand, puts many of the specialized tools in each side's ecosystem out of reach.[2]
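The synchronization burden can be made concrete with a small sketch. The two implementations below compute the same per-key totals in different styles, and a check replays one set of events through both to confirm they agree. All names (`batch_totals`, `StreamTotals`, `paths_agree`) and the key/amount event shape are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def batch_totals(events):
    """Batch path: sort the whole data set, then aggregate per key."""
    ordered = sorted(events, key=itemgetter(0))
    return {
        key: sum(amount for _key, amount in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


class StreamTotals:
    """Streaming path: update a running total one event at a time."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, event):
        key, amount = event
        self.totals[key] += amount


def paths_agree(events):
    """Replay the same events through both paths and compare results."""
    stream = StreamTotals()
    for event in events:
        stream.on_event(event)
    return batch_totals(events) == dict(stream.totals)
```

A check of this kind has to be repeated whenever either implementation changes, which is the maintenance cost the criticism describes.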

In practice, each of the three layers can be built from any of a number of suitable components. For the serving layer, some implementations have used Cassandra to store data from the speed layer and ElephantDB to store data from the batch layer.[3]

Examples of Lambda Architecture in Use

Metamarkets employs a version of the lambda architecture that uses Druid, an open-source data store, for storing and serving both the streamed and the batch-processed data.[4] For running analytics on its advertising data warehouse, Yahoo has taken a similar approach, using Apache Storm, Hadoop, and Druid.[5]

References

  1. ^ Marz, Nathan, and Warren, James. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications, 2013, p. 13.
  2. ^ Kreps, Jay. "Questioning the Lambda Architecture". radar.oreilly.com. O'Reilly. Retrieved 15 August 2014.
  3. ^ Nathan Bijnens, [1], 11 December 2013, slide 24
  4. ^ Fangjin Yang, Gian Merlino [2], Real-time Analytics with Open Source Technologies, 30 July 2014, slide 42
  5. ^ Supreeth Rao, Sunil Gupta [3], Interactive Analytics in Human Time, 17 June 2014, slides 9 and 16