Lambda architecture: Difference between revisions

Textractor (talk | contribs)
m Remove unnecessary comments and blank line
 
(97 intermediate revisions by 64 users not shown)
{{Short description|Data-processing architecture}}
[[File:Diagram of Lambda Architecture (generic).png|thumb|Flow of data through the processing and serving layers of a generic lambda architecture]]
 
'''Lambda architecture''' is a [[data processing|data-processing]] architecture designed to handle massive quantities of data by taking advantage of both [[batch processing|batch]] and [[stream processing|stream-processing]] methods. This approach to architecture attempts to balance [[latency (engineering)|latency]], [[throughput]], and [[fault-tolerance]] by using batch processing to provide comprehensive and accurate precomputed views of batch data, while simultaneously using real-time stream processing to provide dynamic views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of [[big data]], real-time analytics, and the drive to mitigate the latencies of the [[map-reduce]] approach.<ref>{{cite web|last1=Schuster|first1=Werner|title=Nathan Marz on Storm, Immutability in the Lambda Architecture, Clojure|url=http://www.infoq.com/interviews/marz-lambda-architecture|website=www.infoq.com}} Interview with Nathan Marz, 6 April 2014</ref>
 
 
Lambda architecture depends on a data model with an append-only, immutable data source that serves as a system of record.<ref name=bijnens-slide>Bijnens, Nathan. [http://lambda-architecture.net/architecture/2013-12-11-a-real-time-architecture-using-hadoop-and-storm-devoxx "A real-time architecture using Hadoop and Storm"]. 11 December 2013.</ref>{{rp|32}} It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
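The append-only data model described above can be illustrated with a minimal sketch. The names <code>Event</code>, <code>EventLog</code> and <code>current_state</code> are hypothetical and chosen only for this illustration; they are not taken from any cited source or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class Event:
    timestamp: int  # event time, e.g. seconds since some epoch (assumed unit)
    key: str
    value: int


class EventLog:
    """Hypothetical append-only system of record: events are added, never overwritten."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def events(self):
        # Return a tuple so callers cannot mutate the stored log.
        return tuple(self._events)


def current_state(log):
    """Derive the latest value per key from the time ordering of the events."""
    state = {}
    for event in sorted(log.events(), key=lambda e: e.timestamp):
        state[event.key] = event.value
    return state


log = EventLog()
log.append(Event(timestamp=1, key="alice", value=10))
log.append(Event(timestamp=3, key="alice", value=25))
log.append(Event(timestamp=2, key="bob", value=7))
state = current_state(log)
```

In this sketch the earlier event for <code>alice</code> remains in the log; only the derived state reflects the later value.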
 
==Overview==
Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries.<ref name=big-data>Marz, Nathan; Warren, James. ''Big Data: Principles and best practices of scalable realtime data systems''. Manning Publications, 2013, p. 13.</ref>{{rp|13}} The processing layers ingest from an immutable master copy of the entire data set. This paradigm was first described by Nathan Marz in a blog post titled "How to beat the [[CAP theorem]]" in which he originally termed it the "batch/realtime architecture".<ref name=marz-blog>Marz, Nathan. [http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html "How to beat the CAP theorem"]. 13 October 2011.</ref>
 
===Batch layer===
The batch layer precomputes results using a distributed processing system that can handle very large quantities of data. The batch layer aims at perfect accuracy by being able to process ''all'' available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views.<ref name=big-data />{{rp|18}}
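As a rough illustration of full recomputation, the following sketch rebuilds a page-view count from the whole data set on every run and replaces the previous view wholesale. The function name <code>recompute_batch_view</code> and the event fields <code>page</code> and <code>ts</code> are assumptions made for the example; a real batch layer would run on a distributed system rather than an in-memory list.

```python
from collections import Counter


def recompute_batch_view(master_dataset):
    """Rebuild the view from scratch over all events (no incremental state)."""
    view = Counter()
    for event in master_dataset:
        view[event["page"]] += 1
    return dict(view)


# Hypothetical master data set: an append-only list of events.
master_dataset = [
    {"page": "/home", "ts": 1},
    {"page": "/docs", "ts": 2},
    {"page": "/home", "ts": 3},
]

batch_view = recompute_batch_view(master_dataset)

# New data arrives; the next batch run recomputes over everything and the
# resulting view replaces the old one instead of patching it.
master_dataset.append({"page": "/docs", "ts": 4})
batch_view_v2 = recompute_batch_view(master_dataset)
```

Because each run starts from the raw events, a bug in the view logic can be corrected by fixing the function and running it again.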
 
By 2014, [[Apache Hadoop]] was estimated to be a leading batch-processing system.<ref>Kar, Saroj. [http://cloudtimes.org/2014/05/28/hadoop-sector-will-have-annual-growth-of-58-for-2013-2020/ "Hadoop Sector will Have Annual Growth of 58% for 2013-2020"] {{Webarchive|url=https://archive.today/20140826020014/http://cloudtimes.org/2014/05/28/hadoop-sector-will-have-annual-growth-of-58-for-2013-2020/ |date=26 August 2014}}, 28 May 2014. ''Cloud Times''.</ref> Later, relational databases such as [[Snowflake Inc.|Snowflake]], Redshift, Synapse and BigQuery were also used in this role.

===Speed layer===
[[File:Diagram of Lambda Architecture (named components).png|thumb|Diagram showing the flow of data through the processing and serving layers of lambda architecture. Example named components are shown.]]
The speed layer processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views into the most recent data. Essentially, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received, and can be replaced when the batch layer's views for the same data become available.<ref name=big-data />{{rp|203}}
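The gap-filling role of the speed layer might be sketched as follows. The class <code>SpeedLayer</code>, its method names and the notion of a <code>batch_watermark</code> (the timestamp up to which the batch view is complete) are hypothetical and used only for illustration.

```python
class SpeedLayer:
    """Hypothetical real-time view over events not yet covered by a batch view."""

    def __init__(self):
        self._recent = []  # events newer than the last completed batch run
        self.view = {}     # page -> count over the recent events only

    def on_event(self, event):
        # Incremental update: one event at a time, no recomputation.
        self._recent.append(event)
        page = event["page"]
        self.view[page] = self.view.get(page, 0) + 1

    def discard_through(self, batch_watermark):
        # Once a batch view covers events up to the watermark, the speed
        # layer drops them and keeps only what the batch layer has not seen.
        self._recent = [e for e in self._recent if e["ts"] > batch_watermark]
        self.view = {}
        for e in self._recent:
            self.view[e["page"]] = self.view.get(e["page"], 0) + 1


speed = SpeedLayer()
speed.on_event({"page": "/home", "ts": 5})
speed.on_event({"page": "/docs", "ts": 6})
speed.on_event({"page": "/home", "ts": 7})
view_before = dict(speed.view)

speed.discard_through(batch_watermark=5)
view_after = dict(speed.view)
```

The sketch trades completeness for latency: the view is available as soon as each event arrives, and is discarded once the batch layer catches up.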
 
Stream-processing technologies typically used in this layer include [[Apache Kafka]], [[Amazon Web Services|Amazon Kinesis]], [[Storm (event processor)|Apache Storm]], SQLstream, [[Apache Samza]], [[Apache Spark]], [[Azure Stream Analytics]], and [[Apache Flink]]. Output is typically stored on fast NoSQL databases,<ref name=kinley>Kinley, James. [http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting "The Lambda architecture: principles for architecting realtime Big Data systems"] {{Webarchive|url=https://web.archive.org/web/20140904183723/http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting |date=2014-09-04 }}, retrieved 26 August 2014.</ref><ref>Ferrera Bertran, Pere. [https://web.archive.org/web/20190312082929/http://www.datasalt.com/2014/01/lambda-architecture-a-state-of-the-art/ "Lambda Architecture: A state-of-the-art"]. 17 January 2014, Datasalt.</ref> or as a commit log.<ref name=commit_log>Confluent. [https://developer.confluent.io/what-is-apache-kafka/#kafka-and-events--keyvalue-pairs "Kafka and Events – Key/Value Pairs"], retrieved 06 October 2022.</ref>
 
===Serving layer===
[[File:Diagram of Lambda Architecture (Druid data store).png|thumb|Diagram showing a lambda architecture with a Druid data store.]]
Output from the batch and speed layers is stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data.
 
Examples of technologies used in the serving layer include [[Apache Druid]], [[Apache Pinot]], [[ClickHouse]] and [https://tinybird.co Tinybird] which provide a single platform to handle output from both layers.<ref name=metamarkets-lambda>Yang, Fangjin, and Merlino, Gian. [https://speakerdeck.com/druidio/real-time-analytics-with-open-source-technologies-1 "Real-time Analytics with Open Source Technologies"]. 30 July 2014.</ref> Dedicated stores used in the serving layer include [[Apache Cassandra]], [[Apache HBase]], [[Cosmos DB|Azure Cosmos DB]], [[MongoDB]], [[VoltDB]] or [[Elasticsearch]] for speed-layer output, and [https://github.com/nathanmarz/elephantdb Elephant DB], [[Apache Impala]], [[SAP HANA]] or [[Apache Hive]] for batch-layer output.<ref name=bijnens-slide />{{rp|45}}<ref name=kinley />
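How a serving layer might answer a query from both views can be sketched as below. The function <code>query</code> and the sample views are hypothetical. The sketch assumes an additive metric (counts) and that the batch and speed views cover disjoint time ranges, so that simple addition does not double-count events.

```python
def query(page, batch_view, speed_view):
    """Combine the precomputed batch view with the real-time view for one key."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)


# Assumed sample data: the batch view covers events up to the last batch
# run, and the speed view covers only events that arrived after it.
batch_view = {"/home": 2, "/docs": 2}
speed_view = {"/home": 1}

total_home = query("/home", batch_view, speed_view)
total_docs = query("/docs", batch_view, speed_view)
total_missing = query("/pricing", batch_view, speed_view)
```

Metrics that are not additive, such as distinct counts, would need a different merge step than the one shown here.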
 
==Optimizations==
To optimize the data set and improve query efficiency, various rollup and aggregation techniques are executed on raw data,<ref name=metamarkets-lambda />{{rp|23}} while estimation techniques are employed to further reduce computation costs.<ref>Ray, Nelson. [https://metamarkets.com/2013/histograms/ "The Art of Approximating Distributions: Histograms and Quantiles at Scale"]. 12 September 2013. Metamarkets.</ref> And while expensive full recomputation is required for fault tolerance, incremental computation algorithms may be selectively added to increase efficiency, and techniques such as ''partial computation'' and resource-usage optimizations can effectively help lower latency.<ref name=big-data />{{rp|93,287,293}}
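A rollup of the kind mentioned above can be sketched as follows: raw events are collapsed into one row per hour and page, so that queries scan far fewer rows. The function name <code>rollup_hourly</code>, the field names and the one-hour granularity are assumptions made for the example.

```python
from collections import defaultdict


def rollup_hourly(events):
    """Aggregate raw events into (hour_start, page) -> count buckets."""
    buckets = defaultdict(int)
    for event in events:
        # Truncate the timestamp (assumed to be in seconds) to the hour start.
        hour_start = event["ts"] - event["ts"] % 3600
        buckets[(hour_start, event["page"])] += 1
    return dict(buckets)


raw = [
    {"ts": 10, "page": "/home"},
    {"ts": 1800, "page": "/home"},
    {"ts": 3700, "page": "/home"},
    {"ts": 3900, "page": "/docs"},
]
rolled = rollup_hourly(raw)
```

The rolled-up data loses per-event detail below the chosen granularity, which is the cost paid for the smaller data set.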
 
==Lambda architecture in use==
Metamarkets, which provides analytics for companies in the programmatic advertising space, employs a version of the lambda architecture that uses [[Druid (open-source data store)|Druid]] for storing and serving both the streamed and batch-processed data.<ref name=metamarkets-lambda />{{rp|42}}
 
For running analytics on its advertising data warehouse, [[Yahoo]] has taken a similar approach, also using [[Storm (event processor)|Apache Storm]], [[Hadoop|Apache Hadoop]], and [[Druid (open-source data store)|Druid]].<ref name=yahoo-lambda>Rao, Supreeth; Gupta, Sunil. [http://www.slideshare.net/Hadoop_Summit/interactive-analytics-in-human-time?next_slideshow=1 "Interactive Analytics in Human Time"]. 17 June 2014</ref>{{rp|9,16}}
 
The [[Netflix]] Suro project has separate processing paths for data, but does not strictly follow lambda architecture since the paths may be intended to serve different purposes and not necessarily to provide the same type of views.<ref name="netflix">Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir. [http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html "Announcing Suro: Backbone of Netflix's Data Pipeline"], ''[[Netflix]]'', 9 December 2013</ref> Nevertheless, the overall idea is to make selected real-time event data available to queries with very low latency, while the entire data set is also processed via a batch pipeline. The latter is intended for applications that are less sensitive to latency and require a map-reduce type of processing.
 
==Criticism and alternatives==
Criticism of lambda architecture has focused on its inherent complexity and its limiting influence. The batch and streaming sides each require a different code base that must be maintained and kept in sync so that processed data produces the same result from both paths. Yet attempting to abstract the code bases into a single framework puts many of the specialized tools in the batch and real-time ecosystems out of reach.<ref name="kappa">{{cite web |last1=Kreps |first1=Jay |title=Questioning the Lambda Architecture |url=https://www.oreilly.com/radar/questioning-the-lambda-architecture/ |accessdate=3 October 2024 |website=radar.oreilly.com |date=2 July 2014 |publisher=Oreilly |ref=kreps}}</ref>
 
===Kappa architecture===
Jay Kreps introduced the kappa architecture to use a pure streaming approach with a single code base.<ref name=kappa /> In a technical discussion over the merits of employing a pure streaming approach, it was noted that using a flexible streaming framework such as [[Apache Samza]] could provide some of the same benefits as batch processing without the latency.<ref>[https://news.ycombinator.com/item?id=7976785 Hacker News] retrieved 20 August 2014</ref> Such a streaming framework could allow for collecting and processing arbitrarily large windows of data, accommodate blocking, and handle state.
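The single-code-base idea can be sketched as one processing function applied to a retained event log, with reprocessing done by replaying the same log through a revised version of that function. This is a simplified illustration under stated assumptions, not Kreps's implementation: the function <code>build_view</code>, the <code>normalize</code> parameter and the in-memory list standing in for a durable log are all hypothetical.

```python
def build_view(log, normalize=lambda page: page):
    """Single processing path: the same code handles live and replayed events."""
    view = {}
    for event in log:
        page = normalize(event["page"])
        view[page] = view.get(page, 0) + 1
    return view


# Hypothetical retained log; in practice this would be a durable stream.
log = [{"page": "/Home"}, {"page": "/docs"}, {"page": "/home"}]

# First version of the job treats page names as case-sensitive.
view_v1 = build_view(log)

# After a logic change, the log is replayed through the new version to
# produce a corrected view; no separate batch code path is involved.
view_v2 = build_view(log, normalize=str.lower)
```

In this sketch the corrected view is obtained without maintaining a second implementation, at the cost of retaining enough of the log to replay.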
 
==See also==
*[[Event stream processing]]
 
== References ==
{{Reflist}}
 
[[Category:Data processing]]
[[Category:Big data]]
[[Category:Free software projects]]
[[Category:Software architecture]]
 
[[Category:Data engineering]]
== External links ==
* [http://lambda-architecture.net/ Repository of Information on Lambda Architecture]
 
 