Revision as of 15:28, 1 December 2020 by 65.92.13.144 (talk) (→Hadoop)
Revision as of 23:30, 27 April 2021 by Comp.arch (talk | contribs) (m, no edit summary; tag: 2017 wikitext editor)
Line 35:
===MapReduce===
The [[MapReduce]] architecture and programming model pioneered by [[Google]] is an example of a modern systems architecture designed for data-intensive computing.<ref>[http://labs.google.com/papers/mapreduce-osdi04.pdf MapReduce: Simplified Data Processing on Large Clusters] {{Webarchive|url=https://web.archive.org/web/20091223010101/http://labs.google.com/papers/mapreduce-osdi04.pdf |date=2009-12-23 }} by J. Dean and S. Ghemawat. Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI), 2004.</ref> The MapReduce architecture allows programmers to use a functional programming style to create a map function that processes a [[~~key-value~~attribute–value pair|key–value pair]] associated with the input data to generate a set of intermediate [[~~key-value~~attribute–value pair|key–value pairs]], and a reduce function that merges all intermediate values associated with the same intermediate key. Because the system automatically handles details such as partitioning the input data, scheduling and executing tasks across a processing cluster, and managing the communications between nodes, programmers with no experience in parallel programming can use a large distributed processing environment.

The programming model for the [[MapReduce]] architecture is a simple abstraction in which the computation takes a set of input ~~key-value~~key–value pairs associated with the input data and produces a set of output ~~key-value~~key–value pairs. In the Map phase, the input data is partitioned into input splits that are assigned to Map tasks associated with processing nodes in the cluster. A Map task typically executes on the same node that holds its assigned partition of data. These Map tasks perform user-specified computations on each input ~~key-value~~key–value pair from the partition of input data assigned to the task, and generate a set of intermediate results for each key.
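The map and reduce functions described above can be illustrated with a minimal single-process word-count sketch. It is not Google's or Hadoop's implementation: the names <code>map_fn</code>, <code>reduce_fn</code> and <code>run_mapreduce</code> are hypothetical, the input records are made up for illustration, and the in-memory grouping step merely stands in for the partitioning, scheduling and communication that a real framework performs across a cluster.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit an intermediate (word, 1) pair for every word in one line of text."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same intermediate key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver; a real framework distributes this work."""
    # Apply the map function to every input key-value pair and group the
    # intermediate values by their intermediate key.
    groups = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    # Apply the reduce function once per intermediate key, in sorted key order.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


# Illustrative input: (line number, line text) pairs.
records = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
counts = run_mapreduce(records, map_fn, reduce_fn)
print(counts)
```

Here the user supplies only the two small functions; everything inside <code>run_mapreduce</code> corresponds to work that the system would otherwise carry out automatically.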
The shuffle and sort phase then takes the intermediate data generated by each Map task, sorts it together with intermediate data from other nodes, divides it into regions to be processed by the Reduce tasks, and distributes it as needed to the nodes where the Reduce tasks will execute. The Reduce tasks perform additional user-specified operations on the intermediate data, possibly merging the values associated with a key into a smaller set of values, to produce the output data. For more complex data processing procedures, multiple MapReduce calls may be linked together in sequence.

===Hadoop===

Data-intensive computing: Difference between revisions