MapReduce: Difference between revisions

Content deleted Content added
Rewrite, what was a description of "Divide and conquer", not of map-reduce.
Line 18:
'MapReduce' is a framework for processing [[Parallel computing|parallelizable]] problems across huge datasets using a large number of computers (nodes), collectively referred to as a [[Computer cluster|cluster]] (if all nodes are on the same local network and use similar hardware) or a [[Grid Computing|grid]] (if the nodes are shared across geographically and administratively distributed systems, and use more heterogenous hardware). Computational processing can occur on data stored either in a [[filesystem]] (unstructured) or in a [[database]] (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
 
'''"Map" step:''' Each worker nodes applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed.
'''"Map" step:''' The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level [[Tree (data structure)|tree]] structure. The worker node processes the smaller problem, and passes the answer back to its master node.
 
'''"ReduceShuffle" step:''' TheWorker masternodes noderedistribute data thenbased collectson the answersoutput tokeys all(produced by the sub-problems"map()" andfunction), combinessuch themthat inall somedata waybelonging to formone thekey output –is thelocated answer toon the problemsame itworker was originally trying to solvenode.
 
'''"Reduce" step:''' Worker nodes now process each group of output data, per key, in parallel.
 
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel&nbsp;&ndash; though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is [[Associative property|associative]]. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle&nbsp;&ndash; a large [[server farm]] can use MapReduce to sort a [[petabyte]] of data in only a few hours.<ref>{{cite web|last=Czajkowski|first=Grzegorz,|title=Sorting Petabytes with MapReduce - The Next Episode|url=http://googleresearch.blogspot.com/2011/09/sorting-petabytes-with-mapreduce-next.html|publisher=Google|accessdate=7 April 2014|author2=Marián Dvorský |author3=Jerry Zhao |author4=Michael Conley |archivedate=7 September 2011}}</ref> The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled&nbsp;&ndash; assuming the input data is still available.