MapReduce: Difference between revisions

==Overview==
 
'''MapReduce''' is a framework for processing [[Parallel computing|parallelizable]] problems across huge datasets using a large number of computers (nodes), collectively referred to as a [[Computer cluster|cluster]] (if all nodes are on the same local network and use similar hardware) or a [[Grid Computing|grid]] (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a [[filesystem]] (unstructured) or in a [[database]] (structured). MapReduce can take advantage of the locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
 
* '''"Map" step:''' Each worker node applies the "map()" function to its local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.
* '''"Shuffle" step:''' Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
* '''"Reduce" step:''' Worker nodes now process each group of output data, per key, in parallel.
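The three steps above can be sketched in a few lines of Python. This is a single-process illustration of the data flow, not how a real cluster distributes work; the function names (`map_step`, `shuffle_step`, `reduce_step`) and the word-count task are chosen here for illustration:

```python
from collections import defaultdict

def map_step(document):
    # "Map" step: emit (key, value) pairs; here, one (word, 1) per word
    return [(word, 1) for word in document.split()]

def shuffle_step(mapped_pairs):
    # "Shuffle" step: redistribute pairs so all values for one key are together
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # "Reduce" step: combine all values for one key into a single result
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog"]
mapped = [pair for doc in documents for pair in map_step(doc)]
grouped = shuffle_step(mapped)
result = dict(reduce_step(k, v) for k, v in grouped.items())
# result["the"] == 2
```

On a cluster, each `map_step` and `reduce_step` call would run on a different worker node, and the shuffle would move data across the network between them.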
 
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel&nbsp;&ndash; though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is [[Associative property|associative]]. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle&nbsp;&ndash; a large [[server farm]] can use MapReduce to sort a [[petabyte]] of data in only a few hours.<ref>{{cite web|last=Czajkowski|first=Grzegorz|title=Sorting Petabytes with MapReduce - The Next Episode|url=http://googleresearch.blogspot.com/2011/09/sorting-petabytes-with-mapreduce-next.html|publisher=Google|accessdate=7 April 2014|author2=Marián Dvorský |author3=Jerry Zhao |author4=Michael Conley |archivedate=7 September 2011}}</ref> The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled&nbsp;&ndash; assuming the input data is still available.
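The independence of map operations, and the role of associativity in the reduce phase, can be illustrated with a small Python sketch. A thread pool stands in for a cluster of workers here, purely as an assumption for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def mapper(chunk):
    # each map invocation depends only on its own chunk,
    # so all invocations can safely run in parallel
    return sum(chunk)

data_chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(mapper, data_chunks))

# because addition is associative, the partial results can be
# combined in any grouping without changing the final answer
total = reduce(lambda a, b: a + b, partial_sums)
# total == 45
```

If the reduction function were not associative, all values sharing a key would instead have to reach a single reducer together, as the paragraph above notes.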
# '''Produce the final output''' – the MapReduce system collects all the Reduce output, and sorts it by ''K2'' to produce the final outcome.
 
These five steps can be logically thought of as running in sequence – each step starts only after the previous step is completed – although in practice they can be interleaved, as long as the final result is not affected.
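The five-step flow, including the final sort by ''K2'', can be sketched in Python. The task (maximum temperature per city), the input keyed by filename, and the helper names are illustrative assumptions, not part of any particular MapReduce implementation:

```python
from itertools import groupby

def map_fn(k1, records):
    # step 2: Map() takes a (K1, v1) pair and emits a list of (K2, v2) pairs;
    # here each record is already a (city, temperature) pair
    return list(records)

def reduce_fn(k2, temps):
    # step 4: Reduce() collapses all values for one K2 key
    return max(temps)

# step 1: prepare the Map input, keyed by K1 (here, a filename)
inputs = {"file1": [("NYC", 22), ("LA", 30)],
          "file2": [("NYC", 25), ("LA", 28)]}

# step 2: run Map over every (K1, v1) pair
mapped = [pair for k1, recs in inputs.items() for pair in map_fn(k1, recs)]

# step 3: shuffle, grouping every v2 under its K2 key
mapped.sort(key=lambda kv: kv[0])
grouped = {k2: [v for _, v in pairs]
           for k2, pairs in groupby(mapped, key=lambda kv: kv[0])}

# steps 4-5: Reduce each group, then sort the final output by K2
final = sorted((k2, reduce_fn(k2, vs)) for k2, vs in grouped.items())
# final == [("LA", 30), ("NYC", 25)]
```

In this in-memory sketch the steps run strictly in sequence; a real system could interleave them, for example by starting the shuffle while mappers are still emitting pairs.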
 
In many situations, the input data might already be distributed ([[Shard (database architecture)|"sharded"]]) among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as close as possible to the Map-generated data they need to process.