Content deleted Content added
Rewrite, what was a description of "Divide and conquer", not of map-reduce. |
m bullets; slight rewording for sentence flow |
||
Line 16:
==Overview==
* '''"Map" step:''' Each worker nodes applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed.
* '''"Shuffle" step:''' Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
* '''"
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel – though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is [[Associative property|associative]]. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large [[server farm]] can use MapReduce to sort a [[petabyte]] of data in only a few hours.<ref>{{cite web|last=Czajkowski|first=Grzegorz,|title=Sorting Petabytes with MapReduce - The Next Episode|url=http://googleresearch.blogspot.com/2011/09/sorting-petabytes-with-mapreduce-next.html|publisher=Google|accessdate=7 April 2014|author2=Marián Dvorský |author3=Jerry Zhao |author4=Michael Conley |archivedate=7 September 2011}}</ref> The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.
Line 34 ⟶ 32:
# '''Produce the final output''' – the MapReduce system collects all the Reduce output, and sorts it by ''K2'' to produce the final outcome.
In many situations, the input data might already be distributed ([[Shard (database architecture)|"sharded"]]) among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as close as possible to the Map-generated data they need to process.
|