Revision as of 17:36, 4 October 2014 edit WilliamKF (talk \| contribs) Extended confirmed users, Pending changes reviewers 8,422 edits →top ← Previous edit		Revision as of 11:26, 5 October 2014 edit undo 178.2.129.72 (talk) Rewrite, what was a description of "Divide and conquer", not of map-reduce. Next edit →
Line 18: 'MapReduce' is a framework for processing [[Parallel computing\|parallelizable]] problems across huge datasets using a large number of computers (nodes), collectively referred to as a [[Computer cluster\|cluster]] (if all nodes are on the same local network and use similar hardware) or a [[Grid Computing\|grid]] (if the nodes are shared across geographically and administratively distributed systems, and use more heterogenous hardware). Computational processing can occur on data stored either in a [[filesystem]] (unstructured) or in a [[database]] (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted. '''"Map" step:''' Each worker nodes applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed. '''"Map" step:''' The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level [[Tree (data structure)\|tree]] structure. The worker node processes the smaller problem, and passes the answer back to its master node. '''"~~Reduce~~Shuffle" step:''' ~~The~~Worker ~~master~~nodes ~~node~~redistribute data ~~then~~based ~~collects~~on the ~~answers~~output tokeys ~~all~~(produced by the ~~sub-problems~~"map()" ~~and~~function), ~~combines~~such ~~them~~that inall ~~some~~data ~~way~~belonging to ~~form~~one ~~the~~key ~~output –~~is ~~the~~located ~~answer to~~on the ~~problem~~same itworker ~~was originally trying to solve~~node. '''"Reduce" step:''' Worker nodes now process each group of output data, per key, in parallel. MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel – though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is [[Associative property\|associative]]. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large [[server farm]] can use MapReduce to sort a [[petabyte]] of data in only a few hours.<ref>{{cite web\|last=Czajkowski\|first=Grzegorz,\|title=Sorting Petabytes with MapReduce - The Next Episode\|url=http://googleresearch.blogspot.com/2011/09/sorting-petabytes-with-mapreduce-next.html\|publisher=Google\|accessdate=7 April 2014\|author2=Marián Dvorský \|author3=Jerry Zhao \|author4=Michael Conley \|archivedate=7 September 2011}}</ref> The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.

MapReduce: Difference between revisions