MapReduce
MapReduce is a framework for processing [[Parallel computing|parallelizable]] problems across huge datasets using a large number of computers (nodes), collectively referred to as a [[Computer cluster|cluster]] (if all nodes are on the same local network and use similar hardware) or a [[Grid Computing|grid]] (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a [[filesystem]] (unstructured) or in a [[database]] (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
 
* '''"Map" step:''' Each worker nodesnode applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed.
* '''"Shuffle" step:''' Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
* '''"Reduce" step:''' Worker nodes now process each group of output data, per key, in parallel.