MapReduce
MapReduce is a framework for processing [[Parallel computing|parallelizable]] problems across huge datasets using a large number of computers (nodes), collectively referred to as a [[Computer cluster|cluster]] (if all nodes are on the same local network and use similar hardware) or a [[Grid Computing|grid]] (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a [[filesystem]] (unstructured) or in a [[database]] (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
 
* '''"Map" step:''' Each worker nodesnode applies the "map()" function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed.
* '''"Shuffle" step:''' Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
* '''"Reduce" step:''' Worker nodes now process each group of output data, per key, in parallel.