Content deleted Content added
No edit summary |
→Overview: Corrected spelling mistake |
||
Line 18:
MapReduce is a framework for processing [[Parallel computing|parallelizable]] problems across huge datasets using a large number of computers (nodes), collectively referred to as a [[Computer cluster|cluster]] (if all nodes are on the same local network and use similar hardware) or a [[Grid Computing|grid]] (if the nodes are shared across geographically and administratively distributed systems, and use more heterogenous hardware). Processing can occur on data stored either in a [[filesystem]] (unstructured) or in a [[database]] (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
* '''"Map" step:''' Each worker
* '''"Shuffle" step:''' Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
* '''"Reduce" step:''' Worker nodes now process each group of output data, per key, in parallel.
|