MapReduce

MapReduce è un framework software brevettato e introdotto da Google per supportare la computazione distribuita su grandi quantità di dati in cluster di computer. Il framework è ispirato alle funzioni map e reduce usate nella programmazione funzionale, sebbene il loro scopo nel framework MapReduce non è lo stesso che nella forma originale. Le librerie MapReduce sono scritte in C++, C#, Erlang, Java, Ocaml, Perl, Python, Ruby, F# e altri linguaggi di programmazione.

Panoramica

Passo "Map": il nodo principale prende l'input, lo partizioni in sotto-problemi più piccoli, e li distribuisce ai nodi operativi. un nodo operativo potrebbe fare questo nuovamente, portando a una struttura ad albero multi-livello. Il nodo operativo processa i problemi più piccoli, e restituisce la risposta al suo nodo principale.

Passo "Reduce": Il nodo principale una volta prese le risposte di tutti i sotto-problemi le combina per ottenere una risposta o uscita, la risposta al problema che si tenta di risolvere.

Il vantaggio della Map Reduce è che permette una processazione distribuita delle operazioni di mappamento e riduzione. Fornendo ogni operazione di map indipendente dalle altre, tutte le map possono essere eseguite in parallelo - sebbene nella pratica ciò è limitato dalla sorgente dati e/o dal numero di CPU vicine a quel dato. Alla stessa maniera, una serie di "riduttori" può eseguire la fase di riduzione - tutto quello che è richiesto è che le uscite della map la quale condivide la stessa chiave sia presentata allo stesso riduttore, allo stesso tempo.

Mentre questo processo può spesso apparire inefficiente comparato agli algoritmi che sono più sequenziali, MapReduce può essere applicato significativamente a più grandi quantità di dati che i server possono gestire comodamente - una grande "fattoria" di server può usare MapReduce per ordinare petabyte di dati in sole poche ore.

Il parallelismo offre anche la possibilità di recuperare dati dal parziale fallimento di server o di dispositivi di archiviazione durante l'operazione se qualche map o reduce fallisce il lavoro può essere riprogettato, assumento che i dati in entrata siano ancora disponibili.

Voci correlate

Apache Hadoop

Collegamenti esterni

Articoli

(EN) "Interpreting the Data: Parallel Analysis with Sawzall" — paper by Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan; from Google Labs
(EN) "Evaluating MapReduce for Multi-core and Multiprocessor Systems" — paper by Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis; from Stanford University
(EN) "Why MapReduce Matters to SQL Data Warehousing" — analysis related to the August, 2008 introduction of MapReduce/SQL integration by Aster Data Systems and Greenplum
(EN) "MapReduce for the Cell B.E. Architecture" — paper by Marc de Kruijf and Karthikeyan Sankaralingam; from University of Wisconsin–Madison
(EN) "Mars: A MapReduce Framework on Graphics Processors" — paper by Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, Tuyong Wang; from Hong Kong University of Science and Technology; published in Proc. PACT 2008. It presents the design and implementation of MapReduce on graphics processors.
(EN) "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters" — paper by Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker; from Yahoo and UCLA; published in Proc. of ACM SIGMOD, pp. 1029–1040, 2007. (This paper shows how to extend MapReduce for relational data processing.)
(EN) FLuX: the Fault-tolerant, Load Balancing eXchange operator from UC Berkeley provides an integration of partitioned parallelism with process pairs. This results in a more pipelined approach than Google's MapReduce with instantaneous failover, but with additional implementation cost.
(EN) "A New Computation Model for Rack-Based Computing" — paper by Foto N. Afrati; Jeffrey D. Ullman; from Stanford University; Not published as of Nov 2009. This paper is an attempt to develop a general model in which one can compare algorithms for computing in an environment similar to what map-reduce expects.
(EN) FPMR: MapReduce framework on FPGA -- paper by Yi Shan, Bo Wang, Jing Yan, Yu Wang, Ningyi Xu, Huazhong Yang (2010), in FPGA '10, Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays.
(EN) "Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore with Tiling" -- paper by Rong Chen, Haibo Chen and Binyu Zang from Fudan University; published in Proc. PACT 2010. It presents the Tiled-MapReduce programming model which optimizes resource usages of MapReduce applications on multicore environment using tiling strategy.
(EN) "Scheduling divisible MapReduce computations " -- paper by Joanna Berlińska from Adam Mickiewicz University and Maciej Drozdowski from Poznan University of Technology; Journal of Parallel and Distributed Computing 71 (2011) 450-459, doi:10.1016/j.jpdc.2010.12.004. It presents scheduling and performance model of MapReduce.

Libri

(EN) Jimmy Lin and Chris Dyer. "Data-Intensive Text Processing with MapReduce" (manuscript)

Educational courses

(EN) Cluster Computing and MapReduce course from Google Code University contains video lectures and related course materials from a series of lectures that was taught to Google software engineering interns during the Summer of 2007.
(EN) MapReduce in a Week course from Google Code University contains a comprehensive introduction to MapReduce including lectures, reading material, and programming assignments.
(EN) MapReduce course, taught by engineers of Google Boston, part of 2008 Independent Activities Period at MIT.