{{Short description|Parallel programming model}}
{{technical|date=September 2014}}
'''MapReduce''' is a [[programming model]] and an associated implementation for processing and generating [[big data]] sets with a [[Parallel computing|parallel]], [[distributed computing|distributed]] algorithm on a [[Cluster (computing)|cluster]].<ref>{{cite web|url=https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html|title=MapReduce Tutorial|access-date=3 July 2019|website=Apache Hadoop}}</ref><ref>{{cite web|url=http://news.cnet.com/8301-10784_3-9955184-7.html|title=Google spotlights data center inner workings|date=30 May 2008|website=cnet.com|access-date=31 May 2008|archive-date=19 October 2013|archive-url=https://web.archive.org/web/20131019063218/http://news.cnet.com/8301-10784_3-9955184-7.html|url-status=dead}}</ref><ref name="GoogleMapReduce">{{cite web|url=http://static.googleusercontent.com/media/research.google.com/es/us/archive/mapreduce-osdi04.pdf|title=MapReduce: Simplified Data Processing on Large Clusters|website=googleusercontent.com}}</ref>
 
A MapReduce program is composed of a [[map (parallel pattern)|''map'']] [[procedure (computing)|procedure]], which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a ''[[reduce (parallel pattern)|reduce]]'' method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by [[Marshalling (computer science)|marshalling]] the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for [[Redundancy (engineering)|redundancy]] and [[Fault-tolerant computer system|fault tolerance]].
 
The model is a specialization of the ''split-apply-combine'' strategy for data analysis.<ref>{{Cite journal | doi = 10.18637/jss.v040.i01| title = The split-apply-combine strategy for data analysis| journal = Journal of Statistical Software| volume = 40| pages = 1–29| year = 2011| last1 = Wickham| first1 = Hadley | doi-access = free}}</ref>
It is inspired by the [[map (higher-order function)|map]] and [[reduce (higher-order function)|reduce]] functions commonly used in [[functional programming]],<ref name="map">"Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -[http://research.google.com/archive/mapreduce.html "MapReduce: Simplified Data Processing on Large Clusters"], by Jeffrey Dean and Sanjay Ghemawat; from Google Research</ref> although their purpose in the MapReduce framework is not the same as in their original forms.<ref>{{Cite journal | doi = 10.1016/j.scico.2007.07.001| title = Google's Map ''Reduce'' programming model — Revisited| journal = Science of Computer Programming| volume = 70| pages = 1–30| year = 2008| last1 = Lämmel | first1 = R. | doi-access = }}</ref> The key contributions of the MapReduce framework are not the actual map and reduce functions (which, for example, resemble the 1995 [[Message Passing Interface]] standard's<ref>http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/mpi2-report.htm MPI 2 standard</ref> ''reduce''<ref>{{cite web|url=http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/|title=MPI Reduce and Allreduce · MPI Tutorial|website=mpitutorial.com}}</ref> and ''scatter''<ref>{{cite web|url=http://mpitutorial.com/tutorials/performing-parallel-rank-with-mpi/|title=Performing Parallel Rank with MPI · MPI Tutorial|website=mpitutorial.com}}</ref> operations), but the scalability and fault-tolerance achieved for a variety of applications due to parallelization. As such, a [[single-threaded]] implementation of MapReduce is usually not faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with [[multi-threaded]] implementations on multi-processor hardware.<ref name=stackoverflow>{{cite web
| url = https://stackoverflow.com/questions/3947889/mongodb-terrible-mapreduce-performance
| title = MongoDB: Terrible MapReduce Performance
| date = October 16, 2010
| quote = The MapReduce implementation in MongoDB has little to do with map reduce apparently. Because for all I read, it is single-threaded, while map-reduce is meant to be used highly parallel on a cluster. ... MongoDB MapReduce is single threaded on a single server...
}}</ref> The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.<ref name="ullman" />
 
MapReduce [[library (software)|libraries]] have been written in many programming languages, with different levels of optimization. A popular [[open-source software|open-source]] implementation that has support for distributed shuffles is part of [[Apache Hadoop]]. The name MapReduce originally referred to the proprietary [[Google]] technology, but has since become a [[generic trademark]]. By 2014, Google was no longer using MapReduce as its primary ''[[big data]]'' processing model,<ref>{{cite web|url=http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/|title=Google Dumps MapReduce in Favor of New Hyper-Scale Analytics System|last1=Sverdlik|first1=Yevgeniy|date=2014-06-25|website=Data Center Knowledge|access-date=2015-10-25|quote="We don't really use MapReduce anymore" [Urs Hölzle, senior vice president of technical infrastructure at Google]}}</ref> and development on [[Apache Mahout]] had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.<ref>{{cite news|url=https://analyticsindiamag.com/ai-origins-evolution/why-mapreduce-is-still-a-dominant-approach-for-large-scale-machine-learning/|title=Why MapReduce Is Still A Dominant Approach For Large-Scale Machine Learning|work=Analytics India|date=April 5, 2019}}</ref>
 
==Overview==
MapReduce is a framework for processing [[Parallel computing|parallelizable]] problems across large datasets using a large number of computers (nodes), collectively referred to as a [[Computer cluster|cluster]] (if all nodes are on the same local network and use similar hardware) or a [[Grid Computing|grid]] (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a [[filesystem]] (unstructured) or in a [[database]] (structured). MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to minimize communication overhead.
 
A MapReduce framework (or system) is usually composed of three operations (or steps):
 
# '''"Map" step:''' Eacheach worker nodesnode applies the "<code>map()"</code> function to the local data, and writes the output to a temporary storage. A master node orchestratesensures that foronly redundantone copiescopy of the redundant input data, only one is processed.
# '''Shuffle:''' worker nodes redistribute data based on the output keys (produced by the <code>map</code> function), such that all data belonging to one key is located on the same worker node.
# '''Reduce:''' worker nodes now process each group of output data, per key, in parallel.
 
MapReduce allows for the distributed processing of the map and reduction operations. Maps can be performed in parallel, provided that each mapping operation is independent of the others; in practice, this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is [[Associative property|associative]]. While this process often appears inefficient compared to algorithms that are more sequential (because multiple instances of the reduction process must be run), MapReduce can be applied to significantly larger datasets than a single [[Commodity computing|"commodity" server]] can handle&nbsp;&ndash; a large [[server farm]] can use MapReduce to sort a [[petabyte]] of data in only a few hours.<ref>{{cite web|last=Czajkowski|first=Grzegorz|title=Sorting Petabytes with MapReduce – The Next Episode|url=https://googleresearch.blogspot.com/2011/09/sorting-petabytes-with-mapreduce-next.html|access-date=7 April 2014|author2=Marián Dvorský |author3=Jerry Zhao |author4=Michael Conley |date=7 September 2011 }}</ref> The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled&nbsp;&ndash; assuming the input data are still available.
'''"Shuffle" step:''' Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
 
'''"Reduce" step:''' Worker nodes now process each group of output data, per key, in parallel.
 
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel&nbsp;&ndash; though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is [[Associative property|associative]]. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle&nbsp;&ndash; a large [[server farm]] can use MapReduce to sort a [[petabyte]] of data in only a few hours.<ref>{{cite web|last=Czajkowski|first=Grzegorz,|title=Sorting Petabytes with MapReduce - The Next Episode|url=http://googleresearch.blogspot.com/2011/09/sorting-petabytes-with-mapreduce-next.html|publisher=Google|accessdate=7 April 2014|author2=Marián Dvorský |author3=Jerry Zhao |author4=Michael Conley |archivedate=7 September 2011}}</ref> The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled&nbsp;&ndash; assuming the input data is still available.
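
To make these steps concrete, the following deliberately single-process Python sketch applies a user-supplied map function, groups the intermediate pairs by key, and reduces each group. It is illustrative only: the function names and the in-memory "shuffle" are assumptions, not any particular framework's API, and a real system would run each phase on distributed worker nodes.

<syntaxhighlight lang="python">
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    # Map step: apply map_fn to each input record; each call may
    # emit any number of (key, value) pairs.
    intermediate = []
    for record in inputs:
        intermediate.extend(map_fn(record))

    # Shuffle step: group all intermediate values by key, as the
    # framework would when redistributing data between worker nodes.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce step: process each key's group independently (and thus,
    # in a real system, in parallel).
    return {key: reduce_fn(key, values) for key, values in groups.items()}
</syntaxhighlight>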
 
Another way to look at MapReduce is as a 5-step parallel and distributed computation:
 
# '''Prepare the Map() input''' – the "MapReduce system" designates Map processors, assigns the input key value ''K1'' that each processor would work on, and provides that processor with all the input data associated with that key value.
# '''Run the user-provided Map() code''' – Map() is run exactly once for each ''K1'' key value, generating output organized by key values ''K2''.
# '''"Shuffle" the Map output to the Reduce processors''' – the MapReduce system designates Reduce processors, assigns the ''K2'' key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
# '''Run the user-provided Reduce() code''' – Reduce() is run exactly once for each ''K2'' key value produced by the Map step.
# '''Produce the final output''' – the MapReduce system collects all the Reduce output, and sorts it by ''K2'' to produce the final outcome.
 
These five steps can be logically thought of as running in sequence – each step starts only after the previous step is completed – although in practice they can be interleaved, as long as the final result is not affected.
 
In many situations, the input data might already be distributed ([[Shard (database architecture)|"sharded"]]) among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as close as possible to the Map-generated data they need to process.
 
==Logical view==
The ''Map'' and ''Reduce'' functions of ''MapReduce'' are both defined with respect to data structured in (key, value) pairs. ''Map'' takes one pair of data with a type in one [[data ___domain]], and returns a list of pairs in a different ___domain:

<code>Map(k1,v1)</code> → <code>list(k2,v2)</code>
 
The ''Map'' function is applied in parallel to every pair (keyed by <code>k1</code>) in the input dataset. This produces a list of pairs (keyed by <code>k2</code>) for each call.
After that, the MapReduce framework collects all pairs with the same key (<code>k2</code>) from all lists and groups them together, creating one group for each key.
 
The ''Reduce'' function is then applied in parallel to each group, which in turn produces a collection of values in the same ___domain:
 
<code>Reduce(k2, list (v2))</code> → <code>list((k3, v3))</code><ref>{{Cite web|url=https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Inputs+and+Outputs|title = MapReduce Tutorial}}</ref>
 
Each ''Reduce'' call typically produces either one key value pair or an empty return, though one call is allowed to return more than one key value pair. The returns of all calls are collected as the desired result list.
 
Thus the MapReduce framework transforms a list of (key, value) pairs into another list of (key, value) pairs.<ref>{{Cite web|url=https://github.com/apache/hadoop-mapreduce/blob/307cb5b316e10defdbbc228d8cdcdb627191ea15/src/java/org/apache/hadoop/mapreduce/Reducer.java#L148|title=Apache/Hadoop-mapreduce|website=[[GitHub]]|date=31 August 2021}}</ref> This behavior is different from the typical functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines ''all'' the values returned by map.
 
It is [[Necessity and sufficiency|necessary but not sufficient]] to have implementations of the map and reduce abstractions in order to implement MapReduce. Distributed implementations of MapReduce require a means of connecting the processes performing the Map and Reduce phases. This may be a [[distributed file system]]. Other options are possible, such as direct streaming from mappers to reducers, or for the mapping processors to serve up their results to reducers that query them.
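
In Python type notation, the two user-supplied functions could be declared as follows. This is an illustrative sketch of the signatures above, not code from the original paper; the alias names are assumptions.

<syntaxhighlight lang="python">
from typing import Callable, Iterable, Tuple, TypeVar

K1, V1 = TypeVar("K1"), TypeVar("V1")  # input ___domain
K2, V2 = TypeVar("K2"), TypeVar("V2")  # intermediate ___domain
K3, V3 = TypeVar("K3"), TypeVar("V3")  # output ___domain

# Map(k1, v1) -> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]

# Reduce(k2, list(v2)) -> list((k3, v3))
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[Tuple[K3, V3]]]
</syntaxhighlight>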
 
===Examples===
The canonical MapReduce example counts the appearance of each word in a set of documents:<ref>{{cite web|url=http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0004.html|title=Example: Count word occurrences|publisher=Google Research|access-date=September 18, 2013}}</ref>
 
'''function''' <u>map</u>(String name, String document):
''// name: document name''
''// document: document contents''
'''for each''' word w '''in''' document:
emit (w, 1)
'''function''' <u>reduce</u>(String word, Iterator partialCounts):
''// word: a word''
''// partialCounts: a list of aggregated partial counts''
sum = 0
'''for each''' pc '''in''' partialCounts:
sum += ParseInt(pc)
emit (word, sum)
 
Here, each document is split into words, and each word is counted by the ''map'' function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to ''reduce''. Thus, this function just needs to sum all of its input values to find the total appearances of that word.
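
A direct transcription of this pseudocode into runnable Python follows; it is a single-machine sketch with the shuffle done in memory. (In the pseudocode the partial counts arrive as strings, hence ParseInt; here they are already integers.)

<syntaxhighlight lang="python">
from collections import defaultdict

def map_word_count(name, document):
    # name: document name (unused); document: document contents
    for word in document.split():
        yield (word, 1)

def reduce_word_count(word, partial_counts):
    # word: a word; partial_counts: a list of aggregated partial counts
    yield (word, sum(partial_counts))

documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog"}

# Shuffle: group the map outputs by key before reducing.
groups = defaultdict(list)
for name, text in documents.items():
    for word, count in map_word_count(name, text):
        groups[word].append(count)

counts = dict(pair for word, pcs in groups.items()
              for pair in reduce_word_count(word, pcs))
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
</syntaxhighlight>
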
As another example, imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age. In [[SQL]], such a query could be expressed as:
 
<sourcesyntaxhighlight lang="sql">
SELECT age, AVG(contacts)
FROM social.person
GROUP BY age
ORDER BY age
</syntaxhighlight>
</source>
 
Using MapReduce, the {{mono|K1}} key values could be the integers 1 through 1100, each representing a batch of 1 million records, the {{mono|K2}} key value could be a person's age in years, and this computation could be achieved using the following functions:
 
'''function''' Map '''is'''
    '''input:''' integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    '''for each''' social.person record in the K1 batch '''do'''
        '''let''' Y be the person's age
        '''let''' N be the number of contacts the person has
        produce one output record (Y,(N,1))
    '''repeat'''
'''end function'''

'''function''' Reduce '''is'''
    '''input:''' age (in years) Y
    '''for each''' input record (Y,(N,C)) '''do'''
        Accumulate in S the sum of N*C
        Accumulate in C<sub>new</sub> the sum of C
    '''repeat'''
    '''let''' A be S/C<sub>new</sub>
    produce one output record (Y,(A,C<sub>new</sub>))
'''end function'''
 
Note that in the {{mono|Reduce}} function, {{mono|C}} is the count of people having in total N contacts, so in the {{mono|Map}} function it is natural to write {{mono|1=C=1}}, since every output pair is referring to the contacts of one single person.
 
The MapReduce system would line up the 1100 Map processors, and would provide each with its corresponding 1 million input records. The Map step would produce 1.1 billion {{mono|(Y,(N,1))}} records, with {{mono|Y}} values ranging between, say, 8 and 103. The MapReduce system would then line up the 96 Reduce processors by shuffling the key/value pairs (since the average is needed per age), and provide each with its millions of corresponding input records. The Reduce step would result in the much reduced set of only 96 output records {{mono|(Y,A)}}, which would be put in the final result file, sorted by {{mono|Y}}.
 
The count information in the record is important if the processing is reduced more than once. If we did not add the count of the records, the computed average would be wrong, for example:

''-- map output #1: age, quantity of contacts''
10, 9
10, 9
10, 9

''-- map output #2: age, quantity of contacts''
10, 9
10, 9

''-- map output #3: age, quantity of contacts''
10, 10
 
If we reduce files {{mono|#1}} and {{mono|#2}}, we will have a new file with an average of 9 contacts for a 10-year-old person ((9+9+9+9+9)/5):
 
''-- reduce step #1: age, average of contacts''
10, 9
 
If we reduce it with file {{mono|#3}}, we lose the count of how many records we've already seen, so we end up with an average of 9.5 contacts for a 10-year-old person ((9+10)/2), which is wrong. The correct answer is 9.1<span style="text-decoration: overline;">66</span> = 55 / 6 = (9×3+9×2+10×1)/(3+2+1).
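
The fix is to carry (sum, count) pairs through every reduction and divide only at the end. The following Python sketch (illustrative, with hypothetical variable names) contrasts the two approaches on the numbers above:

<syntaxhighlight lang="python">
def combine(a, b):
    # Combine two (sum_of_contacts, record_count) partial aggregates.
    # This operation is associative, so partial results can be merged
    # in any grouping without changing the final answer.
    return (a[0] + b[0], a[1] + b[1])

f1 = (9 + 9 + 9, 3)  # map output #1: three people with 9 contacts
f2 = (9 + 9, 2)      # map output #2: two people with 9 contacts
f3 = (10, 1)         # map output #3: one person with 10 contacts

s, c = combine(combine(f1, f2), f3)
print(s / c)  # 9.1666... = 55/6, the correct average

# Averaging the averages loses the counts and gives the wrong answer:
avg12 = (f1[0] + f2[0]) / (f1[1] + f2[1])  # 9.0
print((avg12 + f3[0] / f3[1]) / 2)         # 9.5, wrong
</syntaxhighlight>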
 
==Dataflow==
[[Software framework#Architecture|Software framework architecture]] adheres to [[open-closed principle]] where code is effectively divided into unmodifiable ''frozen spots'' and [[extensibility|extensible]] ''hot spots''. The frozen spot of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:
* an ''input reader''
* a ''Map'' function
* a ''partition'' function
* a ''compare'' function
* a ''Reduce'' function
* an ''output writer''
 
===Input reader===
The ''input reader'' divides the input into appropriate size 'splits' (in practice, typically, 64&nbsp;MB to 128&nbsp;MB) and the framework assigns one split to each ''Map'' function. The ''input reader'' reads data from stable storage (typically, a [[distributed file system]]) and generates key/value pairs.
 
A common example will read a directory full of text files and return each line as a record.
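
A minimal sketch of such a reader in Python (an illustration, not any framework's interface) could yield one (file name, line) pair per line:

<syntaxhighlight lang="python">
import os

def input_reader(directory):
    # Yield (key, value) records: here, (file name, line of text)
    # for every line of every file in the directory.
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            for line in f:
                yield (filename, line.rstrip("\n"))
</syntaxhighlight>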

===Map function===
The ''Map'' function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.

If the application is doing a word count, the map function would break the line into words and output a key/value pair for each word. Each output pair would contain the word as the key and the number of instances of that word in the line as the value.

===Partition function===
Each ''Map'' function output is allocated to a particular ''reducer'' by the application's ''partition'' function for [[sharding]] purposes. The ''partition'' function is given the key and the number of reducers and returns the index of the desired ''reducer''.
 
A typical default is to [[Hash function|hash]] the key and use the hash value [[Modulo operation|modulo]] the number of ''reducers''. It is important to pick a partition function that gives an approximately uniform distribution of data per shard for [[load balancing (computing)|load-balancing]] purposes, otherwise the MapReduce operation can be held up waiting for slow reducers to finish (i.e. the reducers assigned the larger shares of the non-uniformly partitioned data).
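
A sketch of this default scheme in Python follows (illustrative only; the MD5-based hash is an assumption, not a specific framework's choice, and is used because every worker must assign the same key to the same reducer, whereas Python's built-in <code>hash</code> of a string varies between processes):

<syntaxhighlight lang="python">
import hashlib

def partition(key, num_reducers):
    # Map a key to a reducer index in [0, num_reducers).
    # A stable hash ensures all workers agree on the assignment;
    # a badly skewed hash would leave some reducers with far more
    # data than others, stalling the whole job.
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers
</syntaxhighlight>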
 
Between the map and reduce stages, the data are ''shuffled'' (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced them to the shard in which they will be reduced. The shuffle can sometimes take longer than the computation time depending on network bandwidth, CPU speeds, data produced and time taken by map and reduce computations.
 
===Comparison function===
The input for each ''Reduce'' is pulled from the machine where the ''Map'' ran and sorted using the application's ''comparison'' function.

===Reduce function===
The framework calls the application's ''Reduce'' function once for each unique key in the sorted order. The ''Reduce'' can iterate through the values that are associated with that key and produce zero or more outputs.
 
===Output writer===
The ''Output Writer'' writes the output of the ''Reduce'' to the stable storage, usually a [[distributed file system]].
 
==Theoretical background==
 
Properties of [[Monoid|monoids]] are the basis for ensuring the validity of MapReduce operations.<ref>{{Cite journal
| doi = 10.1017/S0956796817000193
| title = An algebra for distributed Big Data analytics
| journal = Journal of Functional Programming
| volume = 28
| year = 2017
| last = Fegaras
| first = Leonidas
| s2cid = 44629767
| doi-access =
}}</ref><ref>{{cite arXiv
|last=Lin
|first=Jimmy
|title=Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms
|eprint=1304.7544
|date=29 Apr 2013|class=cs.DC
}}</ref>
 
In the Algebird package<ref>{{Cite web|title=Abstract Algebra for Scala|url=https://twitter.github.io/algebird/}}</ref> a Scala implementation of Map/Reduce explicitly requires a monoid class type.<ref>{{Cite web|title=Encoding Map-Reduce As A Monoid With Left Folding|date=5 September 2016|url=http://erikerlandson.github.io/blog/2016/09/05/expressing-map-reduce-as-a-left-folding-monoid/}}</ref>
 
The operations of MapReduce deal with two types: the type ''A'' of input data being mapped, and the type ''B'' of output data being reduced.
 
The ''Map'' operation takes individual values of type ''A'' and produces, for each ''a:A'', a value ''b:B''. The ''Reduce'' operation requires a binary operation • defined on values of type ''B''; it consists of folding all available ''b:B'' to a single value.
 
From a basic requirements point of view, any MapReduce operation must involve the ability to arbitrarily regroup data being reduced. Such a requirement amounts to two properties of the operation •:
* associativity: (''x'' • ''y'') • ''z'' = ''x'' • (''y'' • ''z'')
* existence of neutral element ''e'' such that ''e'' • ''x'' = ''x'' • ''e'' = ''x'' for every ''x:B''.
 
The second property guarantees that, when parallelized over multiple nodes, the nodes that don't have any data to process would have no impact on the result.
 
These two properties amount to having a [[monoid]] (''B'', •, ''e'') on values of type ''B'' with operation • and with neutral element ''e''.
 
There are no requirements on the values of type ''A''; an arbitrary function ''A'' &rarr; ''B'' can be used for the ''Map'' operation. This means that we have a [[catamorphism]] ''A*'' &rarr; (''B'', •, ''e''). Here ''A*'' denotes a [[Kleene star]], also known as the type of lists over ''A''.
 
The ''Shuffle'' operation per se is not related to the essence of MapReduce; it's needed to distribute calculations over the cloud.
 
It follows from the above that not every binary ''Reduce'' operation will work in MapReduce. Here are some counterexamples:
 
* building a tree from subtrees: this operation is not associative, and the result will depend on grouping;
* direct calculation of averages: ''avg'' is also not associative (and it has no neutral element); to calculate an average, one needs to calculate [[Moment (mathematics)|moments]].
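
For instance, the average becomes MapReduce-friendly when expressed over the monoid of (sum, count) pairs, dividing only after the final reduction. A small Python sketch of this moments-based approach (names are illustrative):

<syntaxhighlight lang="python">
from functools import reduce

IDENTITY = (0, 0)  # neutral element e: zero sum, zero count

def combine(x, y):
    # Associative binary operation "•" on B = (sum, count) pairs.
    return (x[0] + y[0], x[1] + y[1])

def to_moments(value):
    # The Map step: an arbitrary function A -> B embedding each
    # raw value into the monoid.
    return (value, 1)

def average(moments):
    s, n = moments
    return s / n if n else None

data = [3, 5, 10]
print(average(reduce(combine, map(to_moments, data), IDENTITY)))  # 6.0
</syntaxhighlight>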
 
==Performance considerations==
MapReduce programs are not guaranteed to be fast. The main benefit of this programming model is that it exploits the optimized shuffle operation of the platform, while the programmer only has to write the ''Map'' and ''Reduce'' parts of the program.
In practice, however, the author of a MapReduce program has to take the shuffle step into consideration; in particular, the partition function and the amount of data written by the ''Map'' function can have a large impact on the performance and scalability. Additional modules such as the ''Combiner'' function can help to reduce the amount of data written to disk and transmitted over the network. MapReduce applications can achieve sub-linear speedups under specific circumstances.<ref name=":0">{{Cite journal|title = BSP cost and scalability analysis for MapReduce operations|journal = Concurrency and Computation: Practice and Experience|date = 2015-01-01|issn = 1532-0634|pages = 2503–2527|doi = 10.1002/cpe.3628|first1 = Hermes|last1 = Senger|first2 = Veronica|last2 = Gil-Costa|first3 = Luciana|last3 = Arantes|first4 = Cesar A. C.|last4 = Marcondes|first5 = Mauricio|last5 = Marín|first6 = Liria M.|last6 = Sato|first7 = Fabrício A.B.|last7 = da Silva|volume=28|issue = 8|hdl = 10533/147670|s2cid = 33645927|hdl-access = free}}</ref>
 
When designing a MapReduce algorithm, the author needs to choose a good tradeoff<ref name="ullman">{{Cite journal|doi=10.1145/2331042.2331053|title=Designing good MapReduce algorithms|journal=XRDS: Crossroads, the ACM Magazine for Students|volume=19|pages=30–34|year=2012|last1=Ullman|first1=J. D.|s2cid=26498063|author-link1=Jeffrey Ullman|url=http://xrds.acm.org/article.cfm?aid=2331053|url-access=subscription}}</ref> between the computation and the communication costs. Communication cost often dominates the computation cost,<ref name="ullman"/><ref name=":0"/> and many MapReduce implementations are designed to write all communication to distributed storage for crash recovery.
 
In tuning performance of MapReduce, the complexity of mapping, shuffle, sorting (grouping by the key), and reducing has to be taken into account. The amount of data produced by the mappers is a key parameter that shifts the bulk of the computation cost between mapping and reducing. Reducing includes sorting (grouping of the keys) which has nonlinear complexity. Hence, small partition sizes reduce sorting time, but there is a trade-off because having a large number of reducers may be impractical. The influence of split unit size is marginal (unless chosen particularly badly, say less than 1&nbsp;MB). The gains from some mappers reading load from local disks, on average, are minor.<ref>{{Cite journal|title = Scheduling divisible MapReduce computations|last1 = Berlińska|first1 = Joanna|date = 2010-12-01|journal = Journal of Parallel and Distributed Computing|doi = 10.1016/j.jpdc.2010.12.004|last2 = Drozdowski|first2 = Maciej|volume=71|issue = 3|pages=450–459}}</ref>
 
For processes that complete quickly, and where the data fits into the main memory of a single machine or a small cluster, using a MapReduce framework is usually not effective. Since these frameworks are designed to recover from the loss of whole nodes during the computation, they write interim results to distributed storage. This crash recovery is expensive, and only pays off when the computation involves many computers and a long runtime. A task that completes in seconds can just be restarted in the case of an error, and the likelihood of at least one machine failing grows quickly with the cluster size. On such problems, implementations keeping all data in memory and simply restarting a computation on node failures, or, when the data is small enough, non-distributed solutions will often be faster than a MapReduce system.
 
==Distribution and reliability==
 
==Uses==
MapReduce is useful in a wide range of applications, including distributed pattern-based searching, distributed sorting, web link-graph reversal, Singular Value Decomposition,<ref>{{cite web|last1=Bosagh Zadeh|first1=Reza|last2=Carlsson|first2=Gunnar|title=Dimension Independent Matrix Square Using MapReduce|website=Stanford University|url=https://stanford.edu/~rezab/papers/dimsum.pdf|access-date=12 July 2014|bibcode=2013arXiv1304.1467B|year=2013|arxiv=1304.1467}}</ref> web access log stats, [[inverted index]] construction, [[document clustering]], [[machine learning]],<ref name="mrml">{{cite web|url=http://www.willowgarage.com/map-reduce-machine-learning-multicore|title=Map-Reduce for Machine Learning on Multicore|first1=Andrew Y.|last1=Ng|first2=Gary|last2=Bradski|first3=Cheng-Tao|last3=Chu|first4=Kunle|last4=Olukotun|first5=Sang Kyun|last5=Kim|first6=Yi-An|last6=Lin|first7=YuanYuan|last7=Yu|publisher=NIPS 2006|year=2006|access-date=2009-11-24|archive-date=2010-06-20|archive-url=https://web.archive.org/web/20100620092743/http://www.willowgarage.com/map-reduce-machine-learning-multicore|url-status=dead}}</ref> and [[statistical machine translation]]. Moreover, the MapReduce model has been adapted to several computing environments like multi-core and many-core systems,<ref name="evalMR">{{Cite book|doi=10.1109/HPCA.2007.346181|chapter=Evaluating MapReduce for Multi-core and Multiprocessor Systems|title=2007 IEEE 13th International Symposium on High Performance Computer Architecture|pages=13|year=2007|last1=Ranger|first1=C.|last2=Raghuraman|first2=R.|last3=Penmetsa|first3=A.|last4=Bradski|first4=G.|last5=Kozyrakis|first5=C.|isbn=978-1-4244-0804-7|citeseerx=10.1.1.220.8210|s2cid=12563671}}</ref><ref name="graphicsMR">{{Cite book|doi=10.1145/1454115.1454152|chapter=Mars: a MapReduce framework on graphics processors|title=Proceedings of the 17th international conference on Parallel architectures and compilation techniques – PACT '08|pages=260|year=2008|last1=He|first1=B.|last2=Fang|first2=W.|last3=Luo|first3=Q.|last4=Govindaraju|first4=N. K.|chapter-url=http://wenbin.org/doc/papers/Wenbin08PACT.pdf|last5=Wang|first5=T.|isbn=9781605582825|s2cid=207169888}}</ref><ref name="tiledMR">{{Cite book|doi=10.1145/1854273.1854337|chapter=Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling|title=Proceedings of the 19th international conference on Parallel architectures and compilation techniques – PACT '10|pages=523|year=2010|last1=Chen|first1=R.|last2=Chen|first2=H.|last3=Zang|first3=B.|isbn=9781450301787|s2cid=2082196}}</ref> desktop grids,<ref name="gridMR">{{Cite book|doi=10.1109/3PGCIC.2010.33|chapter=Towards MapReduce for Desktop Grid Computing|title=2010 International Conference on P2P, Parallel, Grid, Cloud and Internet Computing|pages=193|year=2010|last1=Tang|first1=B.|last2=Moca|first2=M.|last3=Chevalier|first3=S.|last4=He|first4=H.|last5=Fedak|first5=G.|chapter-url=http://graal.ens-lyon.fr/~gfedak/papers/xtremmapreduce.3pgcic10.pdf|isbn=978-1-4244-8538-3|citeseerx=10.1.1.671.2763|s2cid=15044391}}</ref>
multi-cluster,<ref name="HMR">{{Cite book | doi = 10.1145/1996023.1996026| chapter = A Hierarchical Framework for Cross-Domain MapReduce Execution|chapter-url = http://yuanluo.net/publications/LUO_ECMLS2011.pdf| title = Proceedings of the second international workshop on Emerging computational methods for the life sciences (ECMLS '11)| year = 2011| last1 = Luo | first1 = Y. | last2 = Guo | first2 = Z. | last3 = Sun | first3 = Y.| last4 = Plale | first4 = B. |author4-link=Beth Plale| last5 = Qiu | first5 = J. | last6=Li|
first6=W. |isbn = 978-1-4503-0702-4| citeseerx = 10.1.1.364.9898| s2cid = 15179363}}</ref> volunteer computing environments,<ref name="volunteerMR">{{Cite book | doi = 10.1145/1851476.1851489| chapter = MOON: MapReduce On Opportunistic eNvironments| title = Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing – HPDC '10| pages = 95| year = 2010| last1 = Lin | first1 = H. | last2 = Ma | first2 = X. | last3 = Archuleta | first3 = J. | last4 = Feng | first4 = W. C. | last5 = Gardner | first5 = M. | last6 = Zhang | first6 = Z. | chapter-url = http://eprints.cs.vt.edu/archive/00001089/01/moon.pdf| isbn = 9781605589428| s2cid = 2351790}}</ref> dynamic cloud environments,<ref name="dynCloudMR">{{Cite journal| doi = 10.1016/j.jcss.2011.12.021| title = P2P-MapReduce: Parallel data processing in dynamic Cloud environments| journal = [[Journal of Computer and System Sciences]]| volume = 78| issue = 5| pages = 1382–1402| year = 2012| last1 = Marozzo| first1 = F.| last2 = Talia| first2 = D.| last3 = Trunfio| first3 = P.| doi-access = free}}</ref> mobile environments,<ref name="mobileMR">{{Cite book | doi = 10.1145/1839294.1839332| chapter = Misco: a MapReduce framework for mobile systems| title = Proceedings of the 3rd International Conference on PErvasive Technologies Related to Assistive Environments – PETRA '10| pages = 1| year = 2010| last1 = Dou | first1 = A. | last2 = Kalogeraki | first2 = V. | last3 = Gunopulos | first3 = D. | last4 = Mielikainen | first4 = T. | last5 = Tuulos | first5 = V. H. | isbn = 9781450300711| s2cid = 14517696}}</ref> and high-performance computing environments.<ref>{{cite book|chapter=Characterization and Optimization of Memory-Resident MapReduce on HPC Systems|publisher=IEEE|date=May 2014|doi=10.1109/IPDPS.2014.87|isbn=978-1-4799-3800-1|title=2014 IEEE 28th International Parallel and Distributed Processing Symposium|last1=Wang|first1=Yandong|last2=Goldstone|first2=Robin|last3=Yu|first3=Weikuan|last4=Wang|first4=Teng|pages=799–808|s2cid=11157612}}</ref>
 
At Google, MapReduce was used to completely regenerate Google's index of the [[World Wide Web]]. It replaced the old ''ad hoc'' programs that updated the index and ran the various analyses.<ref name="usage">{{cite web|quote=As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes.|url=http://www.baselinemag.com/c/a/Infrastructure/How-Google-Works-1/5|title=How Google Works|date=7 July 2006|publisher=baselinemag.com}}</ref> Development at Google has since moved on to technologies such as Percolator, FlumeJava<ref name="Chambers2010">{{cite book|last1=Chambers|first1=Craig|last2=Raniwala|first2=Ashish|last3=Perry|first3=Frances|last4=Adams|first4=Stephen|last5=Henry|first5=Robert R.|last6=Bradshaw|first6=Robert|last7=Weizenbaum|first7=Nathan|title=Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation|chapter=FlumeJava|date=1 January 2010|pages=363–375|doi=10.1145/1806596.1806638|url=https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35650.pdf|access-date=4 August 2016|isbn=9781450300193|s2cid=14888571|archive-url=https://web.archive.org/web/20160923141630/https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35650.pdf|archive-date=23 September 2016}}</ref> and [[Google MillWheel|MillWheel]] that offer streaming operation and updates instead of batch processing, to allow integrating "live" search results without rebuilding the complete index.<ref>Peng, D., & Dabek, F. (2010, October). Large-scale Incremental Processing Using Distributed Transactions and Notifications. In OSDI (Vol. 10, pp. 1-15).</ref>
 
MapReduce's stable inputs and outputs are usually stored in a [[distributed file system]]. The transient data are usually stored on local disk and fetched remotely by the reducers.
 
==Criticism==
 
===Lack of novelty===
[[David DeWitt]] and [[Michael Stonebraker]], computer scientists specializing in [[parallel database]]s and [[shared-nothing architecture]]s, have been critical of the breadth of problems that MapReduce can be used for.<ref name="shark">{{cite web|url=http://typicalprogrammer.com/relational-database-experts-jump-the-mapreduce-shark|title=Database Experts Jump the MapReduce Shark}}</ref> They called its interface too low-level and questioned whether it really represents the [[paradigm shift]] its proponents have claimed it is.<ref name="ddandms1">{{cite web|url=https://craig-henderson.blogspot.com/2009/11/dewitt-and-stonebrakers-mapreduce-major.html|title=MapReduce: A major step backwards|author=David DeWitt|author2=Michael Stonebraker|publisher=craig-henderson.blogspot.com|access-date=2008-08-27|author-link=David DeWitt|author2-link=Michael Stonebraker}}</ref> They challenged the MapReduce proponents' claims of novelty, citing [[Teradata]] as an example of [[prior art]] that has existed for over two decades. They also compared MapReduce programmers to [[CODASYL|Codasyl]] programmers, noting both are "writing in a [[Low-level programming language|low-level language]] performing low-level record manipulation."<ref name="ddandms1"/> MapReduce's use of input files and lack of [[Logical schema|schema]] support prevents the performance improvements enabled by common database system features such as [[B-tree]]s and [[Partition (database)|hash partitioning]], though projects such as [[Pig (programming language)|Pig (or PigLatin)]], [[Sawzall (programming language)|Sawzall]], [[Apache Hive]],<ref name="ApacheHiveWiki">{{cite web|url=https://cwiki.apache.org/confluence/display/Hive/Home|title=Apache Hive - Index of - Apache Software Foundation}}</ref> [http://ysmart.cse.ohio-state.edu/ YSmart],<ref name="YSmartPaper">{{cite web|url=http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf|title=YSmart: Yet Another SQL-to-MapReduce Translator|author=Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He and Xiaodong Zhang|format=PDF}}</ref> [[HBase]]<ref name="HBase">{{cite web|url=http://hbase.apache.org/|title=HBase - HBase Home - Apache Software Foundation}}</ref> and [[Bigtable]]<ref name="HBase"/><ref name="BigtablePaper">{{cite web|url=http://research.google.com/archive/bigtable-osdi06.pdf|title=Bigtable: A Distributed Storage System for Structured Data|format=PDF}}</ref> are addressing some of these problems.
 
Greg Jorgensen wrote an article rejecting these views.<ref name="gj1">{{cite web|url=http://typicalprogrammer.com/relational-database-experts-jump-the-mapreduce-shark|title=Relational Database Experts Jump The MapReduce Shark|author=Greg Jorgensen|publisher=typicalprogrammer.com|access-date=2009-11-11}}</ref> Jorgensen asserts that DeWitt and Stonebraker's entire analysis is groundless as MapReduce was never designed nor intended to be used as a database.
 
DeWitt and Stonebraker subsequently published a detailed benchmark study in 2009 comparing performance of [[Hadoop|Hadoop's]] MapReduce and [[RDBMS]] approaches on several specific problems.<ref name="sigmod">{{cite web|url=https://database.cs.brown.edu/projects/mapreduce-vs-dbms/|title=A Comparison of Approaches to Large-Scale Data Analysis|first1=Andrew|last1=Pavlo|first2=Erik|last2=Paulson|first3=Alexander|last3=Rasin|first4=Daniel J.|last4=Abadi|first5=David J.|last5=DeWitt|first6=Samuel|last6=Madden|first7=Michael|last7=Stonebraker|publisher=Brown University|access-date=2010-01-11}}</ref> They concluded that relational databases offer real advantages for many kinds of data use, especially on complex processing or where the data is used across an enterprise, but that MapReduce may be easier for users to adopt for simple or one-time processing tasks.
 
The MapReduce programming paradigm was also described in [[Danny Hillis]]'s 1985 thesis<ref name="WDHmit86">{{cite book |author-first=W. Danny |author-last=Hillis |date=1986 |title=The Connection Machine |publisher=[[MIT Press]] |isbn=0262081571 |url-access=registration |url=https://archive.org/details/connectionmachin00hill }}</ref> intended for use on the [[Connection Machine]], where it was called "xapping/reduction"<ref>{{cite web |url=http://bitsavers.trailing-edge.com/pdf/thinkingMachines/CM2/HA87-4_Connection_Machine_Model_CM-2_Technical_Summary_Apr1987.pdf |title=Connection Machine Model CM-2 Technical Summary |author=<!--Not stated--> |date=1987-04-01 |publisher=[[Thinking Machines Corporation]] |access-date=2022-11-21}}</ref> and relied upon that machine's special hardware to accelerate both map and reduce. The dialect ultimately used for the Connection Machine, the 1986 [[StarLisp]], had parallel <code>*map</code> and <code>reduce!!</code>,<ref>{{cite web |url=https://www.softwarepreservation.org/projects/LISP/starlisp/supplement-to-the-starlisp-reference-manual-version-5-0.pdf |title=Supplement to the *Lisp Reference Manual |author=<!--Not stated--> |date=1988-09-01 |publisher=[[Thinking Machines Corporation]] |access-date=2022-11-21}}</ref> which in turn was based on the 1984 [[Common Lisp]], which had non-parallel <code>map</code> and <code>reduce</code> built in.<ref>{{cite web |url=https://collections.lib.utah.edu/dl_files/20/2e/202ebf04b52d043c78297444bc9bc4fbc17b6b5e.pdf |title=Rediflow Architecture Prospectus |author=<!--Not stated--> |date=1986-04-05 |publisher=[[University of Utah School of Computing|University of Utah Department of Computer Science]] |access-date=2022-11-21}}</ref> The [[Fold (higher-order function)#Linear vs. tree-like folds|tree-like]] approach that the Connection Machine's [[Hypercube internetwork topology|hypercube architecture]] uses to execute <code>reduce</code> in <math>O(\log n)</math> time<ref>{{cite book |url=https://www.cise.ufl.edu/~sahni/papers/imagemono.pdf#page=20 |title=Hypercube Algorithms for Image Processing and Pattern Recognition |last=Ranka |first=Sanjay |date=1989 |access-date=2022-12-08 |section=2.6 Data Sum |publisher=University of Florida}}</ref> is effectively the same as the approach referred to within the Google paper as prior work.{{r|GoogleMapReduce|p=11|q=an associative function can be computed over all prefixes of an N element array in log N time on N processors using parallel prefix computations. MapReduce can be considered a simplification and distillation of some of these models}}
 
In 2010 Google was granted what is described as a patent on MapReduce. The patent, filed in 2004, may cover use of MapReduce by open source software such as [[Hadoop]], [[CouchDB]], and others. In ''[[Ars Technica]]'', an editor acknowledged Google's role in popularizing the MapReduce concept, but questioned whether the patent was valid or novel.<ref>{{cite news |last1=Paul |first1=Ryan |title=Google's MapReduce patent: what does it mean for Hadoop? |url=https://arstechnica.com/information-technology/2010/01/googles-mapreduce-patent-what-does-it-mean-for-hadoop/ |access-date=21 March 2021 |work=Ars Technica |date=20 January 2010 |language=en-us}}</ref><ref name="patent">{{cite web|url=http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331|title=United States Patent: 7650331 - System and method for efficient large-scale data processing|website=uspto.gov|access-date=2010-01-19|archive-date=2013-09-21|archive-url=https://web.archive.org/web/20130921164908/http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331|url-status=dead}}</ref> In 2013, as part of its "Open Patent Non-Assertion (OPN) Pledge", Google pledged to only use the patent defensively.<ref>{{cite news |last1=Nazer |first1=Daniel |title=Google Makes Open Patent Non-assertion Pledge and Proposes New Licensing Models |url=https://www.eff.org/deeplinks/2013/03/google-makes-open-patent-non-assertion-pledge |access-date=21 March 2021 |work=Electronic Frontier Foundation |date=28 March 2013 |language=en}}</ref><ref>{{cite news |last1=King |first1=Rachel |title=Google expands open patent pledge to 79 more about data center management |url=https://www.zdnet.com/article/google-expands-open-patent-pledge-to-79-more-about-data-center-management/ |access-date=21 March 2021 |work=ZDNet |date=2013 |language=en}}</ref> The patent is expected to expire on 23 December 2026.<ref>{{cite web |title=System and method for efficient large-scale data processing |url=https://patents.google.com/patent/US7650331B1/en |publisher=Google Patents Search |access-date=21 March 2021 |language=en |date=18 June 2004}}</ref>
 
===Restricted programming framework===
MapReduce tasks must be written as acyclic dataflow programs, i.e. a stateless mapper followed by a stateless reducer, that are executed by a batch job scheduler. This paradigm makes repeated querying of datasets difficult and imposes limitations that are felt in fields such as [[Graph (abstract data type)|graph]] processing<ref>{{cite conference|url=https://csc.csudh.edu/btang/seminar/papers/BigD399.pdf|title=Map-Based Graph Analysis on MapReduce|last1=Gupta|first1=Upa|last2=Fegaras|first2=Leonidas|date=2013-10-06|publisher=[[IEEE]]|book-title=Proceedings: 2013 IEEE International Conference on Big Data|pages=24–30|___location=[[Santa Clara, California]]|conference=2013 IEEE International Conference on Big Data}}</ref> where iterative algorithms that revisit a single [[working set]] multiple times are the norm, as well as, in the presence of [[Hard disk drive|disk]]-based data with high [[Latency (engineering)#Mechanics|latency]], even the field of [[machine learning]] where multiple passes through the data are required even though algorithms can tolerate serial access to the data each pass.<ref>{{cite conference|first1=Matei|last1=Zaharia|first2=Mosharaf|last2=Chowdhury|first3=Michael|last3=Franklin|first4=Scott|last4=Shenker|first5=Ion|last5=Stoica|title=Spark: Cluster Computing with Working Sets|url=https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf|conference=HotCloud 2010|date=June 2010}}</ref>
 
==Conferences and users groups==
* [http://graal.ens-lyon.fr/mapreduce/ The First International Workshop on MapReduce and its Applications (MAPREDUCE'10)] was held in June 2010 with the HPDC conference and OGF'29 meeting in Chicago, IL.
* [http://mapreduce.meetup.com/ MapReduce Users Groups] around the world.
 
==See also==
 
* [[Bird–Meertens formalism]]
* [[Parallelization contract]]
 
===Implementations of MapReduce===
* [[Apache CouchDB]]
* [[Apache Hadoop]]
* [[Infinispan]]
* [[MongoDB]] - A [[scalable]], high-performance, [[open source]] [[NoSQL]] [[database]]
* [[Riak]]
 
===Related concepts and software===
* [[Algorithmic skeleton]] - A high-level parallel programming model for parallel and distributed computing
* [[Apache Accumulo]] - Secure Big Table
* [[Apache Cassandra]] - A column-oriented database that supports access from Hadoop
* [[Big data]]
* [[Cloud computing]]
* [[Clusterpoint]] - A scalable, high-performance, commercial software XML database
* [[Datameer]] Analytics Solution (DAS) - data source integration, storage, analytics engine and visualization
* [[Divide and conquer algorithm]]
* [[Fork–join model]]
* [[HBase]] - [[BigTable]]-model database
* [[HPCC]] - [[LexisNexis]] Risk Solutions High Performance Computing Cluster
* [[Hypertable]] - HBase alternative
* [[Nutch]] - An effort to build an open source search engine based on [[Lucene]] and Hadoop, also created by Doug Cutting
* Osprey, a fault tolerant MapReduce like system proposed in [http://db.csail.mit.edu/pubs/paper.pdf Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database]
* [[Pentaho]] - Open source data integration (Kettle), analytics, reporting, visualization and predictive analytics directly from Hadoop nodes
* [[Pig (programming language)|Apache Pig]] A language and compiler to generate Hadoop programs
* [[Programming with Big Data in R]]
* [[Sector/Sphere]] - Open source distributed storage and processing
 
==References==
{{reflist|30em}}
General references:
{{Refbegin}}
* Dean, Jeffrey & Ghemawat, Sanjay (2004). [http://research.google.com/archive/mapreduce.html "MapReduce: Simplified Data Processing on Large Clusters"]. Retrieved Nov. 23, 2011.
* Matt Williams (2009). [http://wordflows.com/matt/2009/01/18/understanding-mapreduce/ "Understanding Map-Reduce"]. Retrieved Apr. 13, 2011.
{{Refend}}
 
==External links==
{{Commons category}}
* [http://mapreduce.sandia.gov/index.html MapReduce-MPI] MapReduce-MPI Library
 
; Papers
{{Refbegin}}
* [http://www.researchgate.net/publication/259226804_A_MapReduce_based_distributed_SVM_algorithm_for_binary_classification "CloudSVM: Training an SVM Classifier in Cloud Computing Systems"] — paper by F. Ozgur Catak, M. Erdal Balaban, Springer, Lecture Notes in Computer Science, Pervasive Computing and Networked World 2012 from [[TÜBİTAK]] and [[Istanbul University]]
* [http://dl.acm.org/citation.cfm?id=1996023.1996026 "A Hierarchical Framework for Cross-Domain MapReduce Execution"] — paper by Yuan Luo, Zhenhua Guo, Yiming Sun, Beth Plale, Judy Qiu; from [[Indiana University]] and Wilfred Li; from [[University of California, San Diego]]
* [http://research.google.com/archive/sawzall.html "Interpreting the Data: Parallel Analysis with Sawzall"] — paper by Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan; from [[Google Labs]]
* [http://csl.stanford.edu/%7Echristos/publications/2007.cmp_mapreduce.hpca.pdf "Evaluating MapReduce for Multi-core and Multiprocessor Systems"] — paper by Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis; from [[Stanford University]]
* [http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/ "Why MapReduce Matters to SQL Data Warehousing"] — analysis related to the August, 2008 introduction of MapReduce/SQL integration by [[Aster Data Systems]] and [[Greenplum]]
* [http://pages.cs.wisc.edu/~dekruijf/docs/mapreduce-cell.pdf "MapReduce for the Cell B.E. Architecture"] — paper by Marc de Kruijf and Karthikeyan Sankaralingam; from [[University of Wisconsin–Madison]]
* [http://www.cse.ust.hk/catalac/users/saven/GPGPU/MapReduce/PACT08/171.pdf "Mars: A MapReduce Framework on Graphics Processors"] — paper by Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, Tuyong Wang; from [[Hong Kong University of Science and Technology]]; published in Proc. PACT 2008. It presents the design and implementation of MapReduce on graphics processors.
* [http://www.springerlink.com/content/h17r882710314147/ "A Peer-to-Peer Framework for Supporting MapReduce Applications in Dynamic Cloud Environments"] — paper by Fabrizio Marozzo, Domenico Talia, Paolo Trunfio; from [[University of Calabria]]; published in Cloud Computing: Principles, Systems and Applications, N. Antonopoulos, L. Gillam (Editors), chapt. 7, pp.&nbsp;113–125, Springer, 2010, ISBN 978-1-84996-240-7.
* [http://portal.acm.org/citation.cfm?doid=1247480.1247602 "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters"] — paper by Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker; from [[Yahoo]] and [[UCLA]]; published in Proc. of ACM SIGMOD, pp.&nbsp;1029–1040, 2007. (This paper shows how to extend MapReduce for relational data processing.)
* FLuX: the [http://citeseer.ist.psu.edu/647742.html Fault-tolerant], [http://citeseer.ist.psu.edu/546646.html Load Balancing] eXchange operator from [[UC Berkeley]] provides an integration of partitioned parallelism with process pairs. This results in a more pipelined approach than Google's MapReduce with instantaneous failover, but with additional implementation cost.
* [http://infolab.stanford.edu/~ullman/pub/mapred.pdf "A New Computation Model for Rack-Based Computing"] — paper by Foto N. Afrati; Jeffrey D. Ullman; from [[Stanford University]]; Not published as of Nov 2009. This paper is an attempt to develop a general model in which one can compare algorithms for computing in an environment similar to what map-reduce expects.
* [http://portal.acm.org/beta/citation.cfm?id=1723112.1723129 FPMR: MapReduce framework on FPGA]—paper by Yi Shan, Bo Wang, Jing Yan, Yu Wang, Ningyi Xu, Huazhong Yang (2010), in FPGA '10, Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays.
* [http://ipads.se.sjtu.edu.cn/lib/exe/fetch.php?media=publications:ostrich-pact10.pdf "Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore with Tiling"]—paper by Rong Chen, Haibo Chen and Binyu Zang from [[Fudan University]]; published in Proc. PACT 2010. It presents the Tiled-MapReduce programming model which optimizes resource usages of MapReduce applications on multicore environment using tiling strategy.
* [http://ipads.se.sjtu.edu.cn/lib/exe/fetch.php?media=publications:ostrich-taco13.pdf "Tiled MapReduce: Efficient and Flexible MapReduce Processing on Multicore with Tiling"]—paper by Rong Chen, and Haibo Chen from [[Shanghai Jiao Tong University]]; published in ACM TACO, 10(1), 2013. It extends the earlier version of Ostrich to support several usage scenarios such as online and incremental computing on multicore machines.
* [http://dx.doi.org/10.1016/j.jpdc.2010.12.004 "Scheduling divisible MapReduce computations "]—paper by Joanna Berlińska from [[Adam Mickiewicz University]] and Maciej Drozdowski from [[Poznan University of Technology]]; Journal of Parallel and Distributed Computing 71 (2011) 450-459, {{doi|10.1016/j.jpdc.2010.12.004}}. It presents scheduling and performance model of MapReduce.
* [http://stratosphere.eu/files/NephelePACTs_10.pdf "Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing"]—paper by D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke from [http://www.tu-berlin.de/menue/home/parameter/en/ TU Berlin] published in Proc. of ACM SoCC 2010. The paper introduces the PACT programming model, a generalization of MapReduce, developed in the [http://www.stratosphere.eu Stratosphere] research project.
* [http://stratosphere.eu/files/ComparingMapReduceAndPACTs_11.pdf "MapReduce and PACT - Comparing Data Parallel Programming Models"]—paper by A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, and D. Warneke from [http://www.tu-berlin.de/menue/home/parameter/en/ TU Berlin] published in Proc. of BTW 2011.
 
{{Refend}}
 
; Books
{{Refbegin}}
* Jimmy Lin and Chris Dyer. [http://www.umiacs.umd.edu/~jimmylin/book.html "Data-Intensive Text Processing with MapReduce"] (manuscript)
{{Refend}}
 
;Educational courses
 
{{Refbegin}}
*[http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html Cluster Computing and MapReduce] course from [http://code.google.com/edu/ Google Code University] contains video lectures and related course materials from a series of lectures that was taught to Google software engineering interns during the Summer of 2007.
* [http://code.google.com/edu/submissions/mapreduce/listing.html MapReduce in a Week] course from [http://code.google.com/edu/ Google Code University] contains a comprehensive introduction to MapReduce including lectures, reading material, and programming assignments.
* [http://mr.iap.2008.googlepages.com/ MapReduce course], taught by engineers of [[Google]] Boston, part of 2008 Independent Activities Period at [[MIT]].
{{Refend}}
 
; Bibliography
{{Refbegin}}
* [http://www.columbia.edu/~ak2834/mapreduce.html MapReduce bibliography by A. Kamil, 2010]
{{Refend}}
 
{{Google LLC}}
{{Authority control}}
 
{{DEFAULTSORT:Mapreduce}}
[[Category:Parallel computing]]
[[Category:Distributed computing architecture]]
[[Category:Articles with example code]]