Parallelization contract: Difference between revisions

Content deleted Content added
Rescuing 2 sources and tagging 0 as dead. #IABot (v1.6.4)
Successfully de-orphaned! Wikiproject Orphanage: You can help!
 
(4 intermediate revisions by 4 users not shown)
Line 1:
{{Orphanoriginal research|date=DecemberMay 20132023}}
 
The '''parallelization contract''' or '''PACT''' programming model is a generalization of the [[MapReduce]] [[programming model]] and uses [[Higher-order function|second order functions]] to perform concurrent computations on large ([[Petabyte]]s) data sets in parallel.
Line 22:
 
Optionally, the user code can be annotated with additional information. These annotations disclose some information on the behavior of the black-box user function. The [[PactCompiler|PACT Compiler]] can utilize the information to obtain more efficient execution plans. However, while a missing annotation will not change the result of the execution, an incorrect Output Contract produces wrong results.
 
{{:wiki:pact.png?nolink&600|}}
 
The currently supported Input Contracts and annotation are presented and discussed in the following.
Line 38 ⟶ 36:
== MAP ==
 
The Map Input Contract works in the same way as in MapReduce. It has a single input and assigns each input record to its own subset. Hence, all records are processed independently from each other (see figure below). \\
 
{{:wiki:map.png?nolink&100|}}
 
== REDUCE ==
 
The Reduce Input Contract has the same semantics as the reduce function in MapReduce. It has a single input and groups together all records that have identical key fields. Each of these groups is handed as a whole to the user code and processed by it (see figure below). The PACT Programming Model does also support optional Combiners, e.g. for partial aggregations. \\
 
{{:wiki:reduce.png?nolink&100|}}
 
== CROSS ==
 
The Cross Input Contract works on two inputs. It builds the Cartesian product of the records of both inputs. Each element of the Cartesian product (pair of records) is handed to the user code. \\
 
{{:wiki:cross.png?nolink&160|}}
 
== MATCH ==
 
The Match Input Contract works on two inputs. From both inputs it matches those records that are identical on their key fields come from different inputs. Hence, it resembles an equality join where the keys of both inputs are the attributes to join on. Each matched pair of records is handed to the user code. \\
 
{{:wiki:match.png?nolink&160|}}
 
== COGROUP ==
 
The CoGroup Input Contract works on two inputs as well. It can be seen as a Reduce on two inputs. On each input, the records are grouped by key (such as Reduce does) and handed to the user code. In contrast to Match, the user code is also called for a key if only one input has a pair with it (see blue key in example below).
 
{{:wiki:cogroup.png?nolink&200|}}
 
=== Pact Record Data Model ===
Line 80 ⟶ 69:
== Constant Fields ==
 
The **'''Constant Fields**''' annotation marks fields that are not modified by the user code function. Note that for every input record a constant field may not change its content and position in any output record! In case of binary second-order functions such as Cross, Match, and CoGroup, the user can specify one annotation per input.
 
== Constant Fields Except ==
 
The **'''Constant Fields Except**''' annotation is inverse to the **Constant Fields** annotation. It annotates all fields which might be modified by the annotated user-function, hence the optimizer considers **any not annotated field as constant**. This annotation should be used very carefully! Again, for binary second-order functions (Cross, Match, CoGroup), one annotation per input can be defined. Note that either the Constant Fields or the Constant Fields Except annotation may be used for an input.
 
=== PACT Programs ===
 
PACT programs are constructed as data flow graphs that consist of data sources, PACTs, and data sinks. One or more data sources read files that contain the input data and generate records from those files. Those records are processed by one or more PACTs, each consisting of an Input Contract, user code, and optional code annotations. Finally, the results are written back to output files by one or more data sinks. In contrast to the MapReduce programming model, a PACT program can be arbitrary complex and has no fixed structure. \\
 
The figure below shows a PACT program with two data sources, four PACTs, and one data sink. Each data source reads data from a specified ___location in the file system. Both sources forward the data to respective PACTs with Map Input Contracts. The user code is not shown in the figure. The output of both Map PACTs streams into a PACT with a Match Input Contract. The last PACT has a Reduce Input Contract and forwards its result to the data sink.
Line 104 ⟶ 93:
 
For a more detailed comparison of the MapReduce and PACT programming models you can read our paper //"MapReduce and PACT - Comparing Data Parallel Programming Models"// (see our [https://www.stratosphere.eu/index.php?q=publications|publications page]).
 
== See also ==
 
{{Empty section|date=December 2013}}
 
== References ==
 
{{reflist}}
* [https://web.archive.org/web/20120424184436/http://www.stratosphere.eu/files/NephelePACTs_10.pdf "Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing"]—paper by D. Battré, S. Ewen, F. Hueske, O. Kao, [[Volker Markl|V. Markl]], and D. Warneke from [http://www.tu-berlin.de/menue/home/parameter/en/ TU Berlin] published in Proc. of ACM SoCC 2010. The paper introduces the PACT programming model, a generalization of MapReduce, developed in the [http://www.stratosphere.eu Stratosphere] research project.
* [https://web.archive.org/web/20120424184448/http://www.stratosphere.eu/files/ComparingMapReduceAndPACTs_11.pdf "MapReduce and PACT - Comparing Data Parallel Programming Models"]—paper by A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, and D. Warneke from [http://www.tu-berlin.de/menue/home/parameter/en/ TU Berlin] published in Proc. of BTW 2011.