Parallelization contract: Difference between revisions

BG19bot (talk | contribs)
m WP:CHECKWIKI error fix. Section heading problem. Violates WP:MOSHEAD.
BG19bot (talk | contribs)
m External link with two brackets using AWB (9814)
Line 1:
{{Orphan|date=December 2013}}
 
The '''parallelization contract''' or '''PACT''' programming model is a generalization of the [[MapReduce]] [[programming model]] that uses [[Higher-order function|second-order functions]] to process large ([[petabyte]]-scale) data sets in parallel.
 
== Overview ==
Line 26:
 
=== Input Contracts ===
 
 
Input Contracts split the input data of a PACT into independently processable subsets that are handed to the user function of the PACT.
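
The splitting behaviour can be illustrated with a minimal Python sketch. It is not the Stratosphere API: the function names <code>map_subsets</code>, <code>reduce_subsets</code>, and <code>match_subsets</code> are hypothetical, and records are modelled as plain tuples. Each function returns the subsets that would be handed, one at a time, to the user function.

```python
from collections import defaultdict


def map_subsets(records):
    """Map-style contract: every record is its own independent subset."""
    return [[r] for r in records]


def reduce_subsets(records, key):
    """Reduce-style contract: all records sharing a key form one subset."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return list(groups.values())


def match_subsets(left, right, key):
    """Match-style contract: each pair of records (one per input) with equal keys is one subset."""
    right_by_key = defaultdict(list)
    for r in right:
        right_by_key[key(r)].append(r)
    return [(l, r) for l in left for r in right_by_key.get(key(l), [])]


def first_field(record):
    # Illustrative key selector: the key is the first field of the tuple.
    return record[0]


records = [("a", 1), ("b", 2), ("a", 3)]
singletons = map_subsets(records)
groups = reduce_subsets(records, first_field)
pairs = match_subsets([("a", 1), ("b", 2)], [("a", "x"), ("c", "y")], first_field)
```

Because the subsets do not depend on each other, a system is free to process them on different machines.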
Line 66 ⟶ 65:
 
=== Pact Record Data Model ===
 
 
In contrast to MapReduce, PACT uses a more generic data model of records ([[PactRecord|Pact Record]]) to pass data between functions. The Pact Record can be thought of as a tuple with a free schema. The interpretation of the fields of a record is up to the user function. A Key/Value pair (as in MapReduce) is a special case of that record with only two fields (the key and the value).
 
For input contracts that operate on keys (like ''Reduce'', ''Match'', or ''CoGroup''), one specifies which combination of the record's fields makes up the key. An arbitrary combination of fields may be used. See the [https://github.com/stratosphere-eu/stratosphere/blob/master/pact/pact-examples/src/main/java/eu/stratosphere/pact/example/relational/TPCHQuery3.java TPCH Query Example] for how programs that define ''Reduce'' and ''Match'' contracts on one or more fields can be written so that they move as little data as possible between fields.
 
The record may be sparsely filled, i.e. it may have fields that have ''null'' values. It is legal to produce a record where, for example, only fields 2 and 5 are set; fields 1, 3, and 4 are then interpreted as ''null''. Fields that a contract uses as key fields, however, must not be null; otherwise an exception is raised.
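
The record semantics described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual Pact Record class: the name <code>SparseRecord</code>, its methods, and the choice of <code>ValueError</code> as the exception type are all hypothetical.

```python
class SparseRecord:
    """Hypothetical stand-in for a Pact Record: a schema-free, sparsely filled tuple."""

    def __init__(self, fields=None):
        # Maps field position -> value; positions that are absent count as null.
        self._fields = dict(fields or {})

    def set_field(self, position, value):
        self._fields[position] = value

    def get_field(self, position):
        # Unset fields are interpreted as null (None in Python).
        return self._fields.get(position)

    def key(self, positions):
        """Build a key from an arbitrary combination of field positions."""
        values = tuple(self.get_field(p) for p in positions)
        if any(v is None for v in values):
            # Assumption: the exception type is chosen for illustration only.
            raise ValueError(f"null key field among positions {positions}")
        return values


# Only fields 2 and 5 are set; fields 1, 3, and 4 are null.
record = SparseRecord({2: "BUILDING", 5: 42})
composite_key = record.key((2, 5))

# A MapReduce-style key/value pair is the special case with two fields.
pair = SparseRecord({0: "word", 1: 1})
```

Calling <code>record.key((1, 2))</code> in this sketch raises an error, because field 1 is null and therefore cannot take part in a key.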
 
=== User code annotations ===
 
 
User code annotations are optional in the PACT programming model. They allow the developer to make certain behaviors of their user code explicit to the optimizer, which can use that information to obtain more efficient execution plans. Omitting a valid annotation does not affect the correctness of the result. An invalid annotation, on the other hand, may cause wrong results to be computed. The following lists the current set of available Output Contracts.
Line 88 ⟶ 85:
 
=== PACT Programs ===
 
 
PACT programs are constructed as data flow graphs that consist of data sources, PACTs, and data sinks. One or more data sources read files that contain the input data and generate records from those files. Those records are processed by one or more PACTs, each consisting of an Input Contract, user code, and optional code annotations. Finally, the results are written back to output files by one or more data sinks. In contrast to the MapReduce programming model, a PACT program can be arbitrarily complex and has no fixed structure.
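
The graph structure can be sketched as follows. This is a simplified, single-machine illustration and not the Stratosphere plan API: the function <code>run_plan</code>, the operator names, and the sample data are hypothetical, and in-memory lists stand in for input and output files.

```python
def run_plan(sources, operators, sinks):
    """Evaluate a small data flow graph.

    sources:   name -> list of records (stands in for reading input files)
    operators: name -> (function, [input names]); the function maps input lists to an output list
    sinks:     names whose results are returned (stands in for writing output files)
    """
    results = dict(sources)

    def evaluate(name):
        # Compute each node once, after all of its inputs have been computed.
        if name not in results:
            function, inputs = operators[name]
            results[name] = function(*(evaluate(i) for i in inputs))
        return results[name]

    return {name: evaluate(name) for name in sinks}


def keep_open(orders):
    # Map-like step: each order is inspected on its own.
    return [o for o in orders if o[1] == "open"]


def join_on_id(orders, items):
    # Match-like step: pair orders and items that share an order id.
    return [(o[0], i[1]) for o in orders for i in items if o[0] == i[0]]


def total_per_order(pairs):
    # Reduce-like step: sum the prices of all items with the same order id.
    totals = {}
    for order_id, price in pairs:
        totals[order_id] = totals.get(order_id, 0.0) + price
    return sorted(totals.items())


sources = {
    "orders": [(1, "open"), (2, "closed")],
    "items": [(1, 10.0), (1, 5.0), (2, 7.0)],
}
operators = {
    "open_orders": (keep_open, ["orders"]),
    "joined": (join_on_id, ["open_orders", "items"]),
    "totals": (total_per_order, ["joined"]),
}
output = run_plan(sources, operators, sinks=["totals"])
```

The plan has two sources, three chained operators, and one sink, a shape that a single MapReduce job with its fixed map-then-reduce structure does not express directly.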
Line 97 ⟶ 93:
 
=== Advantages of PACT over MapReduce ===
 
 
* The PACT programming model encourages a more modular programming style. Although the number of user functions is usually higher, they are more fine-grained and focus on specific problems. Hence, the interweaving of functionality that is common in MapReduce jobs can be avoided.
Line 106 ⟶ 101:
* PACTs specify data parallelization in a declarative way, which leaves several degrees of freedom to the system. These degrees of freedom are an important prerequisite for automatic optimization. The [[PactCompiler|PACT compiler]] enumerates different execution strategies and chooses the strategy with the least estimated amount of data to ship. In contrast, Hadoop always executes MapReduce jobs with the same strategy.
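
The idea of choosing the strategy with the least estimated amount of data to ship can be sketched for a two-input join. The candidate strategies and the cost formulas below are deliberately crude assumptions made for illustration; they are not the cost model of the PACT compiler, and the function name <code>choose_join_strategy</code> is hypothetical.

```python
def choose_join_strategy(left_bytes, right_bytes, nodes):
    """Pick the candidate strategy with the smallest estimated shipped volume.

    Assumed estimates (illustrative only):
      - repartitioning ships both inputs across the network once
      - broadcasting ships one input to every node
    """
    candidates = {
        "repartition both inputs": left_bytes + right_bytes,
        "broadcast left input": left_bytes * nodes,
        "broadcast right input": right_bytes * nodes,
    }
    best = min(candidates, key=candidates.get)
    return best, candidates[best]


# A large input joined with a small one: broadcasting the small side ships less data.
skewed = choose_join_strategy(10**9, 10**6, nodes=8)

# Two inputs of equal size: repartitioning both is cheaper than broadcasting either.
balanced = choose_join_strategy(10**9, 10**9, nodes=8)
```

Because the join is only declared and not tied to one execution strategy, the choice can change with the input sizes without any change to the user code.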
 
For a more detailed comparison of the MapReduce and PACT programming models, see the paper ''"MapReduce and PACT - Comparing Data Parallel Programming Models"'' (listed on the Stratosphere [https://www.stratosphere.eu/index.php?q=publications publications page]).
 
== See also ==
Line 115 ⟶ 110:
 
{{reflist}}
* [http://stratosphere.eu/files/NephelePACTs_10.pdf "Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing"], a paper by D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke from [http://www.tu-berlin.de/menue/home/parameter/en/ TU Berlin] published in Proc. of ACM SoCC 2010. The paper introduces the PACT programming model, a generalization of MapReduce, developed in the [http://www.stratosphere.eu Stratosphere] research project.
* [http://stratosphere.eu/files/ComparingMapReduceAndPACTs_11.pdf "MapReduce and PACT - Comparing Data Parallel Programming Models"], a paper by A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, and D. Warneke from [http://www.tu-berlin.de/menue/home/parameter/en/ TU Berlin] published in Proc. of BTW 2011.
 
== Further reading ==