m WP:CHECKWIKI error fix. Section heading problem. Violates WP:MOSHEAD. |
m External link with two brackets using AWB (9814) |
Line 1:
{{Orphan|date=December 2013}}
The '''parallelization contract''' or '''PACT''' programming model is a generalization of the [[MapReduce]] [[programming model]] and uses [[Higher-
== Overview ==
Line 26:
=== Input Contracts ===
Input Contracts split the input data of a PACT into independently processable subsets that are handed to the user function of the PACT.
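The splitting behavior described above can be illustrated with a minimal sketch. It is not the Stratosphere API: the function names `map_contract` and `reduce_contract` are hypothetical, and the sketch only assumes that a Map-style contract treats every record as its own subset while a Reduce-style contract groups records that share a key.

```python
from collections import defaultdict


def map_contract(records, user_fn):
    """Map-style contract (hypothetical): every record is its own subset."""
    out = []
    for record in records:
        out.extend(user_fn(record))
    return out


def reduce_contract(records, key_fn, user_fn):
    """Reduce-style contract (hypothetical): records sharing a key form one subset."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    out = []
    for key, group in groups.items():
        # Each group is independent of the others, so groups could run in parallel.
        out.extend(user_fn(key, group))
    return out


# Word count: the user functions contain no parallelization logic themselves.
words = ["a", "b", "a"]
pairs = map_contract(words, lambda w: [(w, 1)])
counts = reduce_contract(
    pairs,
    key_fn=lambda rec: rec[0],
    user_fn=lambda key, group: [(key, sum(n for _, n in group))],
)
```

In this sketch the contract decides how the data is split, and the user function only sees one subset at a time.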
Line 66 ⟶ 65:
=== Pact Record Data Model ===
In contrast to MapReduce, PACT uses a more generic data model of records ([[PactRecord|Pact Record]]) to pass data between functions. The Pact Record can be thought of as a tuple with a free schema. The interpretation of the fields of a record is up to the user function. A Key/Value pair (as in MapReduce) is a special case of that record with only two fields (the key and the value).
For input contracts that operate on keys (like ''Reduce'', ''Match'', or ''CoGroup''), one specifies which combination of the record's fields makes up the key. An arbitrary combination of fields may be used. See the
The record may be sparsely filled, i.e. it may have fields that have ''null'' values. It is legal to produce a record where, for example, only fields 2 and 5 are set; fields 1, 3, and 4 are then interpreted as ''null''. Fields that a contract uses as key fields, however, must not be null; otherwise an exception is raised.
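As a rough illustration of the record semantics described above, the following sketch models a sparsely filled record whose key is an arbitrary combination of fields. The class `SparseRecord` and its methods are hypothetical stand-ins, not the actual Pact Record interface, and the choice of `ValueError` for the null-key case is an assumption.

```python
class SparseRecord:
    """Hypothetical stand-in for a Pact Record: a tuple with a free schema."""

    def __init__(self):
        self._fields = {}  # field position -> value; unset positions are null

    def set_field(self, position, value):
        self._fields[position] = value

    def get_field(self, position):
        # Unset fields are interpreted as null (None in this sketch).
        return self._fields.get(position)

    def key(self, key_positions):
        """Build the key from an arbitrary combination of field positions."""
        values = tuple(self.get_field(p) for p in key_positions)
        if any(v is None for v in values):
            # Assumption: a null key field is reported as a ValueError here.
            raise ValueError(f"key fields {key_positions} must not be null")
        return values


record = SparseRecord()
record.set_field(2, "x")
record.set_field(5, 7)
composite_key = record.key((2, 5))  # both key fields are set, so this succeeds
```

Calling `record.key((1, 2))` on the same record would raise, because field 1 was never set.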
=== User code annotations ===
User code annotations are optional in the PACT programming model. They allow developers to make certain behaviors of their user code explicit to the optimizer. The PACT optimizer can use that information to obtain more efficient execution plans. Omitting a valid annotation does not affect the correctness of the result. Invalidly specified annotations, on the other hand, might cause wrong results to be computed. In the following, we list the current set of available Output Contracts.
Line 88 ⟶ 85:
=== PACT Programs ===
PACT programs are constructed as data flow graphs that consist of data sources, PACTs, and data sinks. One or more data sources read files that contain the input data and generate records from those files. Those records are processed by one or more PACTs, each consisting of an Input Contract, user code, and optional code annotations. Finally, the results are written back to output files by one or more data sinks. In contrast to the MapReduce programming model, a PACT program can be arbitrarily complex and has no fixed structure.
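The graph structure described above can be sketched as a small directed acyclic graph. Everything here is an assumption made for illustration: the `Node` class, the in-memory lists standing in for input files, and the list `sink_output` standing in for an output file are not part of the PACT API.

```python
class Node:
    """Hypothetical data flow node: a function applied to the output of its inputs."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def evaluate(self):
        # Evaluate upstream nodes first, then apply this node's function.
        return self.fn(*(node.evaluate() for node in self.inputs))


# Two data sources (in-memory lists instead of files in this sketch).
orders = Node(lambda: [(1, "book"), (2, "pen"), (1, "lamp")])  # (customer_id, item)
customers = Node(lambda: [(1, "Ada"), (2, "Bob")])             # (customer_id, name)


def match_on_first_field(left, right):
    """Match-style step: pair up records from both inputs with equal first field."""
    return [(l[0], l[1], r[1]) for l in left for r in right if l[0] == r[0]]


joined = Node(match_on_first_field, orders, customers)
upper = Node(lambda recs: [(cid, item, name.upper()) for cid, item, name in recs], joined)

# Data sink (appends to a list instead of writing an output file).
sink_output = []
sink = Node(lambda recs: sink_output.extend(recs), upper)
sink.evaluate()
```

The plan has two sources feeding one two-input step followed by a single-input step, a shape that does not fit the fixed map-then-reduce structure of a single MapReduce job.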
Line 97 ⟶ 93:
=== Advantages of PACT over MapReduce ===
* The PACT programming model encourages a more modular programming style. Although the number of user functions is usually higher, they are more fine-grained and focus on specific problems. Hence, the interweaving of functionality, which is common in MapReduce jobs, can be avoided.
Line 106 ⟶ 101:
* PACTs specify data parallelization in a declarative way, which leaves several degrees of freedom to the system. These degrees of freedom are an important prerequisite for automatic optimization. The [[PactCompiler|PACT compiler]] enumerates different execution strategies and chooses the strategy with the least estimated amount of data to ship. In contrast, Hadoop always executes MapReduce jobs with the same strategy.
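The strategy selection mentioned in the last point can be illustrated with a toy cost model. This is not the PACT compiler's actual cost model: the three strategy names and the formulas for shipped data are simplifying assumptions for a two-input contract such as ''Match'' running on `num_nodes` machines.

```python
def estimate_shipped_bytes(left_size, right_size, num_nodes):
    """Toy estimates (assumed formulas) of data shipped over the network per strategy."""
    return {
        # Re-partition both inputs by key: each input crosses the network about once.
        "repartition_both": left_size + right_size,
        # Broadcast one input to every node and leave the other input in place.
        "broadcast_left": left_size * num_nodes,
        "broadcast_right": right_size * num_nodes,
    }


def choose_strategy(left_size, right_size, num_nodes):
    """Enumerate the candidate strategies and pick the cheapest estimate."""
    costs = estimate_shipped_bytes(left_size, right_size, num_nodes)
    return min(costs, key=costs.get)


# A large input joined with a tiny one: broadcasting the tiny side ships the least.
skewed_choice = choose_strategy(1_000_000, 100, num_nodes=10)
# Two inputs of equal size: re-partitioning both ships the least.
balanced_choice = choose_strategy(1_000, 1_000, num_nodes=10)
```

Because the contract only declares which records must be processed together, either strategy yields the same result, which is what leaves the system free to choose.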
For a more detailed comparison of the MapReduce and PACT programming models, you can read our paper ''"MapReduce and PACT - Comparing Data Parallel Programming Models"'' (see our
== See also ==
Line 115 ⟶ 110:
{{reflist}}
* [http://stratosphere.eu/files/NephelePACTs_10.pdf "Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing"]
* [http://stratosphere.eu/files/ComparingMapReduceAndPACTs_11.pdf "MapReduce and PACT - Comparing Data Parallel Programming Models"]
== Further reading ==