Parallelization contract: Difference between revisions

Content deleted Content added
Rybec (talk | contribs)
Cleanup following Wikipedia:Articles for creation creation (AFCH)
m Tagging, added Empty section (1) tag using AWB (9814)
Line 1:
{{Orphan|date=December 2013}}
 
 
The '''parallelization contract''' or '''PACT''' programming model is a generalization of the [[MapReduce]] [[programming model]] and uses [[Higher-order_function|second order functions]] to perform concurrent computations on large ([[Petabyte]]s) data sets in parallel.
Line 18 ⟶ 17:
* User code annotations
 
The figure below shows how those components work together. Input Contracts split the input data into independently processable subset. The user code is called for each of these independent subsets. All calls can be executed in parallel, because the subsets are independent.
 
Optionally, the user code can be annotated with additional information. These annotations disclose some information on the behavior of the black-box user function. The [[PactCompiler|PACT Compiler]] can utilize the information to obtain more efficient execution plans. However, while a missing annotation will not change the result of the execution, an incorrect Output Contract produces wrong results.
Line 29 ⟶ 28:
 
Input Contracts split the input data of a PACT into independently processable subsets that are handed to the user function of the PACT.
Input Contracts vary in the number of data inputs and the way how independent subsets are generated.
 
More formally, Input Contracts are second-order functions with a first-order function (the user code), one or more input sets, and none or more key fields per input as parameters. The first-order function is called (one or) multiple times with subsets of the input set(s). Since the first-order functions have no side effects, each call is independent from each other and all calls can be done in parallel.
 
The second-order functions ''map()'' and ''reduce()'' of the MapReduce programming model are Input Contracts in the context of the PACT programming model.
Line 89 ⟶ 88:
PACT programs are constructed as data flow graphs that consist of data sources, PACTs, and data sinks. One or more data sources read files that contain the input data and generate records from those files. Those records are processed by one or more PACTs, each consisting of an Input Contract, user code, and optional code annotations. Finally, the results are written back to output files by one or more data sinks. In contrast to the MapReduce programming model, a PACT program can be arbitrary complex and has no fixed structure. \\
 
The figure below shows a PACT program with two data sources, four PACTs, and one data sink. Each data source reads data from a specified ___location in the file system. Both sources forward the data to respective PACTs with Map Input Contracts. The user code is not shown in the figure. The output of both Map PACTs streams into a PACT with a Match Input Contract. The last PACT has a Reduce Input Contract and forwards its result to the data sink.
 
{{:wiki:pactProgram.png?nolink&600|}}
Line 105 ⟶ 104:
 
== See also ==
 
{{Empty section|date=December 2013}}
 
== References ==
 
{{reflist}}
 
* [http://stratosphere.eu/files/NephelePACTs_10.pdf "Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing"] -- paper by D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke from [http://www.tu-berlin.de/menue/home/parameter/en/ TU Berlin] published in Proc. of ACM SoCC 2010. The paper introduces the PACT programming model, a generalization of MapReduce, developed in the [http://www.stratosphere.eu Stratosphere] research project.
* [http://stratosphere.eu/files/ComparingMapReduceAndPACTs_11.pdf "MapReduce and PACT - Comparing Data Parallel Programming Models"] -- paper by A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, and D. Warneke from [http://www.tu-berlin.de/menue/home/parameter/en/ TU Berlin] published in Proc. of BTW 2011.
 
== Further reading ==
 
*[http://users.sdsc.edu/~jianwu/JianwuWang_files/ICCS-bioKepler.pdf A Framework for Distributed Data-Parallel Execution in the Kepler Scientific Workflow System]
 
Line 122 ⟶ 121:
* [http://www.hpts.ws/papers/2011/posters/Poster2011_06_Bodner.pdf Stratosphere slide presentation]
* Video Lecture [http://www.tele-task.de/de/archive/lecturer/1853/ Parallel Dataflow Programming]
 
[[Category:Parallel computing]]
[[Category:Distributed computing architecture]]