{{Short description|Method of improving computer program speed}}
{{More citations needed|date=February 2008}}
{{Use dmy dates|date=January 2022}}
{{Use list-defined references|date=January 2022}}
'''Automatic parallelization''', also '''auto parallelization''' or '''autoparallelization''', refers to converting sequential [[source code|code]] into [[multi-threaded]] and/or [[Automatic vectorization|vectorized]] code in order to use multiple processors simultaneously in a shared-memory [[multiprocessor]] ([[Symmetric multiprocessing|SMP]]) machine.<ref name="Yehezkael_2000"/> The goal of automatic parallelization is to relieve programmers from the tedious and error-prone manual parallelization process. Fully automatic parallelization of sequential programs by compilers remains a grand challenge because it requires complex [[program analysis (computer science)|program analysis]] and because the best approach may depend upon parameter values that are not known at compilation time.<ref name="Fox-Williams_Messina_1994"/>
 
The programming control structures on which autoparallelization places the most focus are [[Control flow#Loops|loop]]s, because, in general, most of the [[Run time (program lifecycle phase)|execution time]] of a program takes place inside some form of loop.
There are two main approaches to parallelization of loops: pipelined multi-threading and cyclic multi-threading.<ref name="Campanoni-Jones-Holloway-Wei-Brooks_2012"/> For example, consider a loop that on each iteration applies a hundred operations, and runs for a thousand iterations. This can be thought of as a grid of 100 columns by 1000 rows, a total of 100,000 operations. Cyclic multi-threading assigns each row to a different thread. Pipelined multi-threading assigns each column to a different thread.
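
The following minimal [[Fortran]] sketch (an illustration only, much smaller than the 100×1000 grid above and not tied to any particular compiler) shows the kind of loop being decomposed: each pass through the loop is one "row", and each statement in the body is one "column". Cyclic multi-threading would give whole iterations (rows) to different threads, whereas pipelined multi-threading would give individual statements (columns) to different threads.
<syntaxhighlight lang=FORTRAN>
! Illustrative loop: 4 "operations" (columns) repeated for n iterations (rows).
! Cyclic multi-threading  : a thread executes every statement of some iterations.
! Pipelined multi-threading: a thread executes one statement for every iteration.
program grid_view
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), b(n), c(n), d(n)
  integer :: i

  do i = 1, n
     a(i) = real(i)          ! operation 1
     b(i) = a(i) * 2.0       ! operation 2
     c(i) = b(i) + 1.0       ! operation 3
     d(i) = c(i) / 3.0       ! operation 4
  end do

  print *, d(n)
end program grid_view
</syntaxhighlight>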
 
==Automatic parallelization technique==
===Parse===
This is the first stage, where the scanner will read the input source files to identify all static and extern usages. Each line in the file will be checked against pre-defined patterns to segregate it into [[token (parser)|token]]s. These tokens will be stored in a file which will be used later by the grammar engine. The grammar engine will check patterns of tokens that match pre-defined rules to identify variables, loops, control statements, functions, etc. in the code.
 
===Analyze===
The [[analyzer]] identifies sections of code that can be executed concurrently, using the static data information provided by the scanner-parser. It first finds all the totally independent functions and marks them as individual tasks, and then determines which tasks have dependencies.
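
For illustration, consider the following fragment (hypothetical, not taken from any particular tool). An analyzer could mark <code>init_a</code> and <code>init_b</code> as independent tasks, because neither reads data written by the other, while <code>add</code> depends on both:
<syntaxhighlight lang=FORTRAN>
program analyze_example
  implicit none
  integer, parameter :: n = 100
  real :: a(n), b(n), c(n)

  call init_a(a, n)      ! task 1: writes only a
  call init_b(b, n)      ! task 2: writes only b, independent of task 1
  call add(a, b, c, n)   ! task 3: reads a and b, so it depends on tasks 1 and 2
  print *, c(n)

contains

  subroutine init_a(a, n)
    integer, intent(in) :: n
    real, intent(out) :: a(n)
    integer :: i
    do i = 1, n
       a(i) = real(i) * 2.0
    end do
  end subroutine init_a

  subroutine init_b(b, n)
    integer, intent(in) :: n
    real, intent(out) :: b(n)
    integer :: i
    do i = 1, n
       b(i) = real(i) + 5.0
    end do
  end subroutine init_b

  subroutine add(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in) :: a(n), b(n)
    real, intent(out) :: c(n)
    integer :: i
    do i = 1, n
       c(i) = a(i) + b(i)
    end do
  end subroutine add

end program analyze_example
</syntaxhighlight>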
 
===Schedule===
The [[Scheduling (computing)|scheduler]] will list all the tasks and their dependencies on each other in terms of execution and start times. It will produce an optimal schedule, either in terms of the number of processors to be used or the total execution time of the application.
 
===Code generation===
The [[Scheduling (computing)|scheduler]] will generate a list of all the tasks, the cores on which they will execute, and the times for which they will execute. The code generator will insert special constructs in the code that will be read during execution by the scheduler. These constructs will instruct the scheduler on which core a particular task will execute, along with its start and end times.
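
The constructs inserted by a real tool are specific to its runtime scheduler. As a rough, hypothetical illustration, the two independent tasks from the example in the Analyze section above could instead be expressed with standard [[OpenMP]] directives, which leave the choice of core to the OpenMP runtime rather than encoding explicit cores and time slots:
<syntaxhighlight lang=FORTRAN>
! Hypothetical generated code for the example above: the two independent
! tasks run as parallel sections; the dependent task runs after they finish.
!$omp parallel sections
!$omp section
call init_a(a, n)        ! task 1
!$omp section
call init_b(b, n)        ! task 2, may run concurrently with task 1
!$omp end parallel sections

call add(a, b, c, n)     ! task 3, executed only after both sections complete
</syntaxhighlight>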
 
==Cyclic multi-threading==
A cyclic multi-threading parallelizing compiler tries to [[loop splitting|split up a loop]] so that each [[iteration]] can be executed on a separate [[microprocessor|processor]] concurrently.
 
===Compiler parallelization analysis===
The ''compiler'' usually conducts two passes of analysis before actual parallelization in order to determine the following:
 
* Is it safe to parallelize the loop? Answering this question requires accurate [[dependence analysis]] and [[alias analysis]].
* Is it worthwhile to parallelize it? This answer requires a reliable estimation (modeling) of the program workload and the capacity of the parallel system.
 
===Example===
 
A loop is called DOALL if all of its iterations, in any given invocation, can be executed concurrently.
 
The [[Fortran]] code below is DOALL, and can be auto-parallelized by a compiler because each iteration is independent of the others, and the final result of array <code>z</code> will be correct regardless of the order in which the iterations are executed.
<syntaxhighlight lang=FORTRAN>
do i = 1, n
z(i) = x(i) + y(i)
enddo
</syntaxhighlight>
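
An auto-parallelizing compiler targeting a shared-memory machine might, for instance, transform this loop into the equivalent of the following [[OpenMP]] code (an illustrative sketch; the actual output depends on the compiler):
<syntaxhighlight lang=FORTRAN>
! Iterations are distributed across the available threads; this is safe
! because no iteration reads a value written by another iteration.
!$omp parallel do private(i) shared(x, y, z)
do i = 1, n
   z(i) = x(i) + y(i)
enddo
!$omp end parallel do
</syntaxhighlight>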
 
There are many [[pleasingly parallel]] problems that have such DOALL loops. For example, when [[parallel rendering|rendering]] a ray-traced movie, each frame of the movie can be independently rendered, and each pixel of a single frame may be independently rendered.
 
On the other hand, the following code cannot be auto-parallelized, because the value of <code>z(i)</code> depends on the result of the previous iteration, <code>z(i - 1)</code>.
<syntaxhighlight lang=FORTRAN>
do i = 2, n
z(i) = z(i - 1)*2
enddo
</syntaxhighlight>
This does not mean that the code cannot be parallelized. Indeed, it is equivalent to the DOALL loop
 
<syntaxhighlight lang=FORTRAN>
do i = 2, n
z(i) = z(1)*2**(i - 1)
enddo
</syntaxhighlight>
 
However, current parallelizing compilers are not usually capable of bringing out these parallelisms automatically, and it is questionable whether this code would benefit from parallelization in the first place.
 
==Pipelined multi-threading==
{{main | software pipelining }}
 
A pipelined multi-threading parallelizing compiler tries to break up the sequence of operations inside a loop into a series of code blocks, such that each code block can be executed on separate [[Microprocessor|processor]]s concurrently.
 
There are many pleasingly parallel problems that have such relatively independent code blocks, in particular systems using [[pipes and filters]].
 
For example, when producing live broadcast television, the following tasks must be performed many times a second:
# Read a frame of raw pixel data from the image sensor,
# Do MPEG [[motion compensation]] on the raw data,
# Entropy compress the motion vectors and other data,
# Break up the compressed data into packets,
# Add the appropriate error correction and do an FFT to convert the data packets into [[COFDM]] signals, and
# Send the COFDM signals out the TV antenna.
 
A pipelined multi-threading parallelizing compiler could assign each of these six operations to a different processor, perhaps arranged in a [[systolic array]], inserting the appropriate code to forward the output of one processor to the next processor.
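
The essence of pipelined execution can be sketched by hand with [[OpenMP]] tasks (an illustration only; the dummy two-stage program below is not the output of any real compiler). Each stage is kept in frame order, yet stage 1 of a later frame may run concurrently with stage 2 of an earlier frame on another processor:
<syntaxhighlight lang=FORTRAN>
program pipeline_sketch
  implicit none
  integer, parameter :: nframes = 8
  real :: raw(nframes), filtered(nframes), packed(nframes)
  integer :: stage1_order, stage2_order   ! dummy variables used only to order tasks
  integer :: i

  raw = [(real(i), i = 1, nframes)]

  !$omp parallel
  !$omp single
  do i = 1, nframes
     ! Stage 1 (stand-in for, e.g., motion compensation): kept in frame order
     ! by the inout dependence on stage1_order.
     !$omp task firstprivate(i) depend(inout: stage1_order) depend(out: filtered(i))
     filtered(i) = 0.5 * raw(i)
     !$omp end task
     ! Stage 2 (stand-in for, e.g., compression): waits for stage 1 of the same
     ! frame, but can overlap with stage 1 of later frames.
     !$omp task firstprivate(i) depend(in: filtered(i)) depend(inout: stage2_order)
     packed(i) = filtered(i) + 100.0
     !$omp end task
  end do
  !$omp end single
  !$omp end parallel

  print *, packed
end program pipeline_sketch
</syntaxhighlight>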
 
Recent research focuses on using the power of GPUs<ref name="Anantpur-Govindarajan_2013"/> and multicore systems<ref name="Zhuang-Eichenberger-Luo-OBrien"/> to compute such independent code blocks (or simply independent iterations of a loop) at runtime.
The memory accesses of different iterations of a loop (whether direct or indirect) can be marked and compared to detect dependencies. Using this information, the iterations are grouped into levels such that iterations belonging to the same level are independent of each other and can be executed in parallel.
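
A rough sketch of the leveling idea follows (hypothetical code, not taken from the cited work): for a loop that updates <code>a(idx(i))</code>, the recorded index of each iteration is compared with those of earlier iterations, and conflicting iterations are pushed into later levels; all iterations in one level could then run in parallel.
<syntaxhighlight lang=FORTRAN>
program leveling_sketch
  implicit none
  integer, parameter :: n = 8
  integer :: idx(n) = [1, 2, 1, 3, 2, 4, 1, 3]   ! indirect access pattern seen at runtime
  integer :: level(n)
  integer :: i, j

  do i = 1, n
     level(i) = 1
     do j = 1, i - 1
        ! A conflict (same memory location) forces iteration i into a
        ! level after iteration j.
        if (idx(j) == idx(i)) level(i) = max(level(i), level(j) + 1)
     end do
  end do

  do i = 1, n
     print '(a,i0,a,i0)', 'iteration ', i, ' -> level ', level(i)
  end do
end program leveling_sketch
</syntaxhighlight>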
 
==Difficulties==
Automatic parallelization by compilers or tools is very difficult for the following reasons:<ref name="Blitzprog"/>
* dependence analysis is hard for code that uses indirect addressing, pointers, recursion, or indirect function calls because it is difficult to detect such dependencies at compile time (see the example following this list);
* loops have an unknown number of iterations;
* accesses to global resources are difficult to coordinate in terms of memory allocation, I/O, and shared variables;
* ''irregular algorithms'' that use input-dependent indirection interfere with compile-time analysis and optimization.<ref name="Rünger_2006"/>
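
For example (an illustrative fragment), the compiler cannot in general decide whether two iterations of the loop below write to the same element of <code>z</code>, because that depends on the run-time contents of the index array <code>idx</code>:
<syntaxhighlight lang=FORTRAN>
! Safe to parallelize only if idx contains no duplicate values,
! which is usually unknown at compile time.
do i = 1, n
   z(idx(i)) = z(idx(i)) + y(i)
enddo
</syntaxhighlight>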
 
==Workaround==
Due to the inherent difficulties in fully automatic parallelization, several easier approaches exist to produce a parallel program of higher quality:
* Allow programmers to add "hints" to their programs to guide compiler parallelization, such as [[High Performance Fortran|HPF]] for [[distributed memory]] systems and [[OpenMP]] or [[OpenHMPP]] for [[shared memory (interprocess communication)|shared memory]] systems (see the example after this list).
* Build an interactive system between programmers and parallelizing tools/compilers. Notable examples are [[Vector Fabrics, B.V.|Vector Fabrics]]' Pareon, [[SUIF]] Explorer (the Stanford University Intermediate Format compiler), the Polaris compiler, and ParaWise (formerly CAPTools).
* Use hardware-supported [[speculative multithreading]].
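
With the hint-based approach, for example, the programmer asserts properties that the compiler cannot easily prove on its own. In the illustrative fragment below, an [[OpenMP]] directive states that the iterations may run in parallel and that <code>s</code> is a reduction variable:
<syntaxhighlight lang=FORTRAN>
! The directive is a programmer-supplied hint: the compiler no longer has to
! prove by itself that the summation can be safely reordered across threads.
s = 0.0
!$omp parallel do reduction(+:s)
do i = 1, n
   s = s + x(i) * y(i)
enddo
!$omp end parallel do
</syntaxhighlight>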
 
==Parallelizing compilers and tools==
{{See also|Automatic parallelization tool}}
 
Most research [[compiler]]s for automatic parallelization consider [[Fortran]] programs,{{Citation needed|date=July 2007}} because Fortran makes stronger guarantees about [[aliasing (computing)|aliasing]] than languages such as [[C (programming language)|C]]. Typical examples are:
* [http://www.ece.northwestern.edu/cpdc/Paradigm/Paradigm.html Paradigm compiler]
* [https://engineering.purdue.edu/Cetus/Documentation/manual/ch02s02.html Polaris compiler]
* [https://scholarship.rice.edu/handle/1911/16677 Rice Fortran D compiler]
* [[SUIF]] compiler
* [https://dl.acm.org/doi/10.1155/1999/304639 Vienna Fortran compiler]
 
In 2023, Aubert, Rubiano, Rusch, and [[Thomas Seiller|Seiller]]<ref>{{cite book |last1=Aubert |first1=Clément |last2=Rubiano |first2=Thomas |last3=Rusch |first3=Neea |last4=Seiller |first4=Thomas |chapter=Distributing and Parallelizing Non-canonical Loops |title=Verification, Model Checking, and Abstract Interpretation |series=Lecture Notes in Computer Science |date=2023 |volume=13881 |pages=91–108 |doi=10.1007/978-3-031-24950-1_1 |isbn=978-3-031-24949-5}}</ref> used a dependency analysis technique<ref>{{cite book |last1=Moyen |first1=Jean-Yves |last2=Rubiano |first2=Thomas |last3=Seiller |first3=Thomas |chapter=Loop Quasi-Invariant Chunk Detection |title=Automated Technology for Verification and Analysis |series=Lecture Notes in Computer Science |date=2017 |volume=10482 |pages=91–108 |doi=10.1007/978-3-319-68167-2_7 |isbn=978-3-319-68166-5}}</ref> to automatically parallelise loops in [[C (programming language)|C]] code.
 
==See also==
* [[Loop nest optimization]]
* [[Parallelization contract]]
* [[Polytope model]], also known as the polyhedral model
* [[Scalable parallelism]]
* [[BMDFM]]
* [[Vectorization (disambiguation)|Vectorization]]
* [[SequenceL]]
 
==References==
{{Reflist|refs=
<ref name="Yehezkael_2000">{{cite book |author-last=Yehezkael |author-first=Rafael |title=Applied Parallel Computing. New Paradigms for HPC in Industry and Academia |chapter=Experiments in Separating Computational Algorithm from Program Distribution and Communication |series=[[Lecture Notes in Computer Science]] |publisher=[[Springer Verlag]] |date=2000 |volume=1947 |pages=268–278 |doi=10.1007/3-540-70734-4_32 |isbn=978-3-540-41729-3 |chapter-url=http://u.cs.biu.ac.il/~wiseman/para2001.pdf}}</ref>
<ref name="Fox-Williams_Messina_1994">{{cite book |author-last1=Fox |author-first1=Geoffrey |author-first2=Roy |author-last2=Williams |author-first3=Paul |author-last3=Messina |title=Parallel Computing Works! |date=1994 |publisher=[[Morgan Kaufmann]] |pages=575, 593 |isbn=978-1-55860-253-3}}</ref>
<ref name="Campanoni-Jones-Holloway-Wei-Brooks_2012">{{cite book |title=The HELIX Project: Overview and Directions |author-first1=Simone |author-last1=Campanoni |author-first2=Timothy |author-last2=Jones |author-first3=Glenn |author-last3=Holloway |author-first4=Gu-Yeon |author-last4=Wei |author-first5=David |author-last5=Brooks |date=2012 |url=http://helix.eecs.harvard.edu/index.php/DAC2012}}</ref>
<ref name="Anantpur-Govindarajan_2013">{{cite web |title=Runtime dependence computation and execution of loops on heterogeneous systems |author-first1=J. |author-last1=Anantpur |author-first2=R. |author-last2=Govindarajan |url=http://hpc.serc.iisc.ernet.in/~jayvant/papers/CGO-2013.pdf |access-date=2015-10-05 |url-status=dead |archive-url=https://web.archive.org/web/20151006123251/http://hpc.serc.iisc.ernet.in/~jayvant/papers/CGO-2013.pdf |archive-date=2015-10-06}}</ref>
<ref name="Zhuang-Eichenberger-Luo-OBrien">{{citation |title=Exploiting Parallelism with Dependence-Aware Scheduling |author-first1=X. |author-last1=Zhuang |author-first2=A. E. |author-last2=Eichenberger |author-first3=Y. |author-last3=Luo |author-last4=O'Brien |author-first4=Kathryn Kevin |url=https://www.researchgate.net/publication/220884771}}</ref>
<ref name="Blitzprog">{{cite web |title=Automatic parallelism and data dependency |url=http://blitzprog.org/posts/automatic-parallelism-and-data-dependency |url-status=dead |archive-url=https://web.archive.org/web/20140714111836/http://blitzprog.org/posts/automatic-parallelism-and-data-dependency |archive-date=2014-07-14}}</ref>
<ref name="Rünger_2006">{{cite journal |title=Parallel Programming Models for Irregular Algorithms |author-last=Rünger |author-first=Gudula |journal=Parallel Algorithms and Cluster Computing |date=2006 |volume=52 |pages=3–23 |doi=10.1007/3-540-33541-2_1 |series=Lecture Notes in Computational Science and Engineering |isbn=978-3-540-33539-9}}</ref>
}}
 
==Further reading==
* {{cite magazine |title=Configuring parallel programs, Part 1: The Occam Transpiler, now under development, will make writing software for parallel processing easier |author-first=Dick |author-last=Pountain |magazine=[[BYTE (magazine)|BYTE]] |publisher=[[McGraw-Hill, Inc.]] |issn=0360-5280 |volume=14 |number=13 |series= |date=December 1989 |id=<!-- |ia=byte-magazine-1989-12 --> ark:/13960/t34188734 |pages=349–352 |url=https://archive.org/details/byte-magazine-1989-12/page/n382/mode/1up |access-date=2022-01-06}} (NB. Uses the term ''Occam transpiler'' as a synonym for a [[source-to-source compiler]] working as a [[pre-processor]] that takes a normal [[occam (programming language)|occam]] program as input and derives a new occam source code as output with link-to-channel assignments etc. added to it thereby ''[[computer configuration|configuring]]'' it for [[parallel processing (computing)|parallel processing]] to run as efficient as possible on a network of [[transputer]]s.)
 
{{Compiler optimizations}}
 
{{DEFAULTSORT:Automatic Parallelization}}
<!--Categories-->
[[Category:Articles with example Fortran code]]
[[Category:Compiler optimizations]]
[[Category:Parallel computing]]