Loop-level parallelism: Difference between revisions

Revision as of 08:48, 24 November 2019 by Citation bot (Alter: doi-broken-date. Add: url. Activated by User:Headbomb via #UCB_Headbomb)
Latest revision as of 00:27, 2 May 2024 by Pancho507 (→See also: add doacross parallelism)
(10 intermediate revisions by 6 users not shown)
Line 8:
Consider the following code operating on a list <code>L</code> of length <code>n</code>.

{{sxhl|lang=c|1=
for (int i = 0; i < n; i++) {
    S1: L[i] += 10;
}
}}

Each iteration of the loop takes the value at the current index of <code>L</code> and increments it by 10. If statement <code>S1</code> takes time <code>T</code> to execute, then the loop takes time <code>n * T</code> to execute sequentially, ignoring the time taken by loop constructs. Now consider a system with <code>p</code> processors, where <code>p > n</code>. If <code>n</code> threads run in parallel, the time to execute all <code>n</code> steps is reduced to <code>T</code>.

Line 18:
Less simple cases produce inconsistent, i.e. [[serializability|non-serializable]], outcomes. Consider the following loop operating on the same list <code>L</code>.

{{sxhl|lang=c|1=
for (int i = 1; i < n; i++) {
    S1: L[i] = L[i - 1] + 10;
}
}}

Each iteration sets the current element to the value of the previous element plus ten. When run sequentially, each iteration is guaranteed that the previous iteration has already written the correct value. With multiple threads, [[process scheduling]] and other considerations mean that the execution order cannot guarantee an iteration runs only after its dependence is met. An iteration may well run before the one it depends on, leading to unexpected results. Serializability can be restored by adding synchronization to preserve the dependence on previous iterations.
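As a rough illustration of how the first, dependence-free loop (<code>L[i] += 10</code>) might be split across threads, the following sketch assumes POSIX threads are available. The names <code>slice</code>, <code>add_ten</code> and <code>parallel_add_ten</code> are hypothetical and do not come from the cited sources. Each thread receives a contiguous block of indices, so no two threads touch the same element and no synchronization is needed beyond the final join.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical helper type: one contiguous block of L handled by one thread. */
typedef struct {
    int *L;
    int lo;            /* first index (inclusive) */
    int hi;            /* last index (exclusive) */
    int started;       /* nonzero if a thread was created for this slice */
    pthread_t thread;
} slice;

static void *add_ten(void *arg) {
    slice *s = arg;
    for (int i = s->lo; i < s->hi; i++) {
        s->L[i] += 10;  /* S1: no iteration reads what another one writes */
    }
    return NULL;
}

/* Adds 10 to every element of L using at most p threads.
 * Returns 0 on success, -1 on bad arguments or allocation failure. */
int parallel_add_ten(int *L, int n, int p) {
    if (L == NULL || n < 0 || p <= 0) return -1;
    if (n == 0) return 0;
    if (p > n) p = n;  /* more threads than elements would be idle */

    slice *slices = malloc((size_t)p * sizeof *slices);
    if (slices == NULL) return -1;

    for (int t = 0; t < p; t++) {
        slices[t].L = L;
        slices[t].lo = (int)((long long)n * t / p);
        slices[t].hi = (int)((long long)n * (t + 1) / p);
        slices[t].started =
            pthread_create(&slices[t].thread, NULL, add_ten, &slices[t]) == 0;
        if (!slices[t].started) {
            add_ten(&slices[t]);  /* fall back to the calling thread */
        }
    }
    for (int t = 0; t < p; t++) {
        if (slices[t].started) pthread_join(slices[t].thread, NULL);
    }
    free(slices);
    return 0;
}
```

The same block partitioning would not be safe for the second loop (<code>L[i] = L[i - 1] + 10</code>), because the first element of each block depends on the last element of the previous block.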
Line 28:
== Dependencies in code ==

There are several types of dependences that can be found within code.<ref name="Solihin">{{cite book|last1=Solihin|first1=Yan|title=Fundamentals of Parallel Architecture|date=2016|publisher=CRC Press|location=Boca Raton, FL|isbn=978-1-4822-1118-4}}</ref><ref>{{cite book|last1=Goff|first1=Gina|title=Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation - PLDI '91|pages=15–29|chapter=Practical dependence testing|doi=10.1145/113445.113448|year=1991|isbn=0897914287|s2cid=2357293}}</ref>

{| class="wikitable"
Line 56:
=== Example of true dependence ===

{{sxhl|lang=c|1=
S1: int a, b;
S2: a = 2;
S3: b = a + 40;
}}

<code>S2 ->T S3</code>, meaning that S3 has a true dependence on S2 because S2 writes to the variable <code>a</code>, which S3 then reads.

=== Example of anti-dependence ===

{{sxhl|lang=c|1=
S1: int a, b = 40;
S2: a = b - 38;
S3: b = -1;
}}

<code>S2 ->A S3</code>, meaning that S3 has an anti-dependence on S2 because S2 reads from the variable <code>b</code> before S3 writes to it.

=== Example of output-dependence ===

{{sxhl|lang=c|1=
S1: int a, b = 40;
S2: a = b - 38;
S3: a = 2;
}}

<code>S2 ->O S3</code>, meaning that S3 has an output dependence on S2 because both write to the variable <code>a</code>.
=== Example of input-dependence ===

{{sxhl|lang=c|1=
S1: int a, b, c = 2;
S2: a = c - 1;
S3: b = c + 1;
}}

<code>S2 ->I S3</code>, meaning that S2 and S3 have an input dependence because both read from the variable <code>c</code>.

Line 99:
In the following example code, used for swapping the values of two arrays of length n, there is a loop-independent dependence of <code>S1 ->T S3</code>.

{{sxhl|lang=c|1=
for (int i = 0; i < n; i++) {
    S1: tmp = a[i];
    S2: a[i] = b[i];
    S3: b[i] = tmp;
}
}}

In loop-carried dependence, statements in one iteration of a loop depend on statements in another iteration of the loop. Loop-carried dependence uses a modified version of the dependence notation seen earlier.

The following is an example of a loop-carried dependence <code>S1[i] ->T S1[i + 1]</code>, where <code>i</code> indicates the current iteration and <code>i + 1</code> indicates the next iteration.

{{sxhl|lang=c|1=
for (int i = 1; i < n; i++) {
    S1: a[i] = a[i - 1] + 1;
}
}}

=== Loop carried dependence graph ===
Line 130:
* DOPIPE Parallelism

Each implementation varies slightly in how threads synchronize, if at all. In addition, parallel tasks must somehow be mapped to a process. These tasks can be allocated either statically or dynamically. Research has shown that some dynamic allocation algorithms achieve better load balancing than static allocation.<ref>{{cite journal|last1=Kavi|first1=Krishna|title=Parallelization of DOALL and DOACROSS Loops-a Survey|url=https://www.researchgate.net/publication/220662641}}</ref>

The process of parallelizing a sequential program can be broken down into the following discrete steps.<ref name="Solihin" /> Each concrete loop-parallelization below implicitly performs them.
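The difference between static and dynamic allocation of loop iterations can be sketched as follows. This is only an illustration under stated assumptions: the names <code>static_range</code>, <code>work_queue</code>, <code>work_queue_init</code> and <code>next_chunk</code> are hypothetical, and the dynamic variant shown is simple self-scheduling with a shared C11 atomic counter, not a specific algorithm from the cited survey.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Static allocation: thread tid (0 <= tid < p, p > 0) always owns the same
 * block [lo, hi) of the n iterations, decided before the loop runs. */
void static_range(int n, int p, int tid, int *lo, int *hi) {
    *lo = (int)((long long)n * tid / p);
    *hi = (int)((long long)n * (tid + 1) / p);
}

/* Dynamic allocation: threads repeatedly grab the next unclaimed chunk. */
typedef struct {
    atomic_int next;  /* first iteration not yet handed out */
    int n;            /* total number of iterations */
    int chunk;        /* iterations handed out per request (must be > 0) */
} work_queue;

void work_queue_init(work_queue *q, int n, int chunk) {
    atomic_init(&q->next, 0);
    q->n = n;
    q->chunk = chunk;
}

/* Claims the next chunk [lo, hi). Returns false once all work is taken.
 * Safe to call from several threads at once. */
bool next_chunk(work_queue *q, int *lo, int *hi) {
    int start = atomic_fetch_add(&q->next, q->chunk);
    if (start >= q->n) return false;
    *lo = start;
    *hi = (q->n - start < q->chunk) ? q->n : start + q->chunk;
    return true;
}
```

With static allocation a thread that draws slow iterations finishes late while the others sit idle. With the dynamic variant each worker would loop on <code>next_chunk</code> until it returns false, so faster threads naturally take more chunks, at the cost of one atomic operation per chunk.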
Line 154:
=== DISTRIBUTED loop ===

When a loop has a loop-carried dependence, one way to parallelize it is to distribute the loop into several different loops. Statements that do not depend on each other are separated so that these distributed loops can be executed in parallel. For example, consider the following code.

{{sxhl|lang=c|1=
for (int i = 1; i < n; i++) {
    S1: a[i] = a[i - 1] + b[i];
    S2: c[i] += d[i];
}
}}

The loop has a loop-carried dependence <code>S1[i] ->T S1[i + 1]</code>, but S1 and S2 have no loop-independent dependence, so the code can be rewritten as follows.

{{sxhl|lang=c|1=
loop1: for (int i = 1; i < n; i++) {
    S1: a[i] = a[i - 1] + b[i];
}
loop2: for (int i = 1; i < n; i++) {
    S2: c[i] += d[i];
}
}}

Note that loop1 and loop2 can now be executed in parallel. Instead of a single instruction being performed in parallel on different data, as in data-level parallelism, here different loops perform different tasks on different data. Let the execution times of S1 and S2 be <math>T_{S_1}</math> and <math>T_{S_2}</math>. The sequential form of the code above then takes <math>n(T_{S_1}+T_{S_2})</math>. Because the two statements are split into two different loops, the execution time becomes <math>nT_{S_1} + T_{S_2}</math>. This type of parallelism is called either function or task parallelism.

=== DOALL parallelism ===
Line 176:
DOALL parallelism exists when statements within a loop can be executed independently (situations where there is no loop-carried dependence).<ref name="Solihin" /> For example, the following code does not read from the array <code>a</code> and does not update the arrays <code>b, c</code>. No iteration has a dependence on any other iteration.
{{sxhl|lang=c|1=
for (int i = 0; i < n; i++) {
    S1: a[i] = b[i] + c[i];
}
}}

Let the time of one execution of S1 be <math>T_{S_1}</math>. The sequential form of the code above then takes <math>nT_{S_1}</math>. Because DOALL parallelism exists when all iterations are independent, speed-up may be achieved by executing all iterations in parallel, which gives an execution time of <math>T_{S_1}</math>, the time taken for one iteration in sequential execution.

The following example, using simplified pseudocode, shows how a loop might be parallelized to execute each iteration independently.

{{sxhl|lang=c|1=
begin_parallelism();
for (int i = 0; i < n; i++) {
    S1: a[i] = b[i] + c[i];
    end_parallelism();
}
block();
}}

=== DOACROSS parallelism ===

DOACROSS parallelism exists where iterations of a loop are parallelized by extracting calculations that can be performed independently and running them simultaneously.<ref>{{citation|last1=Unnikrishnan|first1=Priya|title=Euro-Par 2012 Parallel Processing|volume=7484|pages=219–231|doi=10.1007/978-3-642-32820-6_23|series=Lecture Notes in Computer Science|year=2012|isbn=978-3-642-32819-0|chapter=A Practical Approach to DOACROSS Parallelization|s2cid=18571258|doi-access=free}}</ref>

Synchronization exists to enforce loop-carried dependence.

Consider the following synchronous loop with dependence <code>S1[i] ->T S1[i + 1]</code>.
{{sxhl|lang=c|1=
for (int i = 1; i < n; i++) {
    a[i] = a[i - 1] + b[i] + 1;
}
}}

Each loop iteration performs two actions:
* Calculate <code>a[i - 1] + b[i] + 1</code>
* Assign the value to <code>a[i]</code>

Calculating the value <code>a[i - 1] + b[i] + 1</code> and then performing the assignment can be decomposed into two lines (statements S1 and S2):

{{sxhl|lang=c|1=
S1: int tmp = b[i] + 1;
S2: a[i] = a[i - 1] + tmp;
}}

The first line, <code>int tmp = b[i] + 1;</code>, has no loop-carried dependence. The loop can then be parallelized by computing the temporary value in parallel and then synchronizing the assignment to <code>a[i]</code>.

{{sxhl|lang=c|1=
post(0);
for (int i = 1; i < n; i++) {
    S1: int tmp = b[i] + 1;
        wait(i - 1);
    S2: a[i] = a[i - 1] + tmp;
        post(i);
}
}}

Let the execution times of S1 and S2 be <math>T_{S_1}</math> and <math>T_{S_2}</math>. The sequential form of the code above then takes <math>n(T_{S_1}+T_{S_2})</math>. Because DOACROSS parallelism exists, speed-up may be achieved by executing iterations in a pipelined fashion, which gives an execution time of <math>T_{S_1} + nT_{S_2}</math>.
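The <code>post</code>/<code>wait</code> pseudocode above could be realized in several ways. The following is one minimal sketch, assuming POSIX threads and C11 atomics. The names <code>doacross_run</code>, <code>doacross_arg</code>, <code>worker</code>, <code>wait_for</code> (used instead of <code>wait</code> to avoid clashing with the POSIX function of that name) and the constant <code>NUM_THREADS</code> are hypothetical. Iterations are dealt out to threads cyclically, and one flag per iteration records whether <code>a[i]</code> is final.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NUM_THREADS 4  /* hypothetical thread count */

typedef struct {
    int tid;           /* index of this worker, 0 .. NUM_THREADS - 1 */
    int n;             /* array length */
    int *a;
    const int *b;
    atomic_int *done;  /* done[i] is nonzero once a[i] holds its final value */
} doacross_arg;

static void post(atomic_int *done, int i) {
    /* Release: the write to a[i] becomes visible before the flag does. */
    atomic_store_explicit(&done[i], 1, memory_order_release);
}

static void wait_for(atomic_int *done, int i) {
    while (!atomic_load_explicit(&done[i], memory_order_acquire)) {
        sched_yield();  /* let another thread run instead of burning the core */
    }
}

static void *worker(void *p) {
    doacross_arg *arg = p;
    /* Cyclic assignment: thread t handles i = t + 1, t + 1 + NUM_THREADS, ... */
    for (int i = 1 + arg->tid; i < arg->n; i += NUM_THREADS) {
        int tmp = arg->b[i] + 1;          /* S1: independent, runs in parallel */
        wait_for(arg->done, i - 1);       /* enforce S2[i - 1] -> S2[i] */
        arg->a[i] = arg->a[i - 1] + tmp;  /* S2: serialized by the flags */
        post(arg->done, i);
    }
    return NULL;
}

/* Computes a[i] = a[i - 1] + b[i] + 1 for i = 1 .. n - 1.
 * Returns 0 on success, -1 on bad arguments or allocation failure. */
int doacross_run(int *a, const int *b, int n) {
    if (a == NULL || b == NULL || n < 1) return -1;

    atomic_int *done = malloc((size_t)n * sizeof *done);
    if (done == NULL) return -1;
    for (int i = 0; i < n; i++) atomic_init(&done[i], 0);
    post(done, 0);  /* a[0] is an input, so it is ready from the start */

    pthread_t threads[NUM_THREADS];
    doacross_arg args[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++) {
        args[t] = (doacross_arg){ t, n, a, b, done };
        if (pthread_create(&threads[t], NULL, worker, &args[t]) != 0) {
            abort();  /* a missing worker would leave the others waiting forever */
        }
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
    }
    free(done);
    return 0;
}
```

Because each thread processes its own iterations in increasing order, the lowest unfinished iteration always has its predecessor completed and its owner ready to run it, so the waits cannot deadlock. In this sketch S1 is so cheap that synchronization overhead would likely outweigh any gain; the structure pays off only when the independent part of each iteration is substantial.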
=== DOPIPE parallelism ===
Line 238 ⟶ 239:
DOPIPE parallelism implements pipelined parallelism for loop-carried dependence, where a loop iteration is distributed over multiple, synchronized loops.<ref name="Solihin" /> The goal of DOPIPE is to act like an assembly line, where one stage is started as soon as there is sufficient data available for it from the previous stage.<ref>{{cite web|title=DoPipe: An Effective Approach to Parallelize Simulation|url=https://software.intel.com/sites/default/files/m/a/a/7/d/6/12758-MC_Forum_Zangbinyu_dopipe.pdf|website=Intel|accessdate=13 September 2016}}</ref>

Consider the following synchronous code with dependence <code>S1[i] ->T S1[i + 1]</code>.

{{sxhl|lang=c|1=
for (int i = 1; i < n; i++) {
    S1: a[i] = a[i - 1] + b[i];
    S2: c[i] += a[i];
}
}}

S1 must be executed sequentially, but S2 has no loop-carried dependence. S2 could be executed in parallel using DOALL parallelism after all the calculations needed by S1 have been performed in series. However, the speedup is limited if this is done. A better approach is to parallelize such that the S2 corresponding to each S1 executes as soon as that S1 is finished.

Line 251 ⟶ 252:
Implementing pipelined parallelism results in the following set of loops, where the second loop may execute for an index as soon as the first loop has finished its corresponding index.
{{sxhl|lang=c|1=
for (int i = 1; i < n; i++) {
    S1: a[i] = a[i - 1] + b[i];
        post(i);
}

for (int i = 1; i < n; i++) {
        wait(i);
    S2: c[i] += a[i];
}
}}

Let the execution times of S1 and S2 be <math>T_{S_1}</math> and <math>T_{S_2}</math>. The sequential form of the code above then takes <math>n(T_{S_1}+T_{S_2})</math>. Because DOPIPE parallelism exists, speed-up may be achieved by executing iterations in a pipelined fashion, which gives an execution time of <math>nT_{S_1} + (n/p)T_{S_2}</math>, where {{mvar|p}} is the number of processors running in parallel.

== See also ==

* [[Data parallelism]]
* [[DOACROSS parallelism]]
* [[Task parallelism]]
* Parallelism using different types of memory models like [[Shared memory|shared]] and [[Distributed memory|distributed]] and [[Message Passing Interface|Message Passing]]