{{short description|Subfield of computer science}}
In [[computer science]], '''analysis of parallel algorithms''' is the process of finding the [[computational complexity]] of [[algorithm]]s executed in [[Parallel computing|parallel]] – the amount of time, storage, or other [[System resource|resources]] needed to execute them. In many respects, analysis of [[parallel algorithm]]s is similar to [[analysis of algorithms|the analysis]] of [[sequential algorithm]]s, but is generally more involved because one must reason about the behavior of multiple cooperating [[Thread (computing)|threads]] of execution. One of the primary goals of parallel analysis is to understand how a parallel algorithm's use of resources (speed, space, etc.) changes as the number of processors varies.
==Background==
A so-called work-time (WT) framework (sometimes called work-depth or work-span) was originally introduced by Shiloach and Vishkin<ref name="shiloach">{{cite journal | last1 = Shiloach | first1 = Yossi
| last2 = Vishkin | first2 = Uzi
| year = 1982
| title = An ''O''(''n''<sup>2</sup> log ''n'') parallel max-flow algorithm
| journal = Journal of Algorithms
| volume = 3
| issue =2
| pages = 128–146
| doi =10.1016/0196-6774(82)90013-X }}</ref>
for conceptualizing and describing parallel algorithms.
In the WT framework, a parallel algorithm is first described in terms of parallel rounds. For each round, the operations to be performed are characterized, but several issues can be suppressed. For example, the number of operations at each round need not be clear, processors need not be mentioned, and any information that may help with the assignment of processors to jobs need not be accounted for. In a second step, the suppressed information is provided. The inclusion of the suppressed information is guided by the proof of a scheduling theorem due to Brent,<ref name="brent">{{Cite journal|last=Brent|first=Richard P.|date=1974-04-01|title=The Parallel Evaluation of General Arithmetic Expressions|journal=Journal of the ACM|volume=21|issue=2|pages=201–206|doi=10.1145/321812.321815|issn=0004-5411|citeseerx=10.1.1.100.9361|s2cid=16416106}}</ref> which is explained later in this article. The WT framework is useful because, while it can greatly simplify the initial description of a parallel algorithm, inserting the details suppressed by that initial description is often not difficult. For example, the WT framework was adopted as the basic presentation framework in the books on parallel algorithms (for the [[parallel random-access machine]] PRAM model)
<ref name="jaja">{{cite book
| first = Joseph
| last = JaJa
| title = An Introduction to Parallel Algorithms
| edition =
| publisher = Addison-Wesley
| year = 1992
| isbn = 978-0-201-54856-3
| title-link = An Introduction to Parallel Algorithms
}}</ref>
and<ref name="kkt">{{cite book
| first1 = Jorg
| last1 = Keller
| last2 = Kessler
| first2 = Cristoph W.
| last3 = Traeff
| first3 = Jesper L.
| title = Practical PRAM Programming
| edition =
| publisher = Wiley-Interscience
| year = 2001
| isbn = 978-0-471-35351-5
| title-link = Practical PRAM Programming
}}</ref>
as well as in the class notes.<ref name="uv">{{cite book
| authorlink = Uzi Vishkin
| first = Uzi
| last = Vishkin
| title = Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 104 pages
| edition =
| publisher = Class notes of courses on parallel algorithms taught since 1992 at the University of Maryland, College Park, Tel Aviv University and the Technion
| year = 2009
| isbn =
| url=http://www.umiacs.umd.edu/users/vishkin/PUBLICATIONS/classnotes.pdf
}}</ref> The overview below explains how the WT framework can be used for analyzing more general parallel algorithms, even when their description is not available within the WT framework.
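As a simple illustration of the WT framework (not drawn from the sources above), consider summing {{mvar|n}} numbers. In the first step, the algorithm is described round by round: each round adds the surviving values in disjoint pairs, and all additions within a round are independent, giving {{math|⌈log<sub>2</sub> ''n''⌉}} rounds and {{math|''n'' − 1}} operations overall. The following minimal Python sketch simulates these rounds sequentially and counts work and span; the function name and encoding are hypothetical.

<syntaxhighlight lang="python">
# Illustrative sketch: a WT-style round-by-round description of parallel
# summation. Each pass of the while loop is one parallel round; the
# additions inside a round are independent, so on a PRAM they could all
# execute simultaneously.
def wt_sum(values):
    """Sum `values` by rounds of pairwise addition.

    Returns the total, the work (number of additions) and the
    span/depth (number of parallel rounds).
    """
    vals = list(values)
    work = 0
    span = 0
    while len(vals) > 1:
        span += 1  # one parallel round
        nxt = []
        for i in range(0, len(vals) - 1, 2):
            nxt.append(vals[i] + vals[i + 1])  # independent additions
            work += 1
        if len(vals) % 2 == 1:
            nxt.append(vals[-1])  # odd element carried into the next round
        vals = nxt
    return vals[0], work, span

print(wt_sum(range(1, 9)))  # prints (36, 7, 3): work n - 1 = 7, span log2(8) = 3
</syntaxhighlight>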
==Definitions==
Suppose computations are executed on a PRAM machine using {{mvar|p}} processors. Let {{mvar|T<sub>p</sub>}} denote the time that expires between the start of the computation and its end. Analysis of the computation's [[Time complexity|running time]] focuses on the following notions:
* The ''work'' of a computation executed by {{mvar|p}} processors is the total number of primitive operations that the processors perform.<ref name="casanova">{{cite book |title=Parallel Algorithms |first1=Henri |last1=Casanova |first2=Arnaud |last2=Legrand |first3=Yves |last3=Robert |publisher=CRC Press |year=2008 |pages=10 |citeseerx=10.1.1.466.8142}}</ref> Ignoring communication overhead from synchronizing the processors, this is equal to the time used to run the computation on a single processor, denoted {{math|''T''<sub>1</sub>}}.
* The ''depth'' or ''span'' is the length of the longest series of operations that have to be performed sequentially due to [[data dependency|data dependencies]] (the ''{{visible anchor|Critical path|text=critical path}}''). The depth may also be called the ''critical path length'' of the computation.<ref name="cacm">{{cite journal |first=Guy |last=Blelloch |title=Programming Parallel Algorithms |journal=[[Communications of the ACM]] |volume=39 |issue=3 |year=1996 |doi=10.1145/227234.227246 |url=https://www.cs.cmu.edu/afs/cs/Web/People/blelloch/papers/B85.pdf |pages=85–97}}</ref> Minimizing the depth/span is important in designing parallel algorithms, because the depth/span determines the shortest possible execution time.<ref name="spp">{{cite book |author1=Michael McCool |author2=James Reinders |author3=Arch Robison |title=Structured Parallel Programming: Patterns for Efficient Computation |publisher=Elsevier |year=2013 |pages=4–5}}</ref> Alternatively, the span can be defined as the time {{math|''T''<sub>∞</sub>}} spent computing using an idealized machine with an infinite number of processors.<ref name="clrs">{{Introduction to Algorithms|3|pages=779–784}}</ref>
* The ''cost'' of the computation is the quantity {{mvar|pT<sub>p</sub>}}. This expresses the total time spent, by all processors, in both computing and waiting.<ref name="casanova"/>
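Under these definitions, a computation can be viewed as a [[directed acyclic graph]] of unit-cost operations, in which case the work {{math|''T''<sub>1</sub>}} is the number of operations and the span {{math|''T''<sub>∞</sub>}} is the length of the longest (critical) path. A minimal Python sketch, with a made-up example DAG, illustrates both quantities:

<syntaxhighlight lang="python">
# Illustrative sketch: work and span of a computation modeled as a DAG of
# unit-cost operations. `deps` maps each operation to the operations it
# depends on; the DAG itself is a hypothetical example.
from functools import lru_cache

deps = {
    "a": [], "b": [],        # inputs, available immediately
    "c": ["a", "b"],         # c must wait for both a and b
    "d": ["c"], "e": ["c"],  # d and e can run in parallel after c
    "f": ["d", "e"],
}

work = len(deps)  # T1: total number of unit-cost operations

@lru_cache(maxsize=None)
def depth(op):
    """Length of the longest dependency chain ending at `op`."""
    return 1 + max((depth(d) for d in deps[op]), default=0)

span = max(depth(op) for op in deps)  # T_infinity: critical path length
print(work, span)  # prints 6 4: six operations, critical path a/b -> c -> d/e -> f
</syntaxhighlight>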
The work, span and running time on {{mvar|p}} processors are related by two elementary laws. The ''work law'' states that {{mvar|p}} processors can perform at most {{mvar|p}} operations per time step, so that {{math|''pT<sub>p</sub>'' ≥ ''T''<sub>1</sub>}}. The ''span law'' states that a finite number of processors cannot outperform an infinite number, so that {{math|''T<sub>p</sub>'' ≥ ''T''<sub>∞</sub>}}.
Using these definitions and laws, the following measures of performance can be given:
* ''[[Speedup]]'' is the gain in speed made by parallel execution compared to sequential execution: {{math|1=''S<sub>p</sub>'' = ''T''<sub>1</sub> ∕ ''T<sub>p</sub>''}}. By the work law, {{math|''S<sub>p</sub>'' ≤ ''p''}}; the case {{math|1=''S<sub>p</sub>'' = ''p''}} is called perfect linear speedup, and an algorithm that exhibits linear speedup is said to be ''scalable''.
* ''Efficiency'' is the speedup per processor, {{math|''S<sub>p</sub>'' ∕ ''p''}}.
* ''Parallelism'' is the ratio {{math|''T''<sub>1</sub> ∕ ''T''<sub>∞</sub>}}. It represents the maximum possible speedup on any number of processors: by the span law, if {{math|''p'' > ''T''<sub>1</sub> ∕ ''T''<sub>∞</sub>}}, then {{math|''T''<sub>1</sub> ∕ ''T<sub>p</sub>'' ≤ ''T''<sub>1</sub> ∕ ''T''<sub>∞</sub> < ''p''}}.
* The ''slackness'' is {{math|''T''<sub>1</sub> ∕ (''pT''<sub>∞</sub>)}}. A slackness less than one implies (by the span law) that perfect linear speedup on {{mvar|p}} processors is impossible.
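With illustrative numbers (not taken from the sources), a computation with work {{math|1=''T''<sub>1</sub> = 1000}}, span {{math|1=''T''<sub>∞</sub> = 10}} and a measured {{math|1=''T''<sub>8</sub> = 150}} on eight processors yields the measures above as follows:

<syntaxhighlight lang="python">
# Illustrative sketch of the performance measures, with hypothetical timings.
T1, Tinf = 1000, 10  # work and span
p, Tp = 8, 150       # processor count and measured parallel time

speedup = T1 / Tp            # S_p = T1 / Tp          -> about 6.67
efficiency = speedup / p     # S_p / p                -> about 0.83
parallelism = T1 / Tinf      # T1 / T_inf             -> 100.0
slackness = T1 / (p * Tinf)  # T1 / (p * T_inf)       -> 12.5

# Slackness 12.5 > 1, so perfect linear speedup is not ruled out here.
print(speedup, efficiency, parallelism, slackness)
</syntaxhighlight>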
==Execution on a limited number of processors==
Analysis of parallel algorithms is usually carried out under the assumption that an unbounded number of processors is available. This is unrealistic, but not a problem, since any computation that can run in parallel on {{mvar|N}} processors can be executed on {{math|''p'' < ''N''}} processors by letting each processor execute multiple units of work. A result called '''Brent's law''' states that one can perform such a "simulation" in time {{mvar|T<sub>p</sub>}}, bounded by<ref>{{cite encyclopedia |encyclopedia=Encyclopedia of Parallel Computing |year=2011 |pages=182–185 |title=Brent's Theorem |first=John L. |last=Gustafson}}</ref>
:<math>T_p \leq T_N + \frac{T_1 - T_N}{p},</math>
An alternative statement of the law bounds {{mvar|T<sub>p</sub>}} above and below by
:<math>\frac{T_1}{p} \leq T_p \leq \frac{T_1}{p} + T_\infty,</math>
showing that the span {{math|''T''<sub>∞</sub>}} and the work {{math|''T''<sub>1</sub>}} together provide reasonable bounds on the computation time.
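The two-sided bound is easy to check numerically. A minimal sketch using the hypothetical work and span values from the previous example:

<syntaxhighlight lang="python">
# Illustrative sketch: lower and upper bounds on Tp from the alternative
# statement of Brent's law, using hypothetical work and span values.
T1, Tinf = 1000, 10  # work and span

for p in (1, 2, 8, 64, 1024):
    lower = T1 / p          # work law: Tp >= T1 / p
    upper = T1 / p + Tinf   # Brent's law: Tp <= T1 / p + T_inf
    print(f"p = {p:4d}: {lower:8.2f} <= Tp <= {upper:8.2f}")
</syntaxhighlight>

As {{mvar|p}} grows, the lower bound shrinks toward zero while the upper bound approaches the span {{math|''T''<sub>∞</sub>}}, which is why the span limits the achievable execution time.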
==References==
{{Reflist}}
{{Parallel Computing}}
[[Category:Analysis of parallel algorithms| ]]