[[File:Parallel External Memory Model PEM.png|thumb|400x400px|PEM Model]]
= Parallel external memory (model) =
In computer science, a '''parallel external memory (PEM) model''' is a [[Cache-aware model|cache-aware]], external-memory [[abstract machine]].<ref name=":0">{{Cite book|last1=Arge|first1=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|chapter=Fundamental parallel algorithms for private-cache chip multiprocessors|title=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures|date=2008|pages=197–206|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739|s2cid=11067041}}</ref> It is the parallel-computing analogue of the single-processor [[External memory algorithm|external memory]] (EM) model and, likewise, the cache-aware analogue of the [[parallel random-access machine]] (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.
 
__TOC__
 
== Model ==
=== Definition ===
The parallel external memory (PEM) model<ref name=":0" /> is a combination of the [[External memory algorithm|external memory]] (EM) model and the [[parallel random-access machine]] (PRAM) model. It is a computation model which consists of <math>P</math> processors and a two-level [[memory hierarchy]]. This memory hierarchy consists of a large [[External memory algorithm|external memory]] (main memory) of size <math>N</math> and <math>P</math> small [[Cache (computing)|internal memories (caches)]]. The processors share the main memory, but each cache is exclusive to a single processor; a processor cannot access another processor's cache. Each cache has size <math>M</math> and is partitioned into blocks of size <math>B</math>. The processors can only perform operations on data that reside in their caches. Data are transferred between the main memory and the caches in blocks of size <math>B</math>.
 
=== I/O complexity ===
The [[Programming complexity|complexity measure]] of the PEM model is the I/O complexity,<ref name=":0" /> which counts the number of parallel block transfers between the main memory and the caches. During a parallel block transfer, each processor can transfer one block. So if <math>P</math> processors each load a block of size <math>B</math> from the main memory into their caches in parallel, this counts as an I/O complexity of <math>O(1)</math>, not <math>O(P)</math>. A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data already in the caches.
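This accounting rule can be made concrete with a short simulation. The following Python sketch is purely illustrative (the class and its names do not come from the PEM literature): it counts parallel I/O rounds rather than individual block transfers.

<syntaxhighlight lang="python">
# A minimal sketch of PEM I/O accounting (illustrative names, not from the
# original paper): block transfers issued in the same parallel round count
# once, no matter how many of the P processors take part.

class PEMCostModel:
    """Counts parallel I/O rounds instead of individual block transfers."""

    def __init__(self, num_processors, block_size):
        self.P = num_processors   # number of processors
        self.B = block_size       # words per block
        self.parallel_ios = 0     # the I/O complexity measure

    def parallel_round(self, blocks_per_processor=1):
        """One synchronous round in which every processor may move up to
        `blocks_per_processor` blocks between its cache and main memory."""
        self.parallel_ios += blocks_per_processor


cost = PEMCostModel(num_processors=8, block_size=4)
cost.parallel_round()             # all 8 processors each load one block ...
assert cost.parallel_ios == 1     # ... yet this counts as O(1), not O(P)
</syntaxhighlight>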
 
=== Read/write conflicts ===
In the PEM model, there is no [[Computer network|direct communication network]] between the <math>P</math> processors; the processors have to communicate indirectly through the main memory. If multiple processors try to access the same block in main memory concurrently, read/write conflicts<ref name=":0" /> occur. As in the [[parallel random-access machine|PRAM model]], three different variations of this problem are considered:
* Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
* Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
* Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.
The following two approaches<ref name=":0" /> resolve the CREW and EREW problem when <math>P \leq B</math> processors write to the same block simultaneously. The first approach is to serialize the write operations: the processors write to the block one after the other, which results in a total of <math>P</math> parallel block transfers. The second approach needs <math>O(\log(P))</math> parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a [[Reduce (parallel pattern)|binary tree fashion]] and gradually combine the data into a single block: in the first round, <math>P</math> processors combine their blocks into <math>P/2</math> blocks; then <math>P/2</math> processors combine the <math>P/2</math> blocks into <math>P/4</math>; and so on, until all the data is combined in one block.
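The tree-shaped combining can be sketched as follows. In this illustrative Python simulation, plain lists stand in for memory blocks and each loop iteration corresponds to one parallel round of block transfers.

<syntaxhighlight lang="python">
# A minimal sketch of the binary-tree write combining (illustrative; Python
# lists stand in for memory blocks): P blocks are merged into one in
# ceil(log2(P)) parallel rounds instead of P serialized writes.

import math

def combine_blocks(blocks, merge):
    """Combine the processors' blocks into a single block; each iteration
    of the loop is one parallel round of block transfers."""
    rounds = 0
    while len(blocks) > 1:
        # In one round, processor i merges blocks 2i and 2i+1.
        blocks = [merge(blocks[i], blocks[i + 1]) if i + 1 < len(blocks)
                  else blocks[i]
                  for i in range(0, len(blocks), 2)]
        rounds += 1
    return blocks[0], rounds


P = 8
blocks = [[i] for i in range(P)]          # one private block per processor
combined, rounds = combine_blocks(blocks, lambda a, b: a + b)
assert sorted(combined) == list(range(P))
assert rounds == math.ceil(math.log2(P))  # 3 rounds for P = 8
</syntaxhighlight>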
 
=== Comparison to other models ===
{| class="wikitable"
|+
!Model
!Multi-core
!Cache-aware
|-
|[[Random-access machine]] (RAM)
|No
|No
|-
|[[Parallel random-access machine]] (PRAM)
|Yes
|No
|-
|[[External memory algorithm|External memory]] (EM)
|No
|Yes
|-
|'''Parallel external memory (PEM)'''
|Yes
|Yes
|}
 
== Examples ==
 
=== Prefix sum ===
Let <math>A</math> be an ordered set of <math>N</math> elements. The [[prefix sum]] of <math>A</math> is an ordered set <math>B</math> of <math>N</math> elements, with <math display="inline">B[i]=\sum_{j=0}^i A[j]</math> and <math display="inline">0 \leq i < N</math>. If the input set <math>A</math> is located in contiguous main memory, the prefix sum of <math>A</math> can be calculated in the PEM model with the optimal <math>O\left(\frac{N}{PB} + \log P\right)</math> I/O complexity.<ref name=":0" /> This optimal I/O complexity can be accomplished by simulating an optimal [[parallel random-access machine|PRAM]] prefix sum algorithm in the PEM model.<ref name=":0" />
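The structure of such a simulation can be sketched as follows. This illustrative Python version assumes that <math>P</math> divides <math>N</math> and uses in-memory lists in place of external memory; it mirrors the three phases of the textbook parallel prefix sum rather than the paper's exact construction.

<syntaxhighlight lang="python">
# A minimal sketch of the PEM prefix sum (sequential simulation; assumes P
# divides N): local sums per segment, a combine step over the P partial
# sums, then local rewrites of every segment.

from itertools import accumulate

def pem_prefix_sum(A, P):
    """Inclusive prefix sum of A, organised as P virtual processors would
    compute it in the PEM model."""
    N = len(A)
    seg = N // P                  # each processor owns a contiguous segment

    # Phase 1: every processor sums its own segment (O(N/(PB)) I/Os each).
    partial = [sum(A[i * seg:(i + 1) * seg]) for i in range(P)]

    # Phase 2: exclusive prefix over the P partial sums; in the PEM model a
    # tree-based PRAM simulation does this in O(log P) parallel I/Os.
    offsets = [0] + list(accumulate(partial))[:-1]

    # Phase 3: every processor rewrites its segment, shifted by its offset.
    B = []
    for i in range(P):
        B.extend(x + offsets[i]
                 for x in accumulate(A[i * seg:(i + 1) * seg]))
    return B


A = list(range(1, 17))
assert pem_prefix_sum(A, P=4) == list(accumulate(A))
</syntaxhighlight>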
 
=== Multiway partitioning ===
Let <math>M=\{m_1,...,m_{d-1}\}</math> be a vector of <math>d-1</math> pivots sorted in increasing order, and let {{mvar|A}} be an unordered set of <math>N</math> elements. A d-way partition<ref name=":0" /> of {{mvar|A}} is a set <math>\Pi=\{A_1,...,A_d\}</math>, where <math>\cup_{i=1}^d A_i = A</math> and <math>A_i\cap A_j=\emptyset</math> for <math>1\leq i<j\leq d</math>. <math>A_i</math> is called the i-th bucket; its elements are greater than <math>m_{i-1}</math> and at most <math>m_i</math>. In the following algorithm,<ref name=":0" /> the input is partitioned into N/P-sized contiguous segments <math>S_1,...,S_P</math> in main memory, and processor i primarily works on the segment <math>S_i</math>. The multiway partitioning algorithm (<code>PEM_DIST_SORT</code><ref name=":0" />) uses a PEM [[prefix sum]] algorithm<ref name=":0" /> to calculate the prefix sum with the optimal <math>O\left(\frac{N}{PB} + \log P\right)</math> I/O complexity; this prefix sum algorithm simulates an optimal PRAM prefix sum algorithm.
// Compute a d-way partition on the data segments <math>S_i</math> in parallel
'''for each''' processor i '''in parallel do'''
Read the vector of pivots {{mvar|M}} into the cache.
Partition <math>S_i</math> into d buckets and let vector <math>M_i=\{j_1^i,...,j_d^i\}</math> be the number of items in each bucket.
'''end for'''
Run PEM prefix sum on the set of vectors <math>\{M_1,...,M_P\}</math> simultaneously.
// Use the prefix sum vector to compute the final partition
'''for each''' processor i '''in parallel do'''
Write the elements of <math>S_i</math> into memory locations offset appropriately by <math>M_{i-1}</math> and <math>M_{i}</math>.
'''end for'''
Using the prefix sums stored in <math>M_P</math>, the last processor P calculates the vector {{mvar|B}} of bucket sizes and returns it.

If the vector of <math>d=O\left(\frac{M}{B}\right)</math> pivots <math>M</math> and the input set <math>A</math> are located in contiguous memory, the d-way partitioning problem can be solved in the PEM model with an I/O complexity of <math>O\left(\frac{N}{PB} + \left\lceil \frac{d}{B} \right\rceil \log P + d\log B\right)</math>. The content of the final buckets has to be located in contiguous memory.
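The histogram-and-prefix-sum structure of the algorithm can be simulated sequentially as below. The function name and the bucket convention (bucket <math>j</math> holds the elements greater than <math>m_{j-1}</math> and at most <math>m_j</math>) follow the definitions above; everything else is an illustrative stand-in, not the paper's exact construction.

<syntaxhighlight lang="python">
# A minimal sketch of d-way multiway partitioning (sequential simulation):
# per-processor histograms, prefix sums to compute write offsets, and a
# final write phase that leaves the buckets contiguous in memory.

from bisect import bisect_left

def pem_multipartition(A, pivots, P):
    """Partition A around `pivots`; returns (partitioned list, bucket sizes).
    The segments S_i are the P contiguous slices of A."""
    N, d = len(A), len(pivots) + 1
    seg = (N + P - 1) // P
    segments = [A[i * seg:(i + 1) * seg] for i in range(P)]

    # Each processor i counts how many of its items fall into each bucket.
    hist = [[0] * d for _ in range(P)]
    for i, S in enumerate(segments):
        for x in S:
            hist[i][bisect_left(pivots, x)] += 1

    # Prefix sums over the histograms give each (processor, bucket) pair
    # its write offset -- this is the PEM prefix sum step.
    bucket_sizes = [sum(hist[i][j] for i in range(P)) for j in range(d)]
    bucket_starts = [sum(bucket_sizes[:j]) for j in range(d)]
    offsets = [[bucket_starts[j] + sum(hist[k][j] for k in range(i))
                for j in range(d)] for i in range(P)]

    # Every processor writes its segment to its precomputed offsets.
    out = [None] * N
    for i, S in enumerate(segments):
        cursor = offsets[i][:]
        for x in S:
            j = bisect_left(pivots, x)
            out[cursor[j]] = x
            cursor[j] += 1
    return out, bucket_sizes


out, sizes = pem_multipartition([5, 2, 9, 1, 7, 3, 8, 4], [4, 7], P=2)
assert out == [2, 1, 3, 4, 5, 7, 9, 8] and sizes == [4, 2, 2]
</syntaxhighlight>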
 
=== Selection ===
The [[selection problem]] is about finding the k-th smallest item in an unordered list {{mvar|A}} of size {{mvar|N}}.
The following code<ref name=":0" /> makes use of <code>PRAMSORT</code>, a PRAM-optimal sorting algorithm that runs in <math>O(\log N)</math> time, and <code>SELECT</code>, a cache-optimal single-processor selection algorithm.
'''if''' <math>N \leq P</math> '''then'''
<math>\texttt{PRAMSORT}(A,P)</math>
'''return''' <math>A[k]</math>
'''end if'''
'''for each''' processor {{mvar|i}} '''in parallel do'''
// Find the median of each <math>S_i</math>
<math>m_i = \texttt{SELECT}(S_i, \frac{N}{2P}) </math>
'''end for'''
// Sort the medians
<math>\texttt{PRAMSORT}(\{m_1,...,m_P\}, P)</math>
// Partition <math>A</math> around the median of medians
<math>t = \texttt{PEMPARTITION}(A, m_{P/2}, P)</math>
'''if''' <math>k \leq t</math> '''then'''
'''return''' <math>\texttt{PEMSELECT}(A[1:t], P, k)</math>
'''else'''
'''return''' <math>\texttt{PEMSELECT}(A[t+1:N], P, k-t)</math>
'''end if'''
 
Under the assumption that the input is stored in contiguous memory, <code>PEMSELECT</code> has an I/O complexity of:
 
: <math>O\left(\frac{N}{PB} + \log(PB) \cdot \log\frac{N}{P}\right)</math>
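A sequential Python rendering of the recursion is sketched below. Here <code>median_low</code> stands in for the <code>SELECT</code> and median-of-medians subroutines, and the three-way split around the pivot is a small deviation from the pseudocode above that keeps the sketch correct in the presence of duplicate values.

<syntaxhighlight lang="python">
# A minimal sketch of PEMSELECT (sequential simulation, illustrative
# helpers): medians of the P segments give a pivot, and the recursion
# continues on the side of the partition that contains the k-th element.

from statistics import median_low

def pem_select(A, P, k):
    """Return the k-th smallest element of A (1-indexed)."""
    N = len(A)
    if N <= P:                        # base case: sort and read off A[k]
        return sorted(A)[k - 1]

    seg = (N + P - 1) // P
    # Each processor selects the median of its contiguous segment S_i.
    medians = [median_low(A[i * seg:(i + 1) * seg])
               for i in range(P) if A[i * seg:(i + 1) * seg]]
    pivot = median_low(medians)       # median of medians

    # Three-way partition around the pivot (robust to duplicates),
    # then recurse on the part that contains the k-th element.
    less = [x for x in A if x < pivot]
    equal = sum(1 for x in A if x == pivot)
    if k <= len(less):
        return pem_select(less, P, k)
    if k <= len(less) + equal:
        return pivot
    return pem_select([x for x in A if x > pivot], P, k - len(less) - equal)


assert pem_select([9, 1, 8, 2, 7, 3, 6, 4, 5], P=3, k=4) == 4
</syntaxhighlight>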
 
=== Distribution sort ===
[[Distribution sort]] partitions an input list {{mvar|A}} of size {{mvar|N}} into {{mvar|d}} disjoint buckets of similar size. Every bucket is then sorted recursively, and the results are combined into a fully sorted list.
 
If <math>P = 1</math>, the task is delegated to a cache-optimal single-processor sorting algorithm.

Otherwise, the following algorithm<ref name=":0" /> is used:
// Sample <math>\tfrac{4N}{\sqrt{d}}</math> elements from {{mvar|A}}
'''for each''' processor {{mvar|i}} '''in parallel do'''
'''if''' <math>M < |S_i|</math> '''then'''
<math>d = M/B</math>
Load <math>S_i</math> in {{mvar|M}}-sized pages and sort pages individually
'''else'''
<math>d = |S_i|</math>
Load and sort <math>S_i</math> as single page
'''end if'''
Pick every <math>\sqrt{d}/4</math>'th element from each sorted memory page into contiguous vector <math>R^i</math> of samples
'''end for'''
'''in parallel do'''
Combine vectors <math>R^1 \dots R^P</math> into a single contiguous vector <math>\mathcal{R}</math>
Make <math>\sqrt{d}</math> copies of <math>\mathcal{R}</math>: <math>\mathcal{R}_1 \dots \mathcal{R}_{\sqrt{d}}</math>
'''end do'''
// Find <math>\sqrt{d}</math> pivots <math>\mathcal{M}[j]</math>
'''for''' <math>j = 1</math> to <math>\sqrt{d}</math> '''in parallel do'''
<math>\mathcal{M}[j] = \texttt{PEMSELECT}(\mathcal{R}_j, \tfrac{P}{\sqrt{d}}, \tfrac{j \cdot 4N}{d})</math>
'''end for'''
Pack pivots in contiguous array <math>\mathcal{M}</math>
// Partition {{mvar|A}} around pivots into buckets <math>\mathcal{B}</math>
<math>\mathcal{B} = \texttt{PEMMULTIPARTITION}(A[1:N],\mathcal{M},\sqrt{d},P)</math>
// Recursively sort buckets
'''for''' <math>j = 1</math> to <math>\sqrt{d} + 1</math> '''in parallel do'''
recursively call <math>\texttt{PEMDISTSORT}</math> on bucket {{mvar|j}} of size <math>\mathcal{B}[j]</math>
using <math>O \left( \left \lceil \tfrac{\mathcal{B}[j]}{N / P} \right \rceil \right)</math> processors responsible for elements in bucket {{mvar|j}}
'''end for'''
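The overall recursion can also be summarized in a compact sequential skeleton. In this sketch, the sampling rule, the choice of the number of pivots, and the fallback for degenerate splits are simplifications rather than the paper's exact construction, and a direct bucketing step stands in for the <code>PEMMULTIPARTITION</code> call.

<syntaxhighlight lang="python">
# A minimal skeleton of the distribution sort recursion (illustrative
# simplifications throughout): sample, pick ~sqrt(d) pivots, partition into
# buckets, and recursively sort each bucket with proportionally many
# processors.

import math
import random
from bisect import bisect_left

def pem_dist_sort(A, P):
    """Distribution sort skeleton in the spirit of PEMDISTSORT."""
    if P == 1 or len(A) <= 8:
        return sorted(A)             # cache-optimal single-processor sort

    k = max(1, math.isqrt(P))        # number of pivots, ~ sqrt(d)
    sample = sorted(random.sample(A, min(len(A), 4 * k)))
    pivots = sorted({sample[max(0, (j + 1) * len(sample) // (k + 1) - 1)]
                     for j in range(k)})

    # Stand-in for PEMMULTIPARTITION: bucket j holds m_{j-1} < x <= m_j.
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in A:
        buckets[bisect_left(pivots, x)].append(x)

    out = []
    for bucket in buckets:           # recursively sort each bucket
        procs = max(1, math.ceil(len(bucket) / max(1, len(A) // P)))
        if len(bucket) == len(A):    # degenerate split: fall back to sorting
            out.extend(sorted(bucket))
        else:
            out.extend(pem_dist_sort(bucket, procs))
    return out


data = [random.randrange(1000) for _ in range(200)]
assert pem_dist_sort(data, P=8) == sorted(data)
</syntaxhighlight>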
 
The I/O complexity of <code>PEMDISTSORT</code> is:
 
:<math>O \left( \left \lceil \frac{N}{PB} \right \rceil \left ( \log_d P + \log_{M/B} \frac{N}{PB} \right ) + f(N,P,d) \cdot \log_d P \right)</math>
 
where
 
:<math>f(N,P,d) = O \left ( \log \frac{PB}{\sqrt{d}} \log \frac{N}{P} + \left \lceil \frac{\sqrt{d}}{B} \log P + \sqrt{d} \log B \right \rceil \right )</math>
 
If the number of processors is chosen such that <math>f(N,P,d) = O\left ( \left \lceil \tfrac{N}{PB} \right \rceil \right )</math> and <math>M < B^{O(1)}</math>, the I/O complexity is then:

:<math>O\left(\frac{N}{PB} \log_{\frac{M}{B}} \frac{N}{B}\right)</math>

== Other PEM algorithms ==
{| class="wikitable"
!PEM Algorithm
!I/O complexity
!Constraints
|-
![[Merge sort|Multiway merge sort]]<ref name=":0" />
|<math>O\left(\frac{N}{PB} \log_{\frac{M}{B}} \frac{N}{B}\right) = \textrm{sort}_P(N)</math>
|<math>P \leq \frac{N}{B^2}, M = B^{O(1)}</math>
|-
![[List ranking]]<ref name=":1">{{Cite book|last1=Arge|first1=Lars|last2=Goodrich|first2=Michael T.|last3=Sitchinava|first3=Nodari|chapter=Parallel external memory graph algorithms|title=2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)|date=2010|pages=1–11|publisher=IEEE|doi=10.1109/ipdps.2010.5470440|isbn=9781424464425|s2cid=587572}}</ref>
|<math>O \left ( \textrm{sort}_P(N) \right )</math>
|<math>P \leq \frac{N/B^2}{\log B \cdot \log^{O(1)} N}, M = B^{O(1)}</math>
|-
![[Euler tour technique|Euler tour]]<ref name=":1" />
|<math>O \left ( \textrm{sort}_P(N) \right )</math>
|<math>P \leq \frac{N}{B^2}, M = B^{O(1)}</math>
|-
!Expression tree evaluation<ref name=":1" />
|<math>O \left ( \textrm{sort}_P(N) \right )</math>
|<math>P \leq \frac{N}{B^2 \log B \cdot \log^{O(1)} N}, M = B^{O(1)}</math>
|-
!Finding a [[minimum spanning tree]] (MST)<ref name=":1" />
|<math>O \left ( \textrm{sort}_P(|V|) + \textrm{sort}_P(|E|) \log \frac{|V|}{PB} \right )</math>
|<math>P \leq \frac{|V|+|E|}{B^2 \log B \cdot \log^{O(1)} N}, M = B^{O(1)}</math>
|}
Here, <math>\textrm{sort}_P(N)</math> denotes the time it takes to sort {{mvar|N}} items with {{mvar|P}} processors in the PEM model.
 
== See also ==
 
* [[Parallel random-access machine]] (PRAM)
* [[Random-access machine]] (RAM)
* [[External memory algorithm|External memory]] (EM)
 
==References==
{{reflist}}
 
{{Parallel Computing}}
 
[[Category:Algorithms]]
[[Category:Models of computation]]
[[Category:Analysis of parallel algorithms]]
[[Category:External memory algorithms]]
[[Category:Cache (computing)]]