Parallel external memory
[[File:Parallel External Memory Model PEM.png|thumb|400x400px|PEM Model]]
In computer science, a '''parallel external memory (PEM) model''' is a [[Cache-aware model|cache-aware]], external-memory [[abstract machine]].<ref name=":0">{{Cite book|last1=Arge|first1=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|chapter=Fundamental parallel algorithms for private-cache chip multiprocessors|title=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures|pages=197–206|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739|s2cid=11067041}}</ref> It is the parallel-computing analogue to the single-processor [[External memory algorithm|external memory]] (EM) model and, in the same way, the cache-aware analogue to the [[parallel random-access machine]] (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.
 
__TOC__
== Model ==
=== Definition ===
The PEM model<ref name=":0" /> is a combination of the EM model and the PRAM model. It consists of <math>P</math> processors and a two-level [[memory hierarchy]]: a large [[External memory algorithm|external memory]] (main memory) of size <math>N</math> and <math>P</math> small [[Cache (computing)|internal memories (caches)]]. The processors share the main memory, while each cache is exclusive to a single processor; a processor cannot access another processor's cache. Each cache has size <math>M</math> and is partitioned into blocks of size <math>B</math>. A processor can only perform operations on data that resides in its cache; data is transferred between the main memory and the caches in blocks of size <math>B</math>.
 
=== I/O complexity ===
The [[Programming complexity|complexity measure]] of the PEM model is the I/O complexity,<ref name=":0" /> which counts the number of parallel block transfers between the main memory and the caches. During a parallel block transfer, each processor can transfer one block. So if <math>P</math> processors each load a data block of size <math>B</math> from the main memory into their caches in parallel, this counts as an I/O complexity of <math>O(1)</math>, not <math>O(P)</math>. A program in the PEM model should minimize data transfers between main memory and caches and operate as much as possible on data already in the caches.
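This counting convention can be illustrated with a small cost-model sketch in Python; the class and all names are illustrative, not part of the model's literature — a parallel round of block transfers costs the maximum over processors, not the sum.

```python
import math

# Toy cost model for a PEM machine: P processors, cache size M, block
# size B. One "parallel I/O" lets every processor move one block, so P
# simultaneous block loads still cost a single I/O. Illustrative only.
class PEMCostModel:
    def __init__(self, P, M, B):
        self.P, self.M, self.B = P, M, B
        self.parallel_ios = 0

    def parallel_transfer(self, blocks_per_processor):
        """Each processor moves its own number of blocks; the round
        costs the maximum over processors (transfers run in lock-step)."""
        self.parallel_ios += max(blocks_per_processor)

# Scanning N items split evenly over P processors costs ceil(N/(P*B))
# parallel I/Os instead of N/B sequential ones.
pem = PEMCostModel(P=4, M=64, B=8)
N = 256
per_proc = math.ceil(N / pem.P / pem.B)   # blocks each processor reads
pem.parallel_transfer([per_proc] * pem.P)
print(pem.parallel_ios)  # 8
```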
 
=== Read/write conflicts ===
In the PEM model, there is no [[Computer network|direct communication network]] between the <math>P</math> processors; they have to communicate indirectly through the main memory. If multiple processors try to access the same block in main memory concurrently, read/write conflicts<ref name=":0" /> occur. As in the PRAM model, three variations of this problem are considered:
* Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
* Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
* Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.
The following two algorithms<ref name=":0" /> solve the CREW and EREW problem if <math>P \leq B</math> processors write to the same block simultaneously.
A first approach is to serialize the write operations: the processors write to the block one after another, resulting in a total of <math>P</math> parallel block transfers. A second approach needs <math>O(\log(P))</math> parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a [[Reduce (parallel pattern)|binary tree fashion]] and gradually combine the data into a single block: in the first round, <math>P</math> processors combine their blocks into <math>P/2</math> blocks; then <math>P/2</math> processors combine those into <math>P/4</math> blocks, and so on, until all the data is combined in one block.
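The binary-tree combining scheme can be sketched sequentially in Python. List concatenation stands in for merging two per-processor blocks, and the round counter tracks parallel block transfers (one per tree level); the function and its names are illustrative.

```python
# Sketch of the O(log P) write-combining scheme: instead of serializing
# P writes to one block, processors pair up and merge their private
# blocks in a binary-tree (reduction) fashion. Illustrative only.

def combine_tree(blocks):
    """Combine P per-processor blocks into one, counting the parallel
    block-transfer rounds (one round per tree level)."""
    rounds = 0
    while len(blocks) > 1:
        # In one parallel round, each even-indexed processor merges its
        # right neighbour's block into its own.
        paired = [blocks[i] + blocks[i + 1]
                  for i in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2 == 1:       # odd block out advances unchanged
            paired.append(blocks[-1])
        blocks = paired
        rounds += 1
    return blocks[0], rounds

merged, rounds = combine_tree([[i] for i in range(8)])
print(rounds)  # 3 rounds for P = 8, versus 8 transfers when serialized
```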
 
 
=== Multiway partitioning ===
Let <math>M=\{m_1,...,m_{d-1}\}</math> be a vector of <math>d-1</math> pivots sorted in increasing order and let {{mvar|A}} be an unordered set of <math>N</math> elements. A d-way partition<ref name=":0" /> of {{mvar|A}} is a set <math>\Pi=\{A_1,...,A_d\}</math>, where <math>\cup_{i=1}^d A_i = A</math> and <math>A_i\cap A_j=\emptyset</math> for <math>1\leq i<j\leq d</math>. <math>A_i</math> is called the i-th bucket; each element of <math>A_i</math> is greater than <math>m_{i-1}</math> and at most <math>m_{i}</math>. In the following algorithm<ref name=":0" /> the input is partitioned into <math>N/P</math>-sized contiguous segments <math>S_1,...,S_P</math> in main memory. Processor i primarily works on the segment <math>S_i</math>.
The multiway partitioning algorithm (<code>PEM_DIST_SORT</code><ref name=":0" />) uses a PEM [[prefix sum]] algorithm<ref name=":0" /> to calculate the prefix sum with the optimal <math>O\left(\frac{N}{PB} + \log P\right)</math> I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.
// Compute a d-way partition of the data segments <math>S_i</math> in parallel
'''for each''' processor i '''in parallel do'''
Read the vector of pivots {{mvar|M}} into the cache.
Partition <math>S_i</math> into d buckets and let vector <math>M_i=\{j_1^i,...,j_d^i\}</math> be the number of items in each bucket.
'''end for'''
'''end for'''
Using the prefix sums stored in <math>M_P</math>, the last processor P calculates the vector {{mvar|B}} of bucket sizes and returns it.
 
If the vector of <math>d=O\left(\frac{M}{B}\right)</math> pivots M and the input set A are located in contiguous memory, the d-way partitioning problem can be solved in the PEM model with <math>O\left(\frac{N}{PB} + \left\lceil \frac{d}{B} \right\rceil\log(P)+d\log(B)\right)</math> I/O complexity. The contents of the final buckets have to be located in contiguous memory.
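The bucketing step of a d-way partition can be sketched sequentially in Python. In the PEM algorithm each processor applies this logic to its own segment <math>S_i</math> and the per-processor bucket counts are then combined with prefix sums; this sketch shows only the bucketing itself, and its names are illustrative.

```python
from bisect import bisect_left

# Sequential sketch of d-way partitioning: given d-1 sorted pivots,
# distribute the items of A into d buckets, where bucket i holds items
# greater than pivot i-1 and at most pivot i. Illustrative only.

def multiway_partition(A, pivots):
    """Return buckets A_1..A_d induced by the sorted pivot vector."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in A:
        # bisect_left finds the first pivot >= x, i.e. x's bucket index
        buckets[bisect_left(pivots, x)].append(x)
    return buckets

buckets = multiway_partition([5, 1, 9, 3, 7, 2, 8], pivots=[3, 6])
print(buckets)  # [[1, 3, 2], [5], [9, 7, 8]]
```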
 
=== Selection ===
The [[selection problem]] is about finding the k-th smallest item in an unordered list {{mvar|A}} of size {{mvar|N}}.
The following code<ref name=":0" /> makes use of <code>PRAMSORT</code>, a PRAM-optimal sorting algorithm which runs in <math>O(\log N)</math> time, and <code>SELECT</code>, a cache-optimal single-processor selection algorithm.
'''if''' <math>N \leq P</math> '''then'''
//Find median of each <math>S_i</math>
'''for each''' processor {{mvar|i}} '''in parallel do'''
<math>m_i = \texttt{SELECT}(S_i, \frac{N}{2P}) </math>
'''end for'''
Under the assumption that the input is stored in contiguous memory, <code>PEMSELECT</code> has an I/O complexity of:
 
: <math>O\left(\frac{N}{PB} + \log (PB) \cdot \log\frac{N}{P}\right)</math>
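The recursive structure of the selection algorithm can be sketched sequentially in Python: each of the P segments contributes its median, a pivot is chosen from those medians, and the search recurses into the side containing the k-th smallest element. This mirrors only the recursion, not the parallel I/O behaviour, and all names are illustrative.

```python
# Sequential sketch of the PEM selection idea: segment medians yield a
# pivot, the input is partitioned around it, and the search recurses.
# Illustrative only; the PEM algorithm computes the pivot and the
# partition with parallel, cache-efficient subroutines.

def pem_select(A, k, P=4):
    """Return the k-th smallest element of A (1-indexed)."""
    if len(A) <= P:                    # base case: sort small inputs
        return sorted(A)[k - 1]
    step = max(1, len(A) // P)         # contiguous segments S_1..S_P
    medians = [sorted(A[i:i + step])[len(A[i:i + step]) // 2]
               for i in range(0, len(A), step)]
    pivot = sorted(medians)[len(medians) // 2]   # median of medians
    less = [x for x in A if x < pivot]
    if k <= len(less):
        return pem_select(less, k, P)
    equal = [x for x in A if x == pivot]
    if k <= len(less) + len(equal):
        return pivot
    return pem_select([x for x in A if x > pivot],
                      k - len(less) - len(equal), P)

print(pem_select(list(range(20, 0, -1)), 7))  # 7
```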
 
=== Distribution sort ===
[[Distribution sort]] partitions an input list {{mvar|A}} of size {{mvar|N}} into {{mvar|d}} disjoint buckets of similar size. Every bucket is then sorted recursively, and the results are combined into a fully sorted list.
 
If <math>P = 1</math> the task is delegated to a cache-optimal single-processor sorting algorithm.
 
Otherwise the following algorithm<ref name=":0" /> is used:
// Sample <math>\tfrac{4N}{\sqrt{d}}</math> elements from {{mvar|A}}
'''for each''' processor {{mvar|i}} '''in parallel do'''
'''if''' <math>M < |S_i|</math> '''then'''
<math>d = M/B</math>
Load <math>S_i</math> in {{mvar|M}}-sized pages and sort pages individually
'''else'''
<math>d = |S_i|</math>
Pack pivots in contiguous array <math>\mathcal{M}</math>
// Partition {{mvar|A}} around pivots into buckets <math>\mathcal{B}</math>
<math>\mathcal{B} = \texttt{PEMMULTIPARTITION}(A[1:N],\mathcal{M},\sqrt{d},P)</math>
// Recursively sort buckets
'''for''' <math>j = 1</math> '''to''' <math>\sqrt{d} + 1</math> '''in parallel do'''
recursively call <math>\texttt{PEMDISTSORT}</math> on bucket {{mvar|j}} of size <math>\mathcal{B}[j]</math>
using <math>O \left( \left \lceil \tfrac{\mathcal{B}[j]}{N / P} \right \rceil \right)</math> processors responsible for elements in bucket {{mvar|j}}
'''end for'''
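The overall recursion above can be sketched sequentially in Python: sample the input to pick roughly <math>\sqrt{d}</math> pivots, partition into buckets, and recurse on each bucket, delegating small inputs to a direct sort. Pivot sampling is simplified here, and all names are illustrative; the PEM algorithm samples and sorts the pivots more carefully and runs the buckets in parallel.

```python
import math
import random
from bisect import bisect_left

# Sequential sketch of distribution sort: sample pivots, d-way
# partition, recurse per bucket. Illustrative only.

def dist_sort(A, d=16, cutoff=8):
    if len(A) <= cutoff:                      # delegate small inputs
        return sorted(A)
    k = max(2, math.isqrt(d))                 # about sqrt(d) pivots
    sample = sorted(random.sample(A, min(len(A), 4 * k)))
    step = max(1, len(sample) // k)
    pivots = sorted(set(sample[step::step]))  # evenly spaced sample picks
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in A:                               # d-way partition around pivots
        buckets[bisect_left(pivots, x)].append(x)
    if max(len(b) for b in buckets) == len(A):  # degenerate sample: fall back
        return sorted(A)
    return [x for b in buckets for x in dist_sort(b, d, cutoff)]

print(dist_sort([9, 4, 7, 1, 8, 2, 6, 3, 5, 0, 11, 10]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```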
 
The I/O complexity of <code>PEMDISTSORT</code> is:
 
:<math>O \left( \left \lceil \frac{N}{PB} \right \rceil \left ( \log_d P + \log_{M/B} \frac{N}{PB} \right ) + f(N,P,d) \cdot \log_d P \right)</math>
 
where
 
:<math>f(N,P,d) = O \left ( \log \frac{PB}{\sqrt{d}} \log \frac{N}{P} + \left \lceil \frac{\sqrt{d}}{B} \log P + \sqrt{d} \log B \right \rceil \right )</math>
 
If the number of processors is chosen such that <math>f(N,P,d) = O\left ( \left \lceil \tfrac{N}{PB} \right \rceil \right )</math> and <math>M < B^{O(1)}</math>, the I/O complexity is then:
|<math>P \leq \frac{N}{B^2}, M = B^{O(1)}</math>
|-
![[List ranking]]<ref name=":1">{{Cite book|last1=Arge|first1=Lars|last2=Goodrich|first2=Michael T.|last3=Sitchinava|first3=Nodari|date=2010|chapter=Parallel external memory graph algorithms|title=2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)|pages=1–11|publisher=IEEE|doi=10.1109/ipdps.2010.5470440|isbn=9781424464425|s2cid=587572}}</ref>
|<math>O \left ( \textrm{sort}_P(N) \right )</math>
|<math>P \leq \frac{N/B^2}{\log B \cdot \log^{O(1)} N}, M = B^{O(1)}</math>
|<math>P \leq \frac{|V|+|E|}{B^2 \log B \cdot \log^{O(1)} N}, M = B^{O(1)}</math>
|}
Where <math>\textrm{sort}_P(N)</math> is the time it takes to sort {{mvar|N}} items with {{mvar|P}} processors in the PEM model.
 
== See also ==