Parallel external memory: Difference between revisions

Content deleted Content added
Format code
WikiCleanerBot (talk | contribs)
m v2.04b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation - Link equal to linktext)
Line 6:
== Model ==
=== Definition ===
The PEM model<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|journal=Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures - SPAA '08|pages=197|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of <math>P</math> processors and a two-level [[Memory hierarchy | memory hierarchy]]. This memory hierarchy consists of a large [[External memory algorithm | external memory]] (main memory) of size <math>N</math> and <math>P</math> small [[Cache (computing) | internal memories (caches)]]. The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size <math>M</math> which is partitioned in blocks of size <math>B</math>. The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size <math>B</math>.
 
=== I/O complexity ===
The [[Programming complexity | complexity measure]] of the PEM model is the I/O complexity,<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|journal=Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures - SPAA '08|pages=197|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref>, which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if <math>P</math> processors load parallelly a data block of size <math>B</math> form the main memory into their caches, it is considered as an I/O complexity of <math>O(1)</math> not <math>O(P)</math>. A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.
 
=== Read/write conflicts ===
Line 48:
 
=== Multiway partitioning ===
Let <math>M=\{m_1,...,m_{d-1}\}</math> be a vector of d-1 pivots sorted in increasing order. Let <math>A</math> be an unordered set of N elements. A d-way partition<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|journal=Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures - SPAA '08|pages=197|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> of <math>A</math> is a set <math>\Pi=\{A_1,...,A_d\}</math> , where <math>\cup_{i=1}^d A_i = A</math> and <math>A_i\cap A_j=\emptyset</math> for <math>1\leq i<j\leq d</math>. <math>A_i</math> is called the i-th bucket. The number of elements in <math>A_i</math> is greater than <math>m_{i-1}</math> and smaller than <math>m_{i}^2</math>. In the following algorithm<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|journal=Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures - SPAA '08|pages=197|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> the input is partitioned into N/P-sized contiguous segments <math>S_1,...,S_P</math> in main memory. The processor i primarily works on the segment <math>S_i</math>. The multiway partitioning algorithm (<code>PEM_DIST_SORT</code><ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|journal=Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures - SPAA '08|pages=197|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref>) uses a PEM [[Prefix sum | prefix sum]] algorithm<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|journal=Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures - SPAA '08|pages=197|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> to calculate the prefix sum with the optimal <math>O(\frac{N}{PB} + \log(P))</math> I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.
// Compute parallelly a d-way partition on the data segments <math>S_i</math>
'''for each''' processor i '''in parallel do'''