Revision as of 13:41, 21 October 2020 by Frap (edit summary: Format code)
Revision as of 21:04, 6 January 2021 by WikiCleanerBot (edit summary: m v2.04b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation - Link equal to linktext); Tag: WPCleaner)
Line 6: == Model == === Definition === The PEM model<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|journal=Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures - SPAA '08|pages=197|location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> is a combination of the EM model and the PRAM model. It is a computation model that consists of <math>P</math> processors and a two-level [[memory hierarchy]]. This memory hierarchy consists of a large [[External memory algorithm|external memory]] (main memory) of size <math>N</math> and <math>P</math> small [[Cache (computing)|internal memories (caches)]]. The processors share the main memory. Each cache is exclusive to a single processor; a processor cannot access another processor's cache. Each cache has size <math>M</math> and is partitioned into blocks of size <math>B</math>. The processors can only perform operations on data that are in their own cache. Data are transferred between the main memory and a cache in blocks of size <math>B</math>.
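To make the parameters concrete, the following Python sketch models a PEM configuration. The class name <code>PEMConfig</code> and its method names are hypothetical, chosen only for illustration, and are not taken from the cited paper. Under the assumption that data are contiguous and block-aligned, it derives how many blocks fit into one cache and how many block transfers a single processor needs to bring a run of elements into its cache.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class PEMConfig:
    """Hypothetical container for the PEM parameters (names are assumptions)."""

    P: int  # number of processors, each with its own private cache
    M: int  # size of one cache, in elements
    B: int  # block size, in elements
    N: int  # size of the shared main memory, in elements

    def blocks_per_cache(self) -> int:
        # A cache of size M is partitioned into blocks of size B.
        return self.M // self.B

    def transfers_for(self, n: int) -> int:
        # Data move between main memory and a cache only in whole blocks,
        # so n contiguous, block-aligned elements need ceil(n / B) transfers.
        return ceil(n / self.B)


cfg = PEMConfig(P=4, M=1024, B=64, N=1_000_000)
print(cfg.blocks_per_cache())   # 16
print(cfg.transfers_for(1000))  # 16
```

With these example values, 1000 elements occupy 16 blocks, which is exactly the capacity of one cache.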
=== I/O complexity === The [[Programming complexity|complexity measure]] of the PEM model is the I/O complexity,<ref name=":0" /> which counts the number of parallel block transfers between the main memory and the caches. During a parallel block transfer, each processor can transfer one block. So if <math>P</math> processors each load a data block of size <math>B</math> from the main memory into their caches in parallel, this counts as an I/O complexity of <math>O(1)</math>, not <math>O(P)</math>. A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches. === Read/write conflicts === Line 48: === Multiway partitioning === Let <math>M=\{m_1,...,m_{d-1}\}</math> be a vector of d-1 pivots sorted in increasing order. Let <math>A</math> be an unordered set of N elements.
A d-way partition<ref name=":0" /> of <math>A</math> is a set <math>\Pi=\{A_1,...,A_d\}</math>, where <math>\cup_{i=1}^d A_i = A</math> and <math>A_i\cap A_j=\emptyset</math> for <math>1\leq i<j\leq d</math>. <math>A_i</math> is called the i-th bucket. Every element in <math>A_i</math> is greater than <math>m_{i-1}</math> and smaller than <math>m_{i}</math>, where the missing boundary pivots are read as <math>m_0=-\infty</math> and <math>m_d=+\infty</math>. In the following algorithm<ref name=":0" /> the input is partitioned into N/P-sized contiguous segments <math>S_1,...,S_P</math> in main memory. Processor i primarily works on the segment <math>S_i</math>.
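The definition can be illustrated with a short sequential Python sketch. It is not the I/O-efficient PEM procedure from the cited paper; the function names <code>d_way_partition</code> and <code>contiguous_segments</code> are hypothetical, and the sketch assumes that no element equals a pivot (an element equal to a pivot would be placed in the lower bucket here). It also shows that partitioning each segment separately and concatenating the per-segment buckets gives the same result as partitioning the whole input.

```python
from bisect import bisect_left


def d_way_partition(elements, pivots):
    """Split elements into len(pivots) + 1 buckets (pivots must be sorted)."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in elements:
        # bisect_left counts the pivots strictly smaller than x,
        # which is the 0-based index of the bucket that x belongs to.
        buckets[bisect_left(pivots, x)].append(x)
    return buckets


def contiguous_segments(elements, p):
    """Cut the input into p contiguous segments of about N/P elements."""
    size = -(-len(elements) // p)  # ceiling division
    return [elements[i * size:(i + 1) * size] for i in range(p)]


A = [7, 2, 9, 4, 1, 8, 5, 10]
pivots = [3, 6]  # d - 1 = 2 pivots, so d = 3 buckets

buckets = d_way_partition(A, pivots)
print(buckets)   # [[2, 1], [4, 5], [7, 9, 8, 10]]

segments = contiguous_segments(A, 4)  # one segment per (simulated) processor
print(segments)  # [[7, 2], [9, 4], [1, 8], [5, 10]]

# Each simulated processor partitions its own segment; bucket j of the
# result is the concatenation of the j-th buckets of all segments.
per_segment = [d_way_partition(s, pivots) for s in segments]
merged = [sum((ps[j] for ps in per_segment), []) for j in range(len(pivots) + 1)]
print(merged == buckets)  # True
```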
The multiway partitioning algorithm (<code>PEM_DIST_SORT</code><ref name=":0" />) uses a PEM [[prefix sum]] algorithm<ref name=":0" /> to calculate the prefix sum with the optimal <math>O(\frac{N}{PB} + \log(P))</math> I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm. // Compute a d-way partition on each data segment <math>S_i</math> in parallel '''for each''' processor i '''in parallel do'''

Parallel external memory: Difference between revisions