Parallel external memory: Difference between revisions

Content deleted Content added
Merch173 (talk | contribs)
Merch173 (talk | contribs)
Line 20:
<!-- Discuss if code folding is ok with respect to: https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style#Scrolling_lists_and_collapsible_content -->
 
=== Prefixsum ===
Let A be an ordered set of N elements. The [[Prefix sum | prefix sum]] of A is an ordered set B of N elements, with <math display="inline">B[i]=\sum_{j=0}^i A[j]</math> and <math display="inline">0 \leq i < N</math>. If the input set A is located in continuous [[External memory algorithm |main memory]], the [[Prefix sum | prefix sum]] of A can be calculated in the PEM model with the optimal <math>O(\frac{N}{PB} + \log(P))</math> I/O complexity.<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|url=http://dx.doi.org/10.1145/1378533.1378573|journal=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> This optimal I/O complexity can be accomplished by simulating an optimal [[Parallel Random Access Machine | PRAM]] prefix sum algorithm in the PEM model.<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|url=http://dx.doi.org/10.1145/1378533.1378573|journal=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref>
 
=== Multiway partitioning ===
Let <math>M=\{m_1,...,m_{d-1}\}</math> be a vector of d-1 pivots sorted in increasing order. Let <math>A</math> be am unordered set of N elements. A d-way partition<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|url=http://dx.doi.org/10.1145/1378533.1378573|journal=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> of <math>A</math> is a set <math>\Pi=\{A_1,...,A_d\}</math> , where <math>\cup_{i=1}^d A_i = A</math> and <math>A_i\cap A_j=\emptyset</math> for <math>1\leq i<j\leq d</math>. <math>A_i</math> is called the i-th bucket. The number of elements in <math>A_i</math> is greater than <math>m_{i-1}</math> and smaller than <math>m_{i}^2</math>. In the following algorithm<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|url=http://dx.doi.org/10.1145/1378533.1378573|journal=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> the input is partitioned into N/P-sized contiguous segments <math>S_1,...,S_P</math> in main memory. The processor i primarily works on the segment <math>S_i</math>. The multiway partitioning algorithm (<code>PEM_DIST_SORT</code>)<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|url=http://dx.doi.org/10.1145/1378533.1378573|journal=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> uses a PEM [[Prefix sum | prefix sum]] algorithm to calculate the prefix sum with the optimal <math>O(\frac{N}{PB} + \log(P))</math> I/O complexity.
//compute parallelly a d-way partition on the data segments <math>S_i</math>
'''for each''' processor i '''in parallel do'''
Read the vector of pivots <math>M</math> into the cache.
Partition <math>S_i</math> into d buckets and let vector <math>M_i=\{j_1^i,...,j_d^i\}</math> bet the number of items in each bucket.
'''end for'''
Run PEM prefix sum on the set of vectors <math>\{M_1,...,M_P\}</math> simultaneously.
//use the prefix sum vector to compute the final partition
'''for each''' processor i '''in parallel do'''
Write elements <math>S_i</math> into memory locations offset appropriately by <math>M_{i-1}</math> and <math>M_{i}</math>.
'''end for'''
 
Using the prefix sums stored in <math>M_P</math> the last processor P calculates the vector <math>B</math> of bucket sizes and returns it.
 
If the vector of <math>d=O(\frac{M}{B})</math> pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with <math>O(\frac{N}{PB} + \lceil \frac{d}{B} \rceil>\log(P)+d\log(B))</math> I/O complexity. The content of the final buckets have to be located in contiguous memory.