Revision as of 12:01, 23 January 2019 by Merch173 (→Read / Write conflicts); revision as of 12:11, 23 January 2019 by Merch173 (→Examples)
<!-- Discuss if code folding is ok with respect to: https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style#Scrolling_lists_and_collapsible_content -->
=== Prefix sum ===
Let A be an ordered set of N elements. The [[Prefix sum|prefix sum]] of A is an ordered set B of N elements, with <math display="inline">B[i]=\sum_{j=0}^i A[j]</math> and <math display="inline">0 \leq i < N</math>. If the input set A is located in contiguous [[External memory algorithm|main memory]], the prefix sum of A can be calculated in the PEM model with the optimal <math>O(\frac{N}{PB} + \log(P))</math> I/O complexity.<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|url=http://dx.doi.org/10.1145/1378533.1378573|journal=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08|location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> This optimal I/O complexity can be achieved by simulating an optimal [[Parallel Random Access Machine|PRAM]] prefix sum algorithm in the PEM model.<ref name=":0" />

=== Multiway partitioning ===
Let <math>M=\{m_1,...,m_{d-1}\}</math> be a vector of d-1 pivots sorted in increasing order. Let <math>A</math> be an unordered set of N elements.
A d-way partition<ref name=":0" /> of <math>A</math> is a set <math>\Pi=\{A_1,...,A_d\}</math>, where <math>\cup_{i=1}^d A_i = A</math> and <math>A_i\cap A_j=\emptyset</math> for <math>1\leq i<j\leq d</math>. <math>A_i</math> is called the i-th bucket. Every element of <math>A_i</math> is greater than <math>m_{i-1}</math> and smaller than or equal to <math>m_{i}</math> (for the first and the last bucket only one of the two bounds applies, since <math>m_0</math> and <math>m_d</math> are not defined). In the following algorithm<ref name=":0" /> the input is partitioned into N/P-sized contiguous segments <math>S_1,...,S_P</math> in main memory. Processor i primarily works on the segment <math>S_i</math>.
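
The bucket definition and the split into segments can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the helper names <code>bucket_index</code> and <code>split_into_segments</code> are hypothetical, buckets are numbered from 0, an element equal to a pivot is assumed to go to the lower bucket, and the sample pivots and data are made up.

```python
from bisect import bisect_left


def bucket_index(pivots, x):
    """Return the 0-based bucket of x for a sorted pivot list.

    The bucket is the number of pivots strictly smaller than x, so an
    element equal to a pivot lands in the lower bucket (an assumption
    made for this sketch).
    """
    return bisect_left(pivots, x)


def split_into_segments(a, p):
    """Split list a into p contiguous segments of about len(a)/p items."""
    size = -(-len(a) // p)  # ceiling division
    return [a[k * size:(k + 1) * size] for k in range(p)]


# Made-up sample input: d - 1 = 3 pivots, N = 8 elements, P = 4 processors.
pivots = [10, 20, 30]
data = [25, 3, 10, 31, 20, 11, 7, 42]

segments = split_into_segments(data, 4)
buckets = [bucket_index(pivots, x) for x in data]
```

With three pivots there are four buckets, and each of the four simulated processors would own one two-element segment.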
The multiway partitioning algorithm (<code>PEM_DIST_SORT</code>)<ref name=":0" /> uses a PEM [[Prefix sum|prefix sum]] algorithm to calculate the prefix sum with the optimal <math>O(\frac{N}{PB} + \log(P))</math> I/O complexity.

 // Compute a d-way partition of each data segment <math>S_i</math> in parallel
 '''for each''' processor i '''in parallel do'''
     Read the vector of pivots <math>M</math> into the cache.
     Partition <math>S_i</math> into d buckets and let vector <math>M_i=\{j_1^i,...,j_d^i\}</math> be the number of items in each bucket.
 '''end for'''
 
 Run PEM prefix sum on the set of vectors <math>\{M_1,...,M_P\}</math> simultaneously.
 
 // Use the prefix sum vector to compute the final partition
 '''for each''' processor i '''in parallel do'''
     Write elements <math>S_i</math> into memory locations offset appropriately by <math>M_{i-1}</math> and <math>M_{i}</math>.
 '''end for'''
 
 Using the prefix sums stored in <math>M_P</math> the last processor P calculates the vector <math>B</math> of bucket sizes and returns it.

If the vector of <math>d=O(\frac{M}{B})</math> pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with <math>O(\frac{N}{PB} + \lceil \frac{d}{B} \rceil \log(P)+d\log(B))</math> I/O complexity. The contents of the final buckets must be located in contiguous memory.
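
The three phases of the algorithm (local counting, prefix sums over the count vectors, offset writes) can be traced with a sequential Python simulation. This is a minimal sketch, not the cited implementation: the processors are simulated by an ordinary loop, caches and block transfers are not modelled, <code>itertools.accumulate</code> stands in for the PEM prefix sum, elements equal to a pivot are assumed to go to the lower bucket, and the function name <code>multiway_partition</code> and the sample data are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def multiway_partition(a, pivots, p):
    """Sequentially simulate a d-way partition over p >= 1 'processors'.

    Returns the partitioned list (buckets stored contiguously, in order)
    and the list of bucket sizes.
    """
    d = len(pivots) + 1
    size = -(-len(a) // p)  # ceiling division: segment length
    segments = [a[k * size:(k + 1) * size] for k in range(p)]

    # Phase 1: each processor counts how many of its items fall in each bucket.
    counts = [[0] * d for _ in range(p)]
    for i, seg in enumerate(segments):
        for x in seg:
            counts[i][bisect_left(pivots, x)] += 1

    # Phase 2: prefix sums of the count vectors, taken per bucket across
    # processors. prefix[b][i] = items of bucket b in segments 0..i.
    prefix = [list(accumulate(counts[i][b] for i in range(p)))
              for b in range(d)]
    bucket_sizes = [prefix[b][p - 1] for b in range(d)]
    bucket_start = [0] + list(accumulate(bucket_sizes))[:-1]

    # Phase 3: each processor writes its items at its private offsets.
    out = [None] * len(a)
    for i, seg in enumerate(segments):
        cursor = [bucket_start[b] + (prefix[b][i - 1] if i > 0 else 0)
                  for b in range(d)]
        for x in seg:
            b = bisect_left(pivots, x)
            out[cursor[b]] = x
            cursor[b] += 1
    return out, bucket_sizes


# Made-up sample input: 3 pivots (4 buckets), 8 elements, 4 processors.
result, sizes = multiway_partition([25, 3, 10, 31, 20, 11, 7, 42],
                                   [10, 20, 30], 4)
# result == [3, 10, 7, 20, 11, 25, 31, 42], sizes == [3, 2, 1, 2]
```

Because every processor derives its write offsets from the prefix sums alone, no two simulated processors write to the same output position, and within a bucket the elements keep their original relative order.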

Parallel external memory: Difference between revisions