Parallel external memory

 
== Model ==
=== Definition ===
The '''parallel external memory (PEM) model'''<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|url=http://dx.doi.org/10.1145/1378533.1378573|journal=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> is a combination of the [[External memory algorithm|external memory]] (EM) model and the [[Parallel Random Access Machine|parallel random-access machine]] (PRAM) model. It is a computation model consisting of P processors and a two-level [[memory hierarchy]]: a large [[External memory algorithm|external memory]] (main memory) of size N, shared by all processors, and P small [[Cache (computing)|internal memories (caches)]]. Each cache is exclusive to a single processor; a processor cannot access another processor's cache. Each cache has size M and is partitioned into blocks of size B. A processor can only perform operations on data residing in its own cache, and data is transferred between main memory and the caches in blocks of size B.
 
=== I/O complexity ===
The [[Programming complexity|complexity measure]] of the PEM model is the I/O complexity<ref name=":0" />, which counts the number of parallel block transfers between the [[External memory algorithm|main memory]] and the [[Cache (computing)|caches]]. During a parallel block transfer, each processor can transfer one block. So if P processors each load a data block of size B from main memory into their caches in parallel, this counts as an I/O complexity of <math>O(1)</math>, not <math>O(P)</math>. A program in the PEM model should minimize data transfer between main memory and the caches and operate as much as possible on the data already in the caches.
=== Read / Write conflicts ===
In the PEM model, there is no [[Computer network|direct communication network]] between the P processors; they have to communicate indirectly over the [[External memory algorithm|main memory]]. If multiple processors try to access the same block in main memory concurrently, read/write conflicts<ref name=":0" /> occur. As in the [[parallel random-access machine|PRAM model]], three variations of this problem are considered:
*Concurrent Read Concurrent Write (CRCW): The same block in [[External memory algorithm |main memory]] can be read and written by multiple processors concurrently.
 
*Concurrent Read Exclusive Write (CREW): The same block in [[External memory algorithm |main memory]] can be read by multiple processors concurrently. Only one processor can write to a block at a time.
*Exclusive Read Exclusive Write (EREW): The same block in [[External memory algorithm |main memory]] cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.
The following two algorithms<ref name=":0" /> solve the CREW and EREW problem if P ≤ B processors write to the same block simultaneously.
A first approach serializes the writes: the processors write to the block one after another, resulting in a total of P parallel block transfers. A second approach needs <math>O(\log(P))</math> parallel block transfers and one additional block for each processor. The main idea is to schedule the writes in a binary-tree fashion and gradually combine the data into a single block: in the first round, P processors combine their blocks into P/2 blocks; then P/2 processors combine those into P/4 blocks, and so on, until all the data is combined in one block.
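The binary-tree combining scheme can be sketched with a minimal sequential simulation (the names <code>tree_combine</code> and <code>combine</code> are illustrative, and <code>combine</code> here simply concatenates block contents; a real PEM program would perform each round's combines in parallel):

```python
def combine(a, b):
    """Merge the contents of two blocks (here: simple concatenation)."""
    return a + b

def tree_combine(blocks):
    """Combine P blocks pairwise in a binary-tree fashion.

    Each round halves the number of active blocks, so after
    ceil(log2(P)) rounds a single block remains -- matching the
    O(log P) parallel block transfers of the second approach.
    """
    while len(blocks) > 1:
        nxt = []
        for i in range(0, len(blocks) - 1, 2):
            nxt.append(combine(blocks[i], blocks[i + 1]))
        if len(blocks) % 2 == 1:  # an odd block carries over to the next round
            nxt.append(blocks[-1])
        blocks = nxt
    return blocks[0]
```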
 
== Examples ==
 
=== Prefixsum ===
Let A be an ordered set of N elements. The [[prefix sum]] of A is an ordered set B of N elements, with <math display="inline">B[i]=\sum_{j=0}^i A[j]</math> and <math display="inline">0 \leq i < N</math>. If the input set A is located in contiguous [[External memory algorithm|main memory]], the prefix sum of A can be calculated in the PEM model with the optimal <math>O(\frac{N}{PB} + \log(P))</math> I/O complexity.<ref name=":0" /> This optimal I/O complexity can be accomplished by simulating an optimal [[Parallel Random Access Machine|PRAM]] prefix sum algorithm in the PEM model.<ref name=":0" />
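The segment-based strategy behind this bound can be illustrated with a short sequential sketch (<code>pem_prefix_sum</code> is a hypothetical name; the comments indicate which phase each loop corresponds to in the PEM model):

```python
def pem_prefix_sum(a, p):
    """Compute inclusive prefix sums B[i] = A[0] + ... + A[i].

    Mirrors the PEM strategy: split A into p contiguous segments,
    scan each segment locally, prefix-sum the p partial totals,
    then rescan each segment with its offset applied.
    """
    n = len(a)
    size = -(-n // p)  # ceil(n / p) elements per segment
    segments = [a[i * size:(i + 1) * size] for i in range(p)]

    local = [sum(seg) for seg in segments]  # local pass: O(N/(PB)) I/Os per processor

    offsets, acc = [], 0
    for total in local:                     # combine phase: O(log P) in the PEM model
        offsets.append(acc)
        acc += total

    out = []
    for seg, off in zip(segments, offsets): # second local pass with offsets
        run = off
        for x in seg:
            run += x
            out.append(run)
    return out
```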
 
=== Multiway partitioning ===
Let <math>M=\{m_1,...,m_{d-1}\}</math> be a vector of d-1 pivots sorted in increasing order. Let <math>A</math> be an unordered set of N elements. A d-way partition<ref name=":0" /> of <math>A</math> is a set <math>\Pi=\{A_1,...,A_d\}</math>, where <math>\cup_{i=1}^d A_i = A</math> and <math>A_i\cap A_j=\emptyset</math> for <math>1\leq i<j\leq d</math>. <math>A_i</math> is called the i-th bucket. Every element in <math>A_i</math> is greater than <math>m_{i-1}</math> and at most <math>m_{i}</math>. In the following algorithm<ref name=":0" /> the input is partitioned into N/P-sized contiguous segments <math>S_1,...,S_P</math> in main memory. Processor i primarily works on the segment <math>S_i</math>.
'''for each''' processor i '''in parallel do'''
 Read the vector of pivots <math>M</math> into the cache.
 Partition <math>S_i</math> into d buckets and let vector <math>M_i=\{j_1^i,...,j_d^i\}</math> be the number of items in each bucket.
'''end for'''
Run PEM prefix sum on the set of vectors <math>\{M_1,...,M_P\}</math> simultaneously.
'''for each''' processor i '''in parallel do'''
 Write the elements of <math>S_i</math> into the memory locations offset appropriately by <math>M_{i-1}</math> and <math>M_{i}</math>.
'''end for'''
Using the prefix sums stored in <math>M_P</math>, the last processor P calculates the vector <math>B</math> of bucket sizes and returns it.
If the vector of <math>d=O(\frac{M}{B})</math> pivots M and the input set A are located in contiguous memory, the d-way partitioning problem can be solved in the PEM model with <math>O\left(\frac{N}{PB} + \left\lceil \frac{d}{B} \right\rceil \log(P)+d\log(B)\right)</math> I/O complexity. The contents of the final buckets have to be located in contiguous memory.
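The per-segment partitioning step can be illustrated with a short sequential sketch (<code>d_way_partition</code> is an illustrative name, not from the source): bucket i receives the elements greater than the (i-1)-th pivot and at most the i-th pivot, with the first and last buckets unbounded on one side.

```python
import bisect

def d_way_partition(a, pivots):
    """Partition the elements of a into len(pivots) + 1 buckets.

    pivots must be sorted in increasing order. bisect_left places an
    element x into bucket i exactly when pivots[i-1] < x <= pivots[i].
    """
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:
        buckets[bisect.bisect_left(pivots, x)].append(x)
    return buckets
```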
 
=== Selection ===