Parallel external memory

 
== Model ==
=== Definition ===
The '''parallel external memory (PEM) model'''<ref name=":0">{{Cite journal|last=Arge|first=Lars|last2=Goodrich|first2=Michael T.|last3=Nelson|first3=Michael|last4=Sitchinava|first4=Nodari|date=2008|title=Fundamental parallel algorithms for private-cache chip multiprocessors|url=http://dx.doi.org/10.1145/1378533.1378573|journal=Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08|___location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1378533.1378573|isbn=9781595939739}}</ref> is a combination of the [[External memory algorithm|external memory]] (EM) model and the [[Parallel Random Access Machine|parallel random-access machine]] (PRAM) model. It is a computation model consisting of P processors and a two-level [[memory hierarchy]]: a large [[External memory algorithm|external memory]] (main memory) of size N, shared by all processors, and P small [[Cache (computing)|internal memories (caches)]]. Each cache is exclusive to a single processor; a processor cannot access another processor's cache. Each cache has size M and is partitioned into blocks of size B. A processor can only perform operations on data residing in its own cache, and data is transferred between main memory and the caches in blocks of size B.
 
=== I/O complexity ===
The [[Programming complexity|complexity measure]] of the PEM model is the I/O complexity<ref name=":0" />, which counts the number of parallel block transfers between the [[External memory algorithm|main memory]] and the [[Cache (computing)|caches]]. During a parallel block transfer, each processor can transfer one block. So if P processors each load a data block of size B from main memory into their caches in parallel, this counts as an I/O complexity of <math>O(1)</math>, not <math>O(P)</math>. A program in the PEM model should minimize data transfer between main memory and the caches and operate as much as possible on the data already in the caches.
=== Read / Write conflicts ===
In the PEM model, there is no [[Computer network|direct communication network]] between the P processors; they have to communicate indirectly over the [[External memory algorithm|main memory]]. If multiple processors try to access the same block in main memory concurrently, read/write conflicts<ref name=":0" /> occur. As in the [[parallel random-access machine|PRAM model]], three variations of this problem are considered:
*Concurrent Read Concurrent Write (CRCW): The same block in [[External memory algorithm |main memory]] can be read and written by multiple processors concurrently.
 
*Concurrent Read Exclusive Write (CREW): The same block in [[External memory algorithm |main memory]] can be read by multiple processors concurrently. Only one processor can write to a block at a time.
*Exclusive Read Exclusive Write (EREW): The same block in [[External memory algorithm |main memory]] cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.
The following two algorithms<ref name=":0" /> solve the CREW and EREW problem if P ≤ B processors write to the same block simultaneously.
A first approach serializes the writes: the processors write to the block one after another, resulting in a total of P parallel block transfers. A second approach needs <math>O(\log(P))</math> parallel block transfers and one additional block for each processor. The main idea is to schedule the writes in a binary-tree fashion and gradually combine the data into a single block: in the first round, P processors combine their blocks into P/2 blocks; then P/2 processors combine those into P/4 blocks, and so on, until all the data is combined in one block.
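The binary-tree combining scheme can be sketched with a minimal sequential simulation (the names <code>tree_combine</code> and <code>combine</code> are illustrative, and <code>combine</code> here simply concatenates block contents; a real PEM program would perform each round's combines in parallel):

```python
def combine(a, b):
    """Merge the contents of two blocks (here: simple concatenation)."""
    return a + b

def tree_combine(blocks):
    """Combine P blocks pairwise in a binary-tree fashion.

    Each round halves the number of active blocks, so after
    ceil(log2(P)) rounds a single block remains -- matching the
    O(log P) parallel block transfers of the second approach.
    """
    while len(blocks) > 1:
        nxt = []
        for i in range(0, len(blocks) - 1, 2):
            nxt.append(combine(blocks[i], blocks[i + 1]))
        if len(blocks) % 2 == 1:  # an odd block carries over to the next round
            nxt.append(blocks[-1])
        blocks = nxt
    return blocks[0]
```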
 
== Examples ==
 
=== Prefixsum ===
Let A be an ordered set of N elements. The [[prefix sum]] of A is an ordered set B of N elements, with <math display="inline">B[i]=\sum_{j=0}^i A[j]</math> and <math display="inline">0 \leq i < N</math>. If the input set A is located in contiguous [[External memory algorithm|main memory]], the prefix sum of A can be calculated in the PEM model with the optimal <math>O(\frac{N}{PB} + \log(P))</math> I/O complexity.<ref name=":0" /> This optimal I/O complexity can be accomplished by simulating an optimal [[Parallel Random Access Machine|PRAM]] prefix sum algorithm in the PEM model.<ref name=":0" />
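The segment-based strategy behind this bound can be illustrated with a short sequential sketch (<code>pem_prefix_sum</code> is a hypothetical name; the comments indicate which phase each loop corresponds to in the PEM model):

```python
def pem_prefix_sum(a, p):
    """Compute inclusive prefix sums B[i] = A[0] + ... + A[i].

    Mirrors the PEM strategy: split A into p contiguous segments,
    scan each segment locally, prefix-sum the p partial totals,
    then rescan each segment with its offset applied.
    """
    n = len(a)
    size = -(-n // p)  # ceil(n / p) elements per segment
    segments = [a[i * size:(i + 1) * size] for i in range(p)]

    local = [sum(seg) for seg in segments]  # local pass: O(N/(PB)) I/Os per processor

    offsets, acc = [], 0
    for total in local:                     # combine phase: O(log P) in the PEM model
        offsets.append(acc)
        acc += total

    out = []
    for seg, off in zip(segments, offsets): # second local pass with offsets
        run = off
        for x in seg:
            run += x
            out.append(run)
    return out
```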
 
=== Multiway partitioning ===
Let <math>M=\{m_1,...,m_{d-1}\}</math> be a vector of d-1 pivots sorted in increasing order. Let <math>A</math> be an unordered set of N elements. A d-way partition<ref name=":0" /> of <math>A</math> is a set <math>\Pi=\{A_1,...,A_d\}</math>, where <math>\cup_{i=1}^d A_i = A</math> and <math>A_i\cap A_j=\emptyset</math> for <math>1\leq i<j\leq d</math>. <math>A_i</math> is called the i-th bucket. Every element in <math>A_i</math> is greater than <math>m_{i-1}</math> and at most <math>m_{i}</math>. In the following algorithm<ref name=":0" /> the input is partitioned into N/P-sized contiguous segments <math>S_1,...,S_P</math> in main memory. Processor i primarily works on the segment <math>S_i</math>.
'''for each''' processor i '''in parallel do'''
 Read the vector of pivots <math>M</math> into the cache.
 Partition <math>S_i</math> into d buckets and let vector <math>M_i=\{j_1^i,...,j_d^i\}</math> be the number of items in each bucket.
'''end for'''
Run PEM prefix sum on the set of vectors <math>\{M_1,...,M_P\}</math> simultaneously.
'''for each''' processor i '''in parallel do'''
 Write the elements of <math>S_i</math> into the memory locations offset appropriately by <math>M_{i-1}</math> and <math>M_{i}</math>.
'''end for'''
Using the prefix sums stored in <math>M_P</math>, the last processor P calculates the vector <math>B</math> of bucket sizes and returns it.
If the vector of <math>d=O(\frac{M}{B})</math> pivots M and the input set A are located in contiguous memory, the d-way partitioning problem can be solved in the PEM model with <math>O\left(\frac{N}{PB} + \left\lceil \frac{d}{B} \right\rceil \log(P)+d\log(B)\right)</math> I/O complexity. The contents of the final buckets have to be located in contiguous memory.
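The per-segment partitioning step can be illustrated with a short sequential sketch (<code>d_way_partition</code> is an illustrative name, not from the source): bucket i receives the elements greater than the (i-1)-th pivot and at most the i-th pivot, with the first and last buckets unbounded on one side.

```python
import bisect

def d_way_partition(a, pivots):
    """Partition the elements of a into len(pivots) + 1 buckets.

    pivots must be sorted in increasing order. bisect_left places an
    element x into bucket i exactly when pivots[i-1] < x <= pivots[i].
    """
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:
        buckets[bisect.bisect_left(pivots, x)].append(x)
    return buckets
```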
 
=== Selection ===