Cache prefetching: Difference between revisions

* Stream [[Data buffer|buffers]] are one of the most common hardware based prefetching techniques in use. This technique was originally proposed by [[Norman Jouppi]] in 1990<ref name=":1">{{cite conference |last=Jouppi |first=Norman P. |year=1990 |title=Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers |conference=17th annual international symposium on Computer Architecture – ISCA 1990 |___location=New York, New York, USA |publisher=Association for Computing Machinery Press |pages=364–373 |citeseerx=10.1.1.37.6114 |doi=10.1145/325164.325162 |isbn=0-89791-366-3 |book-title=Proceedings of the 17th annual international symposium on Computer Architecture – ISCA 1990}}</ref> and many variations of this method have been developed since.<ref>{{Cite journal|last1=Chen|first1=Tien-Fu|last2=Baer|first2=Jean-Loup|s2cid=1450745|date=1995-05-01|title=Effective hardware-based data prefetching for high-performance processors|journal=IEEE Transactions on Computers|volume=44|issue=5|pages=609–623|doi=10.1109/12.381947|issn=0018-9340}}</ref><ref>{{Cite conference |last1=Palacharla |first1=S. |last2=Kessler |first2=R. E. |date=1994-01-01 |title=Evaluating Stream Buffers As a Secondary Cache Replacement |conference=21st Annual International Symposium on Computer Architecture |___location=Chicago, Illinois, USA |publisher=IEEE Computer Society Press |pages=24–33 |citeseerx=10.1.1.92.3031 |doi=10.1109/ISCA.1994.288164 |isbn=978-0818655104}}</ref><ref name="grannaes">{{cite journal| last1=Grannaes | first1=Marius | last2=Jahre | first2=Magnus | last3=Natvig | first3=Lasse | title=Storage Efficient Hardware Prefetching using Delta-Correlating Prediction Tables |citeseerx=10.1.1.229.3483 |journal=Journal of Instruction-Level Parallelism |issue=13 |year=2011 |pages=1–16}}</ref> The basic idea is that the [[cache miss]] address (and <math>k</math> subsequent addresses) are fetched into a separate buffer of depth <math>k</math>. 
This buffer is called a stream buffer and is separate from the cache. The processor then consumes data/instructions from the stream buffer if the address associated with the prefetched blocks match the requested address generated by the program executing on the processor. The figure below illustrates this setup:
[[File:CachePrefetching_StreamBuffers.png|center|A typical stream buffer setup as originally proposed by Norman Jouppi in 1990<ref name=":1" />|alt=A typical stream buffer setup as originally proposed|thumb|upright=1.4]]
* Whenever the prefetch mechanism detects a miss on a memory block, say A, it allocates a stream to begin prefetching successive blocks from the missed block onward. If the stream buffer can hold 4 blocks, then the processor would prefetch A+1, A+2, A+3, A+4 and hold those in the allocated stream buffer. If the processor consumes A+1 next, then it is moved "up" from the stream buffer to the processor's cache. The first entry of the stream buffer would now be A+2 and so on. This pattern of prefetching successive blocks is called '''Sequential Prefetching'''. It is mainly used when contiguous locations are to be prefetched. For example, it is used when prefetching instructions.
* This mechanism can be scaled up by adding multiple such 'stream buffers' - each of which would maintain a separate prefetch stream.<ref>{{Cite conference |last1=Ishii |first1=Yasuo |last2=Inaba |first2=Mary |last3=Hiraki |first3=Kei |date=2009-06-08 |title=Access map pattern matching for data cache prefetch |url=https://doi.org/10.1145/1542275.1542349 |conference=ICS 2009 |___location=New York, New York, USA |publisher=Association for Computing Machinery |pages=499–500 |doi=10.1145/1542275.1542349 |isbn=978-1-60558-498-0 |book-title=Proceedings of the 23rd International Conference on Supercomputing |s2cid=37841036}}</ref> For each new miss, there would be a new stream buffer allocated and it would operate in a similar way as described above.
* The ideal depth of the stream buffer is something that is subject to experimentation against various benchmarks<ref name=":1" /> and depends on the rest of the [[microarchitecture]] involved.<ref>{{Cite conference |last1=Srinath |first1=Santhosh |last2=Mutlu |first2=Onur |last3=Kim |first3=Hyesoon |author3-link=Hyesoon Kim|last4=Patt |first4=Yale N.|author4-link=Yale Patt |date=February 2007 |title=Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers |url=https://ieeexplore.ieee.org/document/4147648 |conference=2007 IEEE 13th International Symposium on High Performance Computer Architecture |pages=63–74 |doi=10.1109/HPCA.2007.346185|isbn=978-1-4244-0804-7 |s2cid=6909725 }}</ref>
<syntaxhighlight lang="c++">
for (int i=0; i<1024; i++) {
    array1[i] = 2 * array1[i];
}
</syntaxhighlight>At each iteration, the i<sup>th</sup> element of the array "array1" is accessed. Therefore, the system can prefetch the elements that are going to be accessed in future iterations by inserting a "prefetch" instruction as shown below:<syntaxhighlight lang="c++">
for (int i=0; i<1024; i++) {
prefetch(array1[i + k]);
array1[i] = 2 * array1[i];
}
</syntaxhighlight>Here, the prefetch stride, <math>k</math>, depends on two factors: the cache miss penalty and the time it takes to execute a single iteration of the '''''for''''' loop. For instance, if one iteration of the loop takes 7 cycles to execute, and the cache miss penalty is 49 cycles, then there should be <math>k = 49/7 = 7</math> - which means that the system should prefetch 7 elements ahead. With the first iteration, i will be 0, so the system prefetches the 7th element. Now, with this arrangement, the first 7 accesses (i=0->6) will still be misses (under the simplifying assumption that each element of array1 is in a separate cache line of its own).
 
== Comparison of hardware and software prefetching ==
<math>\text{Prefetch Accuracy} = \frac{\text{Cache Misses eliminated by prefetching}}{(\text{Useless Cache Prefetches}) + (\text{Cache Misses eliminated by prefetching})}</math>
 
While it appears that having perfect accuracy might imply that there are no misses, this is not the case. The prefetches themselves might result in new misses if the prefetched blocks are placed directly into the cache. Although these may be a small fraction of the total number of misses observed without any prefetching, this is a non-zero number of misses.
 
=== Timeliness ===
The qualitative definition of timeliness is how early a block is prefetched versus when it is actually referenced. An example to further explain timeliness is as follows:
 
Consider a for loop where each iteration takes 3 cycles to execute and the 'prefetch' operation takes 12 cycles. This implies that for the prefetched data to be useful, the system must start the prefetch <math>12/3 = 4</math> iterations prior to its usage to maintain timeliness.
 
== See also ==