{{short description|Computer processing technique to boost memory performance}}
'''Cache prefetching''' is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed (hence the term 'prefetch').<ref name=":3">{{Cite journal|last=Smith|first=Alan Jay|date=1982-09-01|title=Cache Memories|journal=ACM Comput. Surv.|volume=14|issue=3|pages=473–530|doi=10.1145/356887.356892|s2cid=6023466 |issn=0360-0300}}</ref><ref>{{Cite journal |last1=Li |first1=Chunlin |last2=Song |first2=Mingyang |last3=Du |first3=Shaofeng |last4=Wang |first4=Xiaohai |last5=Zhang |first5=Min |last6=Luo |first6=Youlong |date=2020-09-01 |title=Adaptive priority-based cache replacement and prediction-based cache prefetching in edge computing environment |url=https://linkinghub.elsevier.com/retrieve/pii/S1084804520301892 |journal=Journal of Network and Computer Applications |language=en |volume=165 |article-number=102715 |doi=10.1016/j.jnca.2020.102715|s2cid=219506427 |url-access=subscription }}</ref> Most modern computer processors have fast and local [[cache memory]] in which prefetched data is held until it is required.
== Data vs. instruction cache prefetching ==
Cache prefetching can be accomplished either by hardware or by software.<ref name=":2" />
* '''Hardware based prefetching''' is typically accomplished by having a dedicated hardware mechanism in the processor that watches the stream of instructions or data being requested by the executing program, recognizes the next few elements that the program might need based on this stream and prefetches into the processor's cache.<ref>{{Cite conference |last1=Baer |first1=Jean-Loup |last2=Chen |first2=Tien-Fu |date=1991-01-01 |title=An Effective On-chip Preloading Scheme to Reduce Data Access Penalty |conference=1991 ACM/IEEE Conference on Supercomputing |___location=Albuquerque, New Mexico, USA}}</ref>
* '''Software based prefetching''' is typically accomplished by having the compiler analyze the code and insert additional "prefetch" instructions in the program during compilation itself.<ref>{{Cite thesis|last=Porterfield|first=Allan Kennedy|title=Software methods for improvement of cache performance on supercomputer applications|date=1989-01-01|publisher=Rice University|hdl=1911/19069}}</ref>
=== Stream buffers ===
* Stream buffers were developed based on the concept of "one block lookahead (OBL) scheme" proposed by [[Alan Jay Smith]].<ref name=":3" />
* Stream [[Data buffer|buffers]] are one of the most common hardware based prefetching techniques in use. This technique was originally proposed by [[Norman Jouppi]] in 1990 and has since been improved upon many times.<ref name=":1">{{cite conference |last=Jouppi |first=Norman P. |date=1990 |title=Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers |conference=17th Annual International Symposium on Computer Architecture}}</ref>
[[File:CachePrefetching_StreamBuffers.png|center|thumb|A typical stream buffer setup as originally proposed in 1990<ref name=":1" />]]
* Whenever the prefetch mechanism detects a miss on a memory block, say A, it allocates a stream to begin prefetching successive blocks from the missed block onward. If the stream buffer can hold 4 blocks, then the processor would prefetch A+1, A+2, A+3 and A+4 and hold them in the allocated stream buffer. If the processor next consumes A+1, it is moved "up" from the stream buffer into the processor's cache, and the buffer prefetches the next sequential block (A+5).
* This mechanism can be scaled up by adding multiple such 'stream buffers' - each of which would maintain a separate prefetch stream.<ref>{{Cite conference |last1=Ishii |first1=Yasuo |last2=Inaba |first2=Mary |last3=Hiraki |first3=Kei |date=2009-06-08 |title=Access map pattern matching for data cache prefetch |url=https://doi.org/10.1145/1542275.1542349 |conference=23rd International Conference on Supercomputing}}</ref>
* The ideal depth of the stream buffer is something that is subject to experimentation against various benchmarks<ref name=":1" /> and depends on the rest of the [[microarchitecture]] involved.<ref>{{Cite conference |last1=Srinath |first1=Santhosh |last2=Mutlu |first2=Onur |last3=Kim |first3=Hyesoon |author3-link=Hyesoon Kim|last4=Patt |first4=Yale N.|author4-link=Yale Patt |date=February 2007 |title=Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers |conference=2007 IEEE 13th International Symposium on High Performance Computer Architecture}}</ref>
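The stream buffer behaviour described above can be sketched as a tiny simulation of a single buffer. This is a hypothetical model for illustration only, not any particular hardware design; the buffer depth of 4 matches the example above.

<syntaxhighlight lang="c">
#include <stdio.h>

#define DEPTH 4  /* the stream buffer holds 4 blocks, as in the example above */

/* Minimal model of one stream buffer: a miss to block A (re)allocates it
 * with A+1 .. A+DEPTH; a hit on its head pops that block toward the cache
 * and prefetches one more sequential block at the tail. */
typedef struct {
    long blocks[DEPTH];
    int valid;
} StreamBuffer;

static void sb_alloc(StreamBuffer *sb, long miss_block) {
    for (int i = 0; i < DEPTH; i++)
        sb->blocks[i] = miss_block + 1 + i;   /* prefetch the successors */
    sb->valid = 1;
}

/* Returns 1 if the access hits the head of the stream buffer. */
static int sb_access(StreamBuffer *sb, long block) {
    if (sb->valid && sb->blocks[0] == block) {
        /* shift the buffer up and prefetch the next sequential block */
        for (int i = 0; i < DEPTH - 1; i++)
            sb->blocks[i] = sb->blocks[i + 1];
        sb->blocks[DEPTH - 1] = sb->blocks[DEPTH - 2] + 1;
        return 1;
    }
    sb_alloc(sb, block);   /* miss: restart the stream from this block */
    return 0;
}

int main(void) {
    StreamBuffer sb = {{0, 0, 0, 0}, 0};
    long trace[] = {100, 101, 102, 103, 104, 105};
    int hits = 0;
    for (int i = 0; i < 6; i++)
        hits += sb_access(&sb, trace[i]);
    printf("stream buffer hits: %d of 6\n", hits);
    return 0;
}
</syntaxhighlight>

On this sequential trace, the first access to block 100 misses and allocates the stream; the five accesses that follow all hit the buffer head.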
=== Strided prefetching ===
==== Regular strides ====
In this pattern, consecutive memory accesses are made to blocks that are <math>s</math> addresses apart.<ref name=":2" /><ref>{{Cite conference |last1=Kondguli |first1=Sushant |last2=Huang |first2=Michael |date=November 2017 |title=T2: A Highly Accurate and Energy Efficient Stride Prefetcher |conference=2017 IEEE International Conference on Computer Design (ICCD)}}</ref>
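Stride detection can be sketched with a single table entry that remembers the last address and the last delta, and predicts once the same delta repeats. This is a simplified, hypothetical version of the reference-prediction-table idea, not a real prefetcher design:

<syntaxhighlight lang="c">
#include <stdio.h>

/* One entry of a toy stride detector: after two consecutive accesses with
 * the same delta, predict the next address as last + stride. */
typedef struct {
    long last_addr;
    long stride;
    int confirmed;   /* the stride was seen at least twice in a row */
} StrideEntry;

/* Feed one address; returns the predicted prefetch address, or -1. */
static long stride_access(StrideEntry *e, long addr) {
    long delta = addr - e->last_addr;
    e->confirmed = (delta == e->stride);
    e->stride = delta;
    e->last_addr = addr;
    return e->confirmed ? addr + e->stride : -1;
}

int main(void) {
    StrideEntry e = {0, 0, 0};
    long addrs[] = {1000, 1008, 1016, 1024};  /* stride s = 8 */
    long pred = -1;
    for (int i = 0; i < 4; i++)
        pred = stride_access(&e, addrs[i]);
    printf("predicted next address: %ld\n", pred);
    return 0;
}
</syntaxhighlight>

After the stride of 8 is confirmed, each access yields a prediction one stride ahead; the final access at 1024 predicts 1032.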
==== Irregular spatial strides ====
In this case, the delta between the addresses of consecutive memory accesses is variable but still follows a pattern. Some prefetcher designs exploit this property to predict and prefetch for future accesses.
==== Temporal prefetching ====
This class of prefetchers looks for memory access streams that repeat over time; for example, in the miss stream A, B, C, D, A, B, C, E, A, B, C, ..., the sub-stream A, B, C recurs, so once A misses again its recorded successors can be prefetched.<ref>{{Cite conference |last1=Joseph |first1=Doug |last2=Grunwald |first2=Dirk |date=1997-05-01 |title=Prefetching using Markov predictors |url=https://doi.org/10.1145/264107.264207 |conference=24th Annual International Symposium on Computer Architecture}}</ref>
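A minimal sketch of this idea records, for each miss address, the address that missed immediately after it, and predicts that successor when the same miss recurs. This toy model tracks a single successor per address and is purely illustrative:

<syntaxhighlight lang="c">
#include <stdio.h>

#define TABLE 16

/* Toy temporal (Markov-style) predictor: keys[i] missed once, and succ[i]
 * is the address that missed right after it. */
static long keys[TABLE], succ[TABLE];
static int entries = 0;
static long prev_miss = -1;

/* Record one miss; returns the predicted next miss address, or -1. */
static long temporal_miss(long addr) {
    long prediction = -1;
    for (int i = 0; i < entries; i++)
        if (keys[i] == addr) { prediction = succ[i]; break; }
    /* remember addr as the successor of the previous miss */
    if (prev_miss >= 0) {
        int i;
        for (i = 0; i < entries; i++)
            if (keys[i] == prev_miss) { succ[i] = addr; break; }
        if (i == entries && entries < TABLE) {
            keys[entries] = prev_miss;
            succ[entries] = addr;
            entries++;
        }
    }
    prev_miss = addr;
    return prediction;
}

int main(void) {
    /* the miss stream 10, 20, 30 repeats; the second time each address
     * misses, its recorded successor is available for prefetching */
    long stream[] = {10, 20, 30, 10, 20, 30};
    int predicted = 0;
    for (int i = 0; i < 6; i++)
        if (temporal_miss(stream[i]) >= 0)
            predicted++;
    printf("misses with a prediction ready: %d of 6\n", predicted);
    return 0;
}
</syntaxhighlight>

During the first pass over the stream the table is only being trained; during the second pass every miss finds a recorded successor.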
=== Collaborative prefetching ===
Computer applications generate a variety of access patterns. The processor and memory subsystem architectures used to execute these applications further influence the memory access patterns they generate. Hence, the effectiveness and efficiency of prefetching schemes often depend on the application and the architectures used to execute them.<ref>{{Cite journal |last1=Kim |first1=Jinchun |last2=Teran |first2=Elvira |last3=Gratz |first3=Paul V. |last4=Jiménez |first4=Daniel A. |last5=Pugsley |first5=Seth H. |last6=Wilkerson |first6=Chris |date=2017-05-12 |title=Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy}}</ref> Collaborative prefetching schemes address this by combining multiple prefetching mechanisms, or by sharing prediction information across cache levels, and selecting among them based on observed behavior.
== Methods of software prefetching ==
Consider the following loop:<syntaxhighlight lang="c">
for (int i = 0; i < 1024; i++) {
    array1[i] = 2 * array1[i];
}
</syntaxhighlight>At each iteration, the i<sup>th</sup> element of the array "array1" is accessed. Therefore, the compiler can insert a prefetch instruction for the element that will be needed <math>k</math> iterations later:<syntaxhighlight lang="c">
for (int i=0; i<1024; i++) {
    prefetch(array1[i + k]);
array1[i] = 2 * array1[i];
}
</syntaxhighlight>Here, the prefetch stride, <math>k</math>, depends on two factors: the cache miss penalty and the time it takes to execute a single iteration of the '''''for''''' loop. For instance, if one iteration of the loop takes 7 cycles to execute, and the cache miss penalty is 49 cycles, then the program should use <math>k = 49/7 = 7</math>, prefetching seven elements ahead so that the data arrives by the time it is needed.
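In practice, GCC and Clang expose software prefetching through the <code>__builtin_prefetch</code> intrinsic, which compiles to the target's prefetch instruction (or to nothing if the target lacks one). A sketch of the loop above with an explicit prefetch distance of 7; the loop bounds and the distance are the example's assumptions, not fixed values:

<syntaxhighlight lang="c">
#include <stdio.h>

#define N 1024
#define K 7   /* prefetch distance: miss penalty (49) / cycles per iteration (7) */

int main(void) {
    static int array1[N];
    for (int i = 0; i < N; i++)
        array1[i] = i;

    for (int i = 0; i < N; i++) {
        if (i + K < N)
            /* rw = 1: prefetch for a write; locality = 3: keep in all levels */
            __builtin_prefetch(&array1[i + K], 1, 3);
        array1[i] = 2 * array1[i];
    }

    printf("array1[1023] = %d\n", array1[1023]);
    return 0;
}
</syntaxhighlight>

The prefetch is a hint only: it never changes program results, so the guard against running past the end of the array is strictly optional, but it avoids issuing useless prefetches.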
== Comparison of hardware and software prefetching ==
* While software prefetching requires programmer or [[compiler]] intervention, hardware prefetching requires special hardware mechanisms.<ref name=":2" />
* Software prefetching works well only with loops where there is regular array access as the programmer has to hand code the prefetch instructions, whereas hardware prefetchers work dynamically based on the program's behavior at [[Run time (program lifecycle phase)|runtime]].<ref name=":2" />
* Hardware prefetching also has less CPU overhead when compared to software prefetching.<ref>{{Cite conference |last1=Callahan |first1=David |last2=Kennedy |first2=Ken |last3=Porterfield |first3=Allan |date=1991-01-01 |title=Software Prefetching |conference=Fourth International Conference on Architectural Support for Programming Languages and Operating Systems |___location=Santa Clara, California, USA}}</ref><ref>{{Cite journal |last1=Lee |first1=Jaekyu |last2=Kim |first2=Hyesoon |last3=Vuduc |first3=Richard |title=When Prefetching Works, When It Doesn't, and Why |journal=ACM Transactions on Architecture and Code Optimization |date=2012 |volume=9 |issue=1 |pages=1–29}}</ref>
== Metrics of cache prefetching ==
There are three main metrics to judge cache prefetching:<ref name=":2">{{Cite book |last=Solihin |first=Yan |title=Fundamentals of parallel multicore architecture |date=2016 |publisher=Chapman and Hall/CRC}}</ref>
=== Coverage ===
<math>Coverage = \frac{\text{Cache Misses eliminated by Prefetching}}{\text{Total Cache Misses}}</math>,
where <math>\text{Total Cache Misses} = (\text{Cache Misses eliminated by Prefetching}) + (\text{Cache Misses not eliminated by Prefetching})</math>.
=== Accuracy ===
<math>\text{Prefetch Accuracy} = \frac{\text{Cache Misses eliminated by prefetching}}{(\text{Useless Cache Prefetches}) + (\text{Cache Misses eliminated by prefetching})}</math>
While it appears that having perfect accuracy might imply that there are no misses, this is not the case. The prefetches themselves might result in new misses if the prefetched blocks are placed directly into the cache. Although these may be a small fraction of the total number of misses seen without any prefetching, they are a non-zero number of misses.
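The two metrics above can be computed directly from event counts; the counter names below are illustrative, not taken from any particular simulator:

<syntaxhighlight lang="c">
#include <stdio.h>

/* Coverage: fraction of all misses that prefetching eliminated. */
static double coverage(long eliminated, long total_misses) {
    return (double)eliminated / (double)total_misses;
}

/* Accuracy: fraction of issued prefetches that eliminated a miss. */
static double accuracy(long eliminated, long useless_prefetches) {
    return (double)eliminated / (double)(useless_prefetches + eliminated);
}

int main(void) {
    long eliminated = 60;   /* misses removed by prefetching */
    long remaining  = 40;   /* misses not removed by prefetching */
    long useless    = 20;   /* prefetched blocks never referenced */
    long total = eliminated + remaining;
    printf("coverage = %.2f, accuracy = %.2f\n",
           coverage(eliminated, total), accuracy(eliminated, useless));
    return 0;
}
</syntaxhighlight>

With these example counts the prefetcher eliminates 60 of 100 misses (coverage 0.60), and 60 of its 80 prefetches are useful (accuracy 0.75).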
=== Timeliness ===
The qualitative definition of timeliness is how early a block is prefetched versus when it is actually referenced. An example to further explain timeliness is as follows:

Consider a for loop where each iteration takes 3 cycles to execute and the 'prefetch' operation takes 12 cycles. This implies that for the prefetched data to be useful, the prefetch must be issued <math>12/3 = 4</math> iterations prior to its use.
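This arithmetic generalizes to a simple rule: the prefetch distance is the prefetch latency divided by the time per iteration, rounded up so the data always arrives in time. A small sketch:

<syntaxhighlight lang="c">
#include <stdio.h>

/* Iterations ahead at which a prefetch must be issued so that its
 * latency is fully hidden by loop execution (rounded up). */
static int prefetch_distance(int prefetch_latency_cycles, int cycles_per_iter) {
    return (prefetch_latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

int main(void) {
    /* the example above: 12-cycle prefetch, 3 cycles per iteration */
    printf("distance = %d iterations\n", prefetch_distance(12, 3));
    return 0;
}
</syntaxhighlight>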
== See also ==