Cache prefetching: Difference between revisions

Revision as of 17:44, 4 January 2022 by Guy Harris (→Comparison of hardware and software prefetching: ce). Latest revision as of 11:15, 25 August 2025 by Citation bot (Added article-number. Removed URL that duplicated identifier. Removed parameters. Some additions/deletions were parameter name changes.)
(35 intermediate revisions by 18 users not shown)
{{short description\|Computer processing technique to boost memory performance}}
'''Cache prefetching''' is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before they are actually needed (hence the term "prefetch").<ref name=":3">{{Cite journal\|last=Smith\|first=Alan Jay\|date=1982-09-01\|title=Cache Memories\|journal=ACM Comput. Surv.\|volume=14\|issue=3\|pages=473–530\|doi=10.1145/356887.356892\|s2cid=6023466 \|issn=0360-0300}}</ref><ref>{{Cite journal \|last1=Li \|first1=Chunlin \|last2=Song \|first2=Mingyang \|last3=Du \|first3=Shaofeng \|last4=Wang \|first4=Xiaohai \|last5=Zhang \|first5=Min \|last6=Luo \|first6=Youlong \|date=2020-09-01 \|title=Adaptive priority-based cache replacement and prediction-based cache prefetching in edge computing environment \|url=https://linkinghub.elsevier.com/retrieve/pii/S1084804520301892 \|journal=Journal of Network and Computer Applications \|language=en \|volume=165 \|article-number=102715 \|doi=10.1016/j.jnca.2020.102715\|s2cid=219506427 \|url-access=subscription }}</ref> Most modern computer processors have fast, local [[CPU cache\|cache memory]] in which prefetched data is held until it is required. The source for the prefetch operation is usually [[Computer data storage#Primary storage\|main memory]]. Because of their design, cache memories are typically much faster to access than main memory, so data that has been prefetched into a cache can usually be accessed far more quickly than data fetched directly from main memory. Prefetching can be done with non-blocking [[cache control instruction]]s.

== Data vs. instruction cache prefetching ==
* '''Data prefetching''' fetches data before it is needed.
Because data access patterns show less regularity than instruction patterns, accurate data prefetching is generally more challenging than instruction prefetching.
* '''Instruction prefetching''' fetches instructions before they need to be executed. The first mainstream microprocessors to use some form of instruction prefetch were the [[Intel]] [[8086]] (six bytes) and the [[Motorola]] [[68000]] (four bytes). In recent years, all high-performance processors have used prefetching techniques.

Cache prefetching can be accomplished either by hardware or by software.<ref name=":2" />
* '''Hardware-based prefetching''' is typically accomplished by having a dedicated hardware mechanism in the processor that watches the stream of instructions or data being requested by the executing program, recognizes the next few elements that the program might need based on this stream, and prefetches them into the processor's cache.<ref>{{Cite conference \|last1=Baer \|first1=Jean-Loup \|last2=Chen \|first2=Tien-Fu \|date=1991-01-01 \|title=An Effective On-chip Preloading Scheme to Reduce Data Access Penalty \|conference=1991 ACM/IEEE Conference on Supercomputing \|location=Albuquerque, New Mexico, USA \|publisher=Association for Computing Machinery \|pages=176–186 \|citeseerx=10.1.1.642.703 \|doi=10.1145/125826.125932 \|isbn=978-0897914598}}</ref>
* '''Software-based prefetching''' is typically accomplished by having the compiler analyze the code and insert additional "prefetch" instructions in the program during compilation.<ref>{{Cite thesis\|last=Porterfield\|first=Allan Kennedy\|title=Software methods for improvement of cache performance on supercomputer applications\|date=1989-01-01\|publisher=Rice University\|hdl=1911/19069}}</ref>

=== Stream buffers ===
* Stream buffers were developed based on the concept of the "one block lookahead (OBL) scheme" proposed by [[Alan Jay Smith]].<ref name=":3" />
* Stream [[Data 
buffer\|buffers]] are one of the most common hardware-based prefetching techniques in use. This technique was originally proposed by [[Norman Jouppi]] in 1990<ref name=":1">{{cite conference \| last=Jouppi \| first=Norman P. \|year=1990 \|title=Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers \|conference=17th annual international symposium on Computer Architecture – ISCA 1990 \|location=New York, New York, USA \|publisher=Association for Computing Machinery Press \|pages=364–373 \|citeseerx=10.1.1.37.6114 \|doi=10.1145/325164.325162 \|isbn=0-89791-366-3 \|book-title=Proceedings of the 17th annual international symposium on Computer Architecture – ISCA 1990}}</ref> and many variations of this method have been developed since.<ref>{{Cite journal\|last1=Chen\|first1=Tien-Fu\|last2=Baer\|first2=Jean-Loup\|s2cid=1450745\|date=1995-05-01\|title=Effective hardware-based data prefetching for high-performance processors\|journal=IEEE Transactions on Computers\|volume=44\|issue=5\|pages=609–623\|doi=10.1109/12.381947\|issn=0018-9340}}</ref><ref>{{Cite conference \|last1=Palacharla \|first1=S. \|last2=Kessler \|first2=R. E. 
\|date=1994-01-01 \|title=Evaluating Stream Buffers As a Secondary Cache Replacement \|conference=21st Annual International Symposium on Computer Architecture \|location=Chicago, Illinois, USA \|publisher=IEEE Computer Society Press \|pages=24–33 \|citeseerx=10.1.1.92.3031 \|doi=10.1109/ISCA.1994.288164 \|isbn=978-0818655104}}</ref><ref name="grannaes">{{cite journal\| last1=Grannaes \| first1=Marius \| last2=Jahre \| first2=Magnus \| last3=Natvig \| first3=Lasse \| title=Storage Efficient Hardware Prefetching using Delta-Correlating Prediction Tables \|citeseerx=10.1.1.229.3483 \|journal=Journal of Instruction-Level Parallelism \|issue=13 \|year=2011 \|pages=1–16}}</ref> The basic idea is that the [[cache miss]] address (and <math>k</math> subsequent addresses) are fetched into a separate buffer of depth <math>k</math>. This buffer is called a stream buffer and is separate from the cache. The processor then consumes data or instructions from the stream buffer if the address associated with the prefetched blocks matches the requested address generated by the program executing on the processor. The figure below illustrates this setup:
[[File:CachePrefetching_StreamBuffers.png\|center\|A typical stream buffer setup as originally proposed by Norman Jouppi in 1990<ref name=":1" />\|alt=A typical stream buffer setup as originally proposed\|thumb\|upright=1.4]]
* Whenever the prefetch mechanism detects a miss on a memory block, say A, it allocates a stream to begin prefetching successive blocks from the missed block onward. If the stream buffer can hold 4 blocks, then the processor would prefetch A+1, A+2, A+3, A+4 and hold those in the allocated stream buffer. If the processor consumes A+1 next, then it is moved "up" from the stream buffer to the processor's cache. The first entry of the stream buffer would now be A+2, and so on. 
This pattern of prefetching successive blocks is called '''Sequential Prefetching'''. It is mainly used when contiguous locations are to be prefetched. For example, it is used when prefetching instructions.
* This mechanism can be scaled up by adding multiple such stream buffers, each of which maintains a separate prefetch stream.<ref>{{Cite conference \|last1=Ishii \|first1=Yasuo \|last2=Inaba \|first2=Mary \|last3=Hiraki \|first3=Kei \|date=2009-06-08 \|title=Access map pattern matching for data cache prefetch \|url=https://doi.org/10.1145/1542275.1542349 \|conference=ICS 2009 \|location=New York, New York, USA \|publisher=Association for Computing Machinery \|pages=499–500 \|doi=10.1145/1542275.1542349 \|isbn=978-1-60558-498-0 \|book-title=Proceedings of the 23rd International Conference on Supercomputing \|s2cid=37841036\|url-access=subscription }}</ref> For each new miss, a new stream buffer is allocated, and it operates in the same way as described above.
* The ideal depth of the stream buffer is subject to experimentation against various benchmarks<ref name=":1" /> and depends on the rest of the [[microarchitecture]] involved.<ref>{{Cite conference \|last1=Srinath \|first1=Santhosh \|last2=Mutlu \|first2=Onur \|last3=Kim \|first3=Hyesoon \|author3-link=Hyesoon Kim\|last4=Patt \|first4=Yale N.\|author4-link=Yale Patt \|date=February 2007 \|title=Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers \|conference=2007 IEEE 13th International Symposium on High Performance Computer Architecture \|pages=63–74 \|doi=10.1109/HPCA.2007.346185\|isbn=978-1-4244-0804-7 \|s2cid=6909725 }}</ref>

=== Strided prefetching ===
This type of prefetching monitors the delta between the addresses of the memory accesses and looks for patterns within it.
==== Regular strides ====
In this pattern, consecutive memory accesses are made to blocks that are <math>s</math> addresses apart.<ref name=":2" /><ref>{{Cite conference \|last1=Kondguli \|first1=Sushant \|last2=Huang \|first2=Michael \|date=November 2017 \|title=T2: A Highly Accurate and Energy Efficient Stride Prefetcher \|conference=2017 IEEE International Conference on Computer Design (ICCD) \|pages=373–376 \|doi=10.1109/ICCD.2017.64\|isbn=978-1-5386-2254-4 \|s2cid=11055312 }}</ref> In this case, the prefetcher calculates <math>s</math> and uses it to compute the memory address for prefetching. For example, if <math>s</math> is 4 and the most recent access was to address A, the address to be prefetched would be A+4.

==== Irregular spatial strides ====
In this case, the delta between the addresses of consecutive memory accesses is variable but still follows a pattern. Some prefetcher designs<ref name="grannaes" /><ref>{{Cite conference \|last1=Shevgoor \|first1=Manjunath \|last2=Koladiya \|first2=Sahil \|last3=Balasubramonian \|first3=Rajeev \|last4=Wilkerson \|first4=Chris \|last5=Pugsley \|first5=Seth H. \|last6=Chishti \|first6=Zeshan \|date=December 2015 \|title=Efficiently prefetching complex address patterns \|url=https://ieeexplore.ieee.org/document/7856594 \|conference=2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) \|pages=141–152 \|doi=10.1145/2830772.2830793 \|isbn=9781450340342 \|s2cid=15294463}}</ref><ref>{{Cite conference \|last1=Kim \|first1=Jinchun \|last2=Pugsley \|first2=Seth H. \|last3=Gratz \|first3=Paul V. \|last4=Reddy \|first4=A.L. Narasimha \|last5=Wilkerson \|first5=Chris \|last6=Chishti \|first6=Zeshan \|date=October 2016 \|title=Path confidence based lookahead prefetching \|conference=2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) \|pages=1–12 \|doi=10.1109/MICRO.2016.7783763\|isbn=978-1-5090-3508-3 \|s2cid=1097472 }}</ref> exploit this property to predict and prefetch for future accesses.
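As a rough illustration of the regular-stride case described earlier in this section, the following C++ sketch tracks, for each load instruction, the last address it accessed and the last observed delta, and proposes a prefetch address once the same non-zero delta has been seen twice in a row. It is a simplified model under stated assumptions and not the design of any cited prefetcher: the class name <code>StridePrefetcher</code>, the per-instruction table keyed by program counter, and the "two matching deltas" confirmation rule are all hypothetical choices made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical per-instruction stride detector (illustrative names only).
class StridePrefetcher {
public:
    // Observe one access made by the load at `pc` to address `addr`.
    // Returns the address to prefetch once the same non-zero stride
    // has been seen on two consecutive accesses; otherwise nothing.
    std::optional<std::uint64_t> observe(std::uint64_t pc, std::uint64_t addr) {
        auto [it, inserted] = table_.try_emplace(pc, Entry{addr, 0, false});
        if (inserted) {
            return std::nullopt;  // first access by this instruction: nothing to compare
        }
        Entry& e = it->second;
        const auto stride = static_cast<std::int64_t>(addr - e.last_addr);
        const bool confirmed = e.has_stride && stride == e.stride && stride != 0;
        // Always remember the most recent address and delta.
        e.last_addr = addr;
        e.stride = stride;
        e.has_stride = true;
        if (!confirmed) {
            return std::nullopt;
        }
        // Regular stride s detected: prefetch the address s beyond the current one.
        return addr + static_cast<std::uint64_t>(stride);
    }

private:
    struct Entry {
        std::uint64_t last_addr;  // last address accessed by this instruction
        std::int64_t stride;      // last observed delta between accesses
        bool has_stride;          // false until a delta has been recorded
    };
    std::unordered_map<std::uint64_t, Entry> table_;
};
```

With accesses to addresses 100, 104 and 108 from the same instruction, the third call proposes address 112, which corresponds to the A+4 example above. An access that breaks the pattern yields no prefetch until a stride is confirmed again.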
==== Irregular temporal prefetching ====
This class of prefetchers looks for memory access streams that repeat over time.<ref>{{Cite conference \|last1=Joseph \|first1=Doug \|last2=Grunwald \|first2=Dirk \|date=1997-05-01 \|title=Prefetching using Markov predictors \|url=https://doi.org/10.1145/264107.264207 \|conference=ISCA 1997 \|series=ISCA 1997 \|location=New York, New York, USA \|publisher=Association for Computing Machinery \|pages=252–263 \|doi=10.1145/264107.264207 \|isbn=978-0-89791-901-2 \|book-title=Proceedings of the 24th Annual International Symposium on Computer Architecture \|s2cid=434419}}</ref><ref>{{Cite conference \|last1=Collins \|first1=J. \|last2=Sair \|first2=S. \|last3=Calder \|first3=B. \|last4=Tullsen \|first4=D.M. \|date=November 2002 \|title=Pointer cache assisted prefetching \|conference=35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings. \|pages=62–73 \|doi=10.1109/MICRO.2002.1176239\|isbn=0-7695-1859-1 \|s2cid=5608519 }}</ref> For example, in the stream of memory accesses N, A, B, C, E, G, H, A, B, C, I, J, K, A, B, C, L, M, N, O, A, B, C, ..., the sub-stream A, B, C repeats over time.
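The idea of recognizing a repeating sub-stream can be sketched with a very small successor table: for every address, remember which address followed it the last time it was seen, and propose that successor when the address recurs. This is only a minimal first-order illustration, assuming one remembered successor per address and unbounded table size; it is not the mechanism of the cited designs, and the name <code>TemporalPrefetcher</code> is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical first-order successor table (illustrative sketch only).
class TemporalPrefetcher {
public:
    // Observe one access. Learns "previous address -> addr", then returns
    // the address that followed `addr` the last time it was seen, if any.
    std::optional<std::uint64_t> observe(std::uint64_t addr) {
        if (has_prev_) {
            successor_[prev_] = addr;  // remember what followed the previous access
        }
        prev_ = addr;
        has_prev_ = true;
        const auto it = successor_.find(addr);
        if (it == successor_.end()) {
            return std::nullopt;  // addr has not been seen with a successor yet
        }
        return it->second;  // candidate address to prefetch
    }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> successor_;
    std::uint64_t prev_ = 0;
    bool has_prev_ = false;
};
```

Fed the start of the example stream (N, A, B, C, E, G, H), the table makes no predictions. When A recurs it proposes B, and when B recurs it proposes C. After C it proposes E, which is wrong for this stream because I follows; this is one reason practical designs track more context than a single successor.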
Other design variations have tried to provide more efficient, performant implementations.<ref>{{Cite conference \|last1=Jain \|first1=Akanksha \|last2=Lin \|first2=Calvin \|date=2013-12-07 \|title=Linearizing irregular memory accesses for improved correlated prefetching \|url=https://doi.org/10.1145/2540708.2540730 \|conference=MICRO-46 \|location=New York, New York, USA \|publisher=Association for Computing Machinery \|pages=247–259 \|doi=10.1145/2540708.2540730 \|isbn=978-1-4503-2638-4 \|book-title=Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture \|s2cid=196831\|url-access=subscription }}</ref><ref>{{Cite web \|title=Making Temporal Prefetchers Practical: The MISB Prefetcher – Research Articles – Arm Research – Arm Community \|url=https://community.arm.com/arm-research/b/articles/posts/making-temporal-prefetchers-practical--the-misb-prefetcher \|access-date=2022-03-16 \|website=community.arm.com \|date=24 June 2019 \|language=en-us}}</ref>

=== Collaborative prefetching ===
Another pattern of prefetching instructions is to prefetch addresses that are <math>s</math> addresses ahead in the sequence. It is mainly used when the consecutive blocks that are to be prefetched are <math>s</math> addresses apart.<ref name=":2" /> This is termed '''Strided Prefetching'''.

Computer applications generate a variety of access patterns. The processor and memory subsystem architectures used to execute these applications further disambiguate the memory access patterns they generate. Hence, the effectiveness and efficiency of prefetching schemes often depend on the application and the architectures used to execute them.<ref>{{Cite journal \|last1=Kim \|first1=Jinchun \|last2=Teran \|first2=Elvira \|last3=Gratz \|first3=Paul V. \|last4=Jiménez \|first4=Daniel A. \|last5=Pugsley \|first5=Seth H. 
\|last6=Wilkerson \|first6=Chris \|date=2017-05-12 \|title=Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy \|journal=ACM SIGPLAN Notices \|language=en \|volume=52 \|issue=4 \|pages=737–749 \|doi=10.1145/3093336.3037701 \|issn=0362-1340\|doi-access=free }}</ref> Recent research<ref>{{Cite conference \|last1=Kondguli \|first1=Sushant \|last2=Huang \|first2=Michael \|date=2018-06-02 \|title=Division of labor: a more effective approach to prefetching \|url=https://doi.org/10.1109/ISCA.2018.00018 \|book-title=Proceedings of the 45th Annual International Symposium on Computer Architecture \|conference=ISCA '18 \|location=Los Angeles, California \|publisher=IEEE Press \|pages=83–95 \|doi=10.1109/ISCA.2018.00018 \|isbn=978-1-5386-5984-7\|s2cid=50777324 \|url-access=subscription }}</ref><ref>{{Cite conference \|last1=Pakalapati \|first1=Samuel \|last2=Panda \|first2=Biswabandan \|date=May 2020 \|title=Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Spatial Hardware Prefetching \|conference=2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) \|pages=118–131 \|doi=10.1109/ISCA45697.2020.00021\|isbn=978-1-7281-4661-4 \|s2cid=218683672 }}</ref> has focused on building collaborative mechanisms that combine multiple prefetching schemes for better prefetching coverage and accuracy.

== Methods of software prefetching ==
<syntaxhighlight lang="c++">
for (int i=0; i<1024; i++) {
    array1[i] = 2 * array1[i];
}
</syntaxhighlight>At each iteration, the i<sup>th</sup> element of the array "array1" is accessed. 
Therefore, the system can prefetch the elements that are going to be accessed in future iterations by inserting a "prefetch" instruction as shown below:<syntaxhighlight lang="c++">
for (int i=0; i<1024; i++) {
    prefetch (array1 [i + k]);
    array1[i] = 2 * array1[i];
}
</syntaxhighlight>Here, the prefetch stride, <math>k</math>, depends on two factors: the cache miss penalty and the time it takes to execute a single iteration of the '''''for''''' loop. For instance, if one iteration of the loop takes 7 cycles to execute and the cache miss penalty is 49 cycles, then <math>k = 49/7 = 7</math>, which means that the system should prefetch 7 elements ahead. In the first iteration, i is 0, so the system prefetches the element at index 7. With this arrangement, the first 7 accesses (i=0->6) will still be misses (under the simplifying assumption that each element of array1 is in a separate cache line of its own).

== Comparison of hardware and software prefetching ==
* While software prefetching requires programmer or [[compiler]] intervention, hardware prefetching requires special hardware mechanisms.<ref name=":2" />
* Software prefetching works well only with loops where there is regular array access, as the programmer has to hand-code the prefetch instructions, whereas hardware prefetchers work dynamically based on the program's behavior at [[Run time (program lifecycle phase)\|runtime]].<ref name=":2" />
* Hardware prefetching also has less CPU overhead when compared to software prefetching.<ref>{{Cite conference \|last1=Callahan \|first1=David \|last2=Kennedy \|first2=Ken \|last3=Porterfield \|first3=Allan \|date=1991-01-01 \|title=Software Prefetching \|conference=Fourth International Conference on Architectural Support for Programming Languages and Operating Systems \|location=Santa Clara, California, USA \|publisher=Association for Computing Machinery \|pages=40–52 \|doi=10.1145/106972.106979 
\|isbn=978-0897913805}}</ref> However, software prefetching can mitigate certain constraints of hardware prefetching, leading to improvements in performance.<ref>{{Citation \| url=https://faculty.cc.gatech.edu/~hyesoon/lee_taco12.pdf\| title=When Prefetching Works, When It Doesn't, and Why\| journal = ACM Trans. Archit. Code Optim.\| doi =10.1145/2133382.2133384\| author = Lee, Jaekyu and Kim, Hyesoon and Vuduc, Richard \| date=2012\| volume=9\| pages=1–29}}</ref>

== Metrics of cache prefetching ==
There are three main metrics to judge cache prefetching:<ref name=":2">{{Cite book \|last=Solihin \|first=Yan \|title=Fundamentals of parallel multicore architecture \|publisher=CRC Press, Taylor & Francis Group \|year=2016 \|isbn=978-1482211184 \|location=Boca Raton, Florida \|pages=163 \|language=en-us}}</ref>

=== Coverage ===
<math>\text{Coverage} = \frac{\text{Cache misses eliminated by prefetching}}{\text{Total cache misses}}</math>,

where <math>\text{Total cache misses} = (\text{Cache misses eliminated by prefetching}) + (\text{Cache misses not eliminated by prefetching})</math>

=== Accuracy ===
<math>\text{Prefetch accuracy} = \frac{\text{Cache misses eliminated by prefetching}}{(\text{Useless cache prefetches}) + (\text{Cache misses eliminated by prefetching})}</math>

While it appears that having perfect accuracy might imply that there are no misses, this is not the case. The prefetches themselves might result in new misses if the prefetched blocks are placed directly into the cache. Although these may be a small fraction of the total number of misses observed without any prefetching, this is a non-zero number of misses.

=== Timeliness ===
The qualitative definition of timeliness is how early a block is prefetched versus when it is actually referenced.
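Timeliness can be made concrete as a prefetch distance: the number of loop iterations by which a prefetch must lead the access it serves. The C++ sketch below computes that distance from a prefetch latency and a per-iteration cost, and applies it to the array-doubling loop used earlier in the article. It rests on several assumptions that are not from the cited sources: the helper names <code>prefetch_distance</code> and <code>scale_with_prefetch</code> are hypothetical, the distance is rounded up when the division is not exact, and the prefetch hint uses the GCC/Clang builtin <code>__builtin_prefetch</code>, which is skipped on other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: how many iterations ahead a prefetch must be issued
// so that the data arrives before it is used. Rounds up (an assumption),
// since a prefetch cannot be issued a fraction of an iteration early.
std::uint64_t prefetch_distance(std::uint64_t prefetch_latency_cycles,
                                std::uint64_t cycles_per_iteration) {
    assert(cycles_per_iteration > 0);
    return (prefetch_latency_cycles + cycles_per_iteration - 1) / cycles_per_iteration;
}

// Doubles every element, hinting the element `k` iterations ahead.
// The hint is advisory only and does not change the result.
void scale_with_prefetch(std::vector<int>& a, std::size_t k) {
    for (std::size_t i = 0; i < a.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + k < a.size()) {
            __builtin_prefetch(&a[i + k]);  // GCC/Clang builtin; omitted elsewhere
        }
#endif
        a[i] = 2 * a[i];
    }
}
```

A latency of 49 cycles with 7-cycle iterations gives a distance of 7, matching the software prefetching example earlier in the article; a latency that is not an exact multiple, such as 50 cycles, rounds up to 8 under the assumption stated above.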
An example to further explain timeliness is as follows: Consider a for loop where each iteration takes 3 cycles to execute and the "prefetch" operation takes 12 cycles. This implies that for the prefetched data to be useful, the system must start the prefetch <math>12/3 = 4</math> iterations prior to its usage to maintain timeliness.

== See also ==