Cache prefetching: Difference between revisions

Content deleted Content added
Citation bot (talk | contribs)
Alter: journal, title. Add: pages, chapter, citeseerx, isbn, s2cid, authors 1-1. Removed parameters. Formatted dashes. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Headbomb | #UCB_toolbar
Use {{cite conference}} for conference papers.
Line 18:
=== Stream buffers ===
* Stream buffers were developed based on the concept of "one block lookahead (OBL) scheme" proposed by [[Alan Jay Smith]].<ref name=":3" />
* Stream [[Data buffer|buffers]] are one of the most common hardware based prefetching techniques in use. This technique was originally proposed by [[Norman Jouppi]] in 1990<ref name=":1">{{cite conference | last=Jouppi | first=Norman P. | book-title=Proceedings of the 17th annual international symposium on Computer Architecture - ISCA '90 | chaptertitle=Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers | publisher=ACM Press | ___location=New York, New York, USA | year=1990 | pages=364–373 | isbn=0-89791-366-3 | doi=10.1145/325164.325162 |citeseerx=10.1.1.37.6114}}</ref> and many variations of this method have been developed since.<ref>{{Cite journal|last1=Chen|first1=Tien-Fu|last2=Baer|first2=Jean-Loup|s2cid=1450745|date=1995-05-01|title=Effective hardware-based data prefetching for high-performance processors|journal=IEEE Transactions on Computers|volume=44|issue=5|pages=609–623|doi=10.1109/12.381947|issn=0018-9340}}</ref><ref>{{Cite conference|last1=Palacharla|first1=S.|last2=Kessler|first2=R. E.|date=1994-01-01|title=Evaluating Stream Buffers As a Secondary Cache Replacement|conference=21st Annual International Symposium on Computer Architecture|___location=Chicago, IL, USA|publisher=IEEE Computer Society Press|pages=24–33|doi= 10.1109/ISCA.1994.288164|isbn=978-0818655104|citeseerx=10.1.1.92.3031}}</ref><ref>{{cite journal| last1=Grannaes | first1=Marius | last2=Jahre | first2=Magnus | last3=Natvig | first3=Lasse | title=Storage Efficient Hardware Prefetching using Delta-Correlating Prediction Tables |citeseerx=10.1.1.229.3483 |journal=Journal of Instruction-Level Parallelism |issue=13 |year=2011 |pages=1–16}}</ref> The basic idea is that the [[cache miss]] address (and <math>k</math> subsequent addresses) are fetched into a separate buffer of depth <math>k</math>. This buffer is called a stream buffer and is separate from cache. The processor then consumes data/instructions from the stream buffer if the address associated with the prefetched blocks match the requested address generated by the program executing on the processor. The figure below illustrates this setup:
[[File:CachePrefetching_StreamBuffers.png|center|<ref name=":1"/> A typical stream buffer setup as originally proposed by Norman Jouppi in 1990|alt=A typical stream buffer setup as originally proposed|thumb|400x400px]]
* Whenever the prefetch mechanism detects a miss on a memory block, say A, it allocates a stream to begin prefetching successive blocks from the missed block onward. If the stream buffer can hold 4 blocks, then we would prefetch A+1, A+2, A+3, A+4 and hold those in the allocated stream buffer. If the processor consumes A+1 next, then it shall be moved "up" from the stream buffer to the processor's cache. The first entry of the stream buffer would now be A+2 and so on. This pattern of prefetching successive blocks is called '''Sequential Prefetching'''. It is mainly used when contiguous locations are to be prefetched. For example, it is used when prefetching instructions.
* This mechanism can be scaled up by adding multiple such 'stream buffers' - each of which would maintain a separate prefetch stream.<ref>{{Cite journalconference |last1=Ishii |first1=Yasuo |last2=Inaba |first2=Mary |last3=Hiraki |first3=Kei |date=2009-06-08 |title=Access map pattern matching for data cache prefetch |url=https://doi.org/10.1145/1542275.1542349 |journalbook-title=Proceedings of the 23rd International Conference on Supercomputing |seriesconference=ICS '09 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=499–500 |doi=10.1145/1542275.1542349 |isbn=978-1-60558-498-0|s2cid=37841036 }}</ref> For each new miss, there would be a new stream buffer allocated and it would operate in a similar way as described above.
* The ideal depth of the stream buffer is something that is subject to experimentation against various benchmarks<ref name=":1" /> and depends on the rest of the [[microarchitecture]] involved.<ref>{{Cite journalconference |last1=Srinath |first1=Santhosh |last2=Mutlu |first2=Onur |last3=Kim |first3=Hyesoon |last4=Patt |first4=Yale N. |date=February 2007 |title=Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers |url=https://ieeexplore.ieee.org/document/4147648 |journalconference=2007 IEEE 13th International Symposium on High Performance Computer Architecture |pages=63–74 |doi=10.1109/HPCA.2007.346185|isbn=978-1-4244-0804-7 |s2cid=6909725 }}</ref>
 
=== Strided prefetching ===
Line 28:
 
==== Regular strides ====
In this pattern, consecutive memory accesses are made to blocks that are <math>s</math> addresses apart.<ref name=":2" /><ref>{{Cite journalconference |last1=Kondguli |first1=Sushant |last2=Huang |first2=Michael |date=November 2017 |title=T2: A Highly Accurate and Energy Efficient Stride Prefetcher |url=https://ieeexplore.ieee.org/document/8119237 |journalconference=2017 IEEE International Conference on Computer Design (ICCD) |pages=373–376 |doi=10.1109/ICCD.2017.64|isbn=978-1-5386-2254-4 |s2cid=11055312 }}</ref> In this case, the prefetcher calculates the <math>s</math> and uses it to compute the memory address for prefetching. Eg: If the <math>s</math> is 4, the address to be prefetched would A+4.
 
==== Irregular strides ====
In this case, the delta between the addresses of consecutive memory accesses is variable but still follows a pattern. Some prefetchers designs<ref>{{Citation |last1=Grannaes |first1=Marius |title=Storage Efficient Hardware Prefetching using Delta-Correlating Prediction Tables |url=https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.229.3483 |access-date=2022-03-16 |last2=Jahre |first2=Magnus |last3=Natvig |first3=Lasse|citeseerx=10.1.1.229.3483 }}</ref><ref>{{Cite journalconference |last1=Shevgoor |first1=Manjunath |last2=Koladiya |first2=Sahil |last3=Balasubramonian |first3=Rajeev |last4=Wilkerson |first4=Chris |last5=Pugsley |first5=Seth H |last6=Chishti |first6=Zeshan |date=December 2015 |title=Efficiently prefetching complex address patterns |url=https://ieeexplore.ieee.org/document/7856594 |journalconference=2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) |pages=141–152 |doi=10.1145/2830772.2830793|isbn=9781450340342 |s2cid=15294463 }}</ref><ref>{{Cite journalconference |last1=Kim |first1=Jinchun |last2=Pugsley |first2=Seth H. |last3=Gratz |first3=Paul V. |last4=Reddy |first4=A.L. Narasimha |last5=Wilkerson |first5=Chris |last6=Chishti |first6=Zeshan |date=October 2016 |title=Path confidence based lookahead prefetching |url=https://ieeexplore.ieee.org/document/7783763 |journalconference=2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) |pages=1–12 |doi=10.1109/MICRO.2016.7783763|isbn=978-1-5090-3508-3 |s2cid=1097472 }}</ref> exploit this property to predict and prefetch for future accesses.
 
=== Temporal prefetching ===
This class of prefetchers look for memory access streams that repeat over time.<ref>{{Cite journalconference |last1=Joseph |first1=Doug |last2=Grunwald |first2=Dirk |date=1997-05-01 |title=Prefetching using Markov predictors |url=https://doi.org/10.1145/264107.264207 |journalbook-title=Proceedings of the 24th Annual International Symposium on Computer Architecture |series=ISCA '97 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=252–263 |doi=10.1145/264107.264207 |isbn=978-0-89791-901-2|s2cid=434419 }}</ref><ref>{{Cite journalconference |last1=Collins |first1=J. |last2=Sair |first2=S. |last3=Calder |first3=B. |last4=Tullsen |first4=D.M. |date=November 2002 |title=Pointer cache assisted prefetching |url=https://ieeexplore.ieee.org/abstract/document/1176239?casa_token=upVmgMfy6mkAAAAA:SdJqjEeVmXED1yvQV_26Cp2gGHNcUNXHqpmodWWWapYvYLandAIY2DxTiMKRF-sP7OJVP6qtwKc |journalconference=35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings. |pages=62–73 |doi=10.1109/MICRO.2002.1176239|isbn=0-7695-1859-1 |s2cid=5608519 }}</ref> Eg. In this stream of memory accesses: N, A, B, C, E, G, H, A, B, C, I, J, K, A, B, C, L, M, N, O, A, B, C, ....; the stream A,B,C is repeating over time. Other design variation have tried to provide more efficient, performant implementations.<ref>{{Cite journalconference |last1=Jain |first1=Akanksha |last2=Lin |first2=Calvin |date=2013-12-07 |title=Linearizing irregular memory accesses for improved correlated prefetching |url=https://doi.org/10.1145/2540708.2540730 |journalbook-title=Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture |seriesconference=MICRO-46 |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=247–259 |doi=10.1145/2540708.2540730 |isbn=978-1-4503-2638-4|s2cid=196831 }}</ref><ref>{{Cite web |title=Making Temporal Prefetchers Practical: The MISB Prefetcher - Research Articles - Arm Research - Arm Community |url=https://community.arm.com/arm-research/b/articles/posts/making-temporal-prefetchers-practical--the-misb-prefetcher |access-date=2022-03-16 |website=community.arm.com |language=en}}</ref>
 
=== Collaborative prefetching ===
Computer applications generate a variety of access patterns. The processor and memory subsystem architectures used to execute these applications further disambiguate the memory access patterns they generate. Hence, the effectiveness and efficiency of prefetching schemes often depend the application and the architectures used to execute them.<ref>{{Cite journal |last1=Kim |first1=Jinchun |last2=Teran |first2=Elvira |last3=Gratz |first3=Paul V. |last4=Jiménez |first4=Daniel A. |last5=Pugsley |first5=Seth H. |last6=Wilkerson |first6=Chris |date=2017-05-12 |title=Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy |url=https://dl.acm.org/doi/10.1145/3093336.3037701 |journal=ACM SIGPLAN Notices |language=en |volume=52 |issue=4 |pages=737–749 |doi=10.1145/3093336.3037701 |issn=0362-1340}}</ref> Recent research<ref>{{Cite journalconference |last1=Kondguli |first1=Sushant |last2=Huang |first2=Michael |date=2018-06-02 |title=Division of labor: a more effective approach to prefetching |url=https://doi.org/10.1109/ISCA.2018.00018 |journalbook-title=Proceedings of the 45th Annual International Symposium on Computer Architecture |seriesconference=ISCA '18 |___location=Los Angeles, California |publisher=IEEE Press |pages=83–95 |doi=10.1109/ISCA.2018.00018 |isbn=978-1-5386-5984-7|s2cid=50777324 }}</ref><ref>{{Cite journalconference |last1=Pakalapati |first1=Samuel |last2=Panda |first2=Biswabandan |date=May 2020 |title=Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Spatial Hardware Prefetching |url=https://ieeexplore.ieee.org/document/9138971 |journalconference=2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) |pages=118–131 |doi=10.1109/ISCA45697.2020.00021|isbn=978-1-7281-4661-4 |s2cid=218683672 }}</ref> has focussed on building collaborative mechanisms to synergistically use multiple prefetching schemes for better prefetching coverage and accuracy.
 
== Methods of software prefetching ==