'''Cache hierarchy''', or '''multi-level cache''', is a memory architecture that uses a hierarchy of memory stores based on varying access speeds to cache data. Highly requested data is cached in high-speed access memory stores, allowing swifter access by [[central processing unit]] (CPU) cores.
Cache hierarchy is a form and part of [[memory hierarchy]] and can be considered a form of [[tiered storage]].<ref name="CA:QA">{{cite book |last1=Hennessy |first1=John L |last2=Patterson |first2=David A |last3=Asanović |first3=Krste |author3-link=Krste Asanović |last4=Bakos |first4=Jason D |last5=Colwell |first5=Robert P |last6=Bhattacharjee |first6=Abhishek |last7=Conte |first7=Thomas M |last8=Duato |first8=José |last9=Franklin |first9=Diana |last10=Goldberg |first10=David |last11=Jouppi |first11=Norman P |last12=Li |first12=Sheng |last13=Muralimanohar |first13=Naveen |last14=Peterson |first14=Gregory D |last15=Pinkston |first15=Timothy Mark |last16=Ranganathan |first16=Prakash |last17=Wood |first17=David Allen |last18=Young |first18=Clifford |last19=Zaky |first19=Amr |title=Computer Architecture: a Quantitative Approach |date=2011 |publisher=Elsevier Science |isbn=978-0128119051 |edition= Sixth |language=en|oclc=983459758 }}</ref> This design was intended to allow CPU cores to process faster despite the [[CAS latency|memory latency]] of [[computer data storage|main memory]] access. Accessing main memory can act as a bottleneck for [[computer performance|CPU core performance]] as the CPU waits for data, while making all of main memory high-speed may be prohibitively expensive. High-speed caches are a compromise allowing high-speed access to the data most-used by the CPU, permitting a faster [[clock rate|CPU clock]].<ref>{{Cite web|url=http://gec.di.uminho.pt/discip/minf/ac0102/0945CacheLevel.pdf|title=Cache: Why Level It}}</ref>
[[File:Cache Organization.png|thumb|right|429x429px|Generic multi-level cache organization|alt=Process architecture diagram showing four independent processors each linked through cache systems to main memory and input-output system]]
== Background ==
In the history of computer and electronic chip development, there was a period when increases in CPU speed outpaced the improvements in memory access speed.<ref>Ronald D. Miller; Lars I. Eriksson; Lee A Fleisher, 2014. Miller's Anesthesia E-Book. Elsevier Health Sciences. p. 75. {{ISBN|978-0-323-28011-2}}.</ref> The gap between the speed of CPUs and memory meant that the CPU would often be idle.<ref>Albert Y. Zomaya, 2006. Handbook of Nature-Inspired and Innovative Computing: Integrating Classical Models with Emerging Technologies. Springer Science & Business Media. p. 298. {{ISBN|978-0-387-40532-2}}.</ref> CPUs were increasingly capable of running and executing larger amounts of instructions in a given time, but the time needed to access data from main memory prevented programs from fully benefiting from this capability.<ref>Richard C. Dorf, 2018. Sensors, Nanoscience, Biomedical Engineering, and Instruments: Sensors Nanoscience Biomedical Engineering. CRC Press. p. 4. {{ISBN|978-1-4200-0316-1}}.</ref> This issue motivated the creation of memory models with higher access rates in order to realize the potential of faster processors.<ref>David A. Patterson; John L. Hennessy, 2004. Computer Organization and Design: The Hardware/Software Interface, Third Edition. Elsevier. p. 552. {{ISBN|978-0-08-050257-1}}.</ref>
This resulted in the concept of [[cache memory]], first proposed by [[Maurice Wilkes]], a British computer scientist at the University of Cambridge in 1965. He called such memory models "slave memory".<ref>{{Cite news|url=https://www.britannica.com/biography/Maurice-Vincent-Wilkes|title=Sir Maurice Vincent Wilkes {{!}} British computer scientist|newspaper=Encyclopædia Britannica|access-date=2016-12-11}}</ref> Between roughly 1970 and 1990, papers and articles by [[Anant Agarwal]], [[Alan Jay Smith]], [[Mark D. Hill]], Thomas R. Puzak, and others discussed better cache memory designs. The first cache memory models were implemented at the time, but even as researchers were investigating and proposing better designs, the need for faster memory models continued. This need resulted from the fact that although early cache models improved data access latency, with respect to cost and technical limitations it was not feasible for a computer system's cache to approach the size of main memory. From 1990 onward, ideas such as adding another cache level (second-level) as a backup for the first-level cache were proposed. [[Jean-Loup Baer]], Wen-Hann Wang, Andrew W. Wilson, and others have conducted research on this model. When several simulations and implementations demonstrated the advantages of two-level cache models, the concept of multi-level caches caught on as a new and generally better model of cache memories. Since 2000, multi-level cache models have received widespread attention and are currently implemented in many systems, such as the three-level caches that are present in Intel's Core i7 products.<ref>{{Cite news|url=https://www.edn.com/memory-hierarchy-design-part-6-the-intel-core-i7-fallacies-and-pitfalls|title=Memory hierarchy design, part 6: The Intel Core i7, fallacies and pitfalls|work=EDN}}</ref>
== Multi-level cache ==
: <math> \text{AAT} = \text{hit time}+((\text{miss rate})\times(\text{miss penalty}))</math>
AAT for main memory is given by hit time<sub>main memory</sub>. AAT for caches can be given by:
: <math> \text{AAT} = \text{hit time}_\text{cache}+((\text{miss rate}_\text{cache})\times(\text{miss penalty}_\text{cache}))</math>
where the miss penalty of a cache is the average access time of the next level below it, or of main memory in the case of the last-level cache.
The hit time for caches is less than the hit time for the main memory, so the AAT for data retrieval is significantly lower when accessing data through the cache rather than main memory.<ref>Cetin Kaya Koc, 2008. Cryptographic Engineering. Springer Science & Business Media. pp. 479–480. {{ISBN|978-0-387-71817-0}}.</ref>
=== Evolution ===
[[File:Cache_Hierarchy_Updated.png|thumb|250x250px|Cache hierarchy for up to L3 level of cache and main memory with on-chip L1|alt=A series of rectangles of increasing proportions representing increasing memory from on-CPU registers and L1 cache through L2, L3, and main memory]]
In the case of a cache miss, such a structure provides no benefit, and the computer must go to main memory to fetch the required data. However, with a multi-level cache, if the computer misses the cache closest to the processor (the level-one cache, or L1), it then searches through the next-closest level(s) of cache and goes to main memory only if every level misses. The lookup order is sketched below.
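A minimal sketch of this fall-through lookup, assuming a toy model in which each cache level is just a small table of cached block addresses (the table contents, sizes, and helper names are illustrative assumptions, not a real hardware interface):

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdio.h>

#define LEVELS 3

/* Toy model: each level holds a handful of cached block addresses;
   -1 marks an empty slot. Smaller levels are closer to the CPU. */
static long level_blocks[LEVELS][4] = {
    { 0x100, 0x140, -1, -1 },          /* L1: smallest, fastest */
    { 0x100, 0x140, 0x180, -1 },       /* L2 */
    { 0x100, 0x140, 0x180, 0x1C0 },    /* L3: largest, slowest  */
};

static bool find_in_level(int level, long block) {
    for (int i = 0; i < 4; i++)
        if (level_blocks[level][i] == block)
            return true;
    return false;
}

/* Probe L1, then L2, then L3; only after every level misses does the
   request fall through to main memory. Returns the level that hit,
   or LEVELS to denote main memory. */
int lookup(long block) {
    for (int level = 0; level < LEVELS; level++)
        if (find_in_level(level, block))
            return level;
    return LEVELS; /* missed in every cache: go to main memory */
}

int main(void) {
    printf("0x180 served by: L%d\n", lookup(0x180) + 1); /* L2 here */
    printf("0x200 served by: %s\n",
           lookup(0x200) == LEVELS ? "main memory" : "a cache");
    return 0;
}
</syntaxhighlight>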
=== Performance gains ===
''Example'': main memory = 50 {{abbr|ns|nanoseconds}}, L1 = 1 ns with 10% miss rate, L2 = 5 ns with 1% miss rate, L3 = 10 ns with 0.2% miss rate.
* No cache, AAT = 50 ns
* L1 cache, AAT = 1 ns + (0.1 × 50 ns) = 6 ns
* L1–2 caches, AAT = 1 ns + (0.1 × [5 ns + (0.01 × 50 ns)]) = 1.55 ns
* L1–3 caches, AAT = 1 ns + (0.1 × [5 ns + (0.01 × [10 ns + (0.002 × 50 ns)])]) = 1.5101 ns
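The nested calculation above can be evaluated mechanically by folding the hierarchy from the last level upward. The following C sketch (the function name and array layout are our own illustrative choices, not a standard API) reproduces the example's figures:

<syntaxhighlight lang="c">
#include <stdio.h>

/* AAT(level) = hit time + miss rate × AAT(next level down),
   with main memory as the final miss penalty. */
static double aat(const double hit_ns[], const double miss_rate[],
                  int levels, double memory_ns) {
    double t = memory_ns;                  /* cost once every cache missed */
    for (int i = levels - 1; i >= 0; i--)  /* fold from the last level up  */
        t = hit_ns[i] + miss_rate[i] * t;
    return t;
}

int main(void) {
    /* Values from the example: L1 = 1 ns/10%, L2 = 5 ns/1%, L3 = 10 ns/0.2% */
    const double hit_ns[]    = { 1.0, 5.0, 10.0 };
    const double miss_rate[] = { 0.10, 0.01, 0.002 };

    for (int n = 0; n <= 3; n++)           /* no cache, L1, L1–2, L1–3 */
        printf("%d cache level(s): AAT = %.4f ns\n",
               n, aat(hit_ns, miss_rate, n, 50.0));
    return 0;
}
/* Prints 50.0000, 6.0000, 1.5500 and 1.5101 ns respectively. */
</syntaxhighlight>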
=== Disadvantages ===
== Properties ==
[[File:Separate unified.png|thumb|Cache organization with L1 as separate and L2 as unified|255x255px|alt=three squares showing separated on-CPU L1 caches for instructions and data, an off-chip L2 cache, and main memory]]
=== Banked versus unified ===
In a banked cache, the cache is divided into a cache dedicated to [[Instruction (computer science)|instruction]] storage and a cache dedicated to data. In contrast, a unified cache contains both the instructions and data in the same cache.
Modern processors have split caches, and in systems with multilevel caches, the caches at lower levels (such as L2 and L3) are commonly unified while the level closest to the processor (L1) remains split.
=== Inclusion policies ===
[[File:Inclusivecache.png|thumb|Inclusive cache organization|413x413px|alt=a memory system diagram showing a copy of the L1 within L2 and a copy of the L2 within L3]]
Whether a block present in the upper cache layer can also be present in the lower cache level is governed by the memory system's [[Cache inclusion policy|inclusion policy]], which may be inclusive, exclusive or non-inclusive non-exclusive (NINE).{{cn|date=December 2023}}
With an inclusive policy, all the blocks present in the upper-level cache have to be present in the lower-level cache as well. Each upper-level cache component is a subset of the lower-level cache component. In this case, since there is a duplication of blocks, there is some wastage of memory. However, checking is faster.{{cn|date=December 2023}}
Under an exclusive policy, all the cache hierarchy components are completely exclusive, so that any element in the upper-level cache will not be present in any of the lower cache components. This enables complete usage of the cache memory. However, there is a high memory-access latency.{{cn|date=December 2023}}
The above policies require a set of rules to be followed in order to implement them. If none of these are forced, the resulting inclusion policy is called non-inclusive non-exclusive (NINE). This means that the upper-level cache may or may not be present in the lower-level cache.<ref name=":4">{{Cite book|title=Fundamentals of Parallel Multicore Architecture|last=Solihin|first=Yan|publisher=Chapman and Hall|year=2016|isbn=9781482211184|pages=Chapter 5: Introduction to Memory Hierarchy Organization}}</ref>
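A practical consequence of the inclusive policy is back-invalidation: when the lower-level cache evicts a block, any copy in the upper-level cache must be invalidated as well so that the upper level remains a subset of the lower level. A minimal sketch of this rule, assuming toy fully-associative caches held as small arrays (all structure and function names are illustrative assumptions):

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdio.h>

#define WAYS 4

typedef struct {
    long block[WAYS];   /* addresses of the cached blocks */
    int  count;
} ToyCache;

static bool contains(const ToyCache *c, long b) {
    for (int i = 0; i < c->count; i++)
        if (c->block[i] == b) return true;
    return false;
}

static void evict(ToyCache *c, long b) {
    for (int i = 0; i < c->count; i++)
        if (c->block[i] == b) {
            c->block[i] = c->block[--c->count]; /* swap-remove */
            return;
        }
}

/* Inclusive policy: when the lower-level cache (L2 here) evicts a
   victim block, any copy in the upper-level cache (L1) must be
   invalidated too, keeping L1 a subset of L2. */
void l2_evict_inclusive(ToyCache *l1, ToyCache *l2, long victim) {
    evict(l2, victim);
    if (contains(l1, victim))
        evict(l1, victim);      /* back-invalidation */
}

int main(void) {
    ToyCache l1 = { { 0x10, 0x20 }, 2 };
    ToyCache l2 = { { 0x10, 0x20, 0x30, 0x40 }, 4 };
    l2_evict_inclusive(&l1, &l2, 0x20);
    printf("0x20 in L1 after L2 eviction: %s\n",
           contains(&l1, 0x20) ? "yes" : "no"); /* prints "no" */
    return 0;
}
</syntaxhighlight>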
In the case of a write where the byte is not present in the cache block, the byte may be brought into the cache, as determined by a write allocate or write no-allocate policy.<ref name=":0">{{Cite book|title=Fundamentals of Parallel Computer Architecture|last=Solihin|first=Yan|publisher=Solihin Publishing|year=2009|isbn=9780984163007|pages=Chapter 6: Introduction to Memory Hierarchy Organization}}</ref> Under the write allocate policy, on a write miss the block is fetched from main memory and placed in the cache before writing.<ref>Harvey G. Cragon, 1996. Memory Systems and Pipelined Processors. Jones & Bartlett Learning. p. 47. {{ISBN|978-0-86720-474-2}}.</ref> Under the write no-allocate policy, a block that misses in the cache is written to the lower-level memory hierarchy without being fetched into the cache.<ref>David A. Patterson; John L. Hennessy; 2007. Computer Organization and Design, Revised Printing, Third Edition: The Hardware/Software Interface. Elsevier. p. 484. {{ISBN|978-0-08-055033-6}}.</ref>
The common combinations of these policies are [[Cache (computing)#Writing policies|"write back, write allocate"]] and "write through, write no-allocate".
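The behaviour of the two allocation policies on a write miss can be contrasted in a short sketch. The C fragment below assumes a tiny direct-mapped cache that tracks only tags, with write-through used for simplicity; all names are illustrative rather than any real API:

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdio.h>

#define LINES 8

static bool     valid[LINES];
static unsigned tag[LINES];
static unsigned memory[256];              /* stand-in for the next level */

static bool hit(unsigned addr) {
    return valid[addr % LINES] && tag[addr % LINES] == addr / LINES;
}

static void allocate(unsigned addr) {     /* fetch block into the cache */
    valid[addr % LINES] = true;
    tag[addr % LINES]   = addr / LINES;
}

/* Write allocate: a write miss first brings the block into the cache. */
void write_allocate(unsigned addr, unsigned value) {
    if (!hit(addr))
        allocate(addr);
    memory[addr] = value;   /* written through in this simplified model */
}

/* Write no-allocate: a write miss bypasses the cache entirely. On a
   hit the cached copy would be updated, but this toy model tracks
   tags only, so the interesting case is the unchanged cache state. */
void write_no_allocate(unsigned addr, unsigned value) {
    memory[addr] = value;
}

int main(void) {
    write_allocate(7, 42);
    printf("after write-allocate miss, block cached: %s\n",
           hit(7) ? "yes" : "no");        /* yes */
    write_no_allocate(9, 43);
    printf("after no-allocate miss, block cached: %s\n",
           hit(9) ? "yes" : "no");        /* no */
    return 0;
}
</syntaxhighlight>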
=== Shared versus private ===
[[File:Shared private.png|thumb|Cache organization with L1 private and L2 and L3 shared|alt=Three CPUs each have private on-chip L1 caches but share the off-chip L2, L3, and main memory.]]
A private cache is assigned to one particular core in a processor, and cannot be accessed by any other cores. In some architectures, each core has its own private cache; this creates the risk of duplicate blocks in a system's cache architecture, which results in reduced capacity utilization. However, this type of design choice in a multi-layer cache architecture can also yield lower data-access latency.<ref name=":0" /><ref>{{Cite web|url=https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems|title=Software Techniques for Shared-Cache Multi-Core Systems|date=2018-05-24}}</ref>
A shared cache is a cache which can be accessed by multiple cores.<ref>Akanksha Jain; Calvin Lin; 2019. Cache Replacement Policies. Morgan & Claypool Publishers. p. 45. {{ISBN|978-1-68173-577-1}}.</ref> Since it is shared, each block in the cache is unique, so the cache has a higher hit rate, as there are no duplicate blocks. However, data-access latency can increase as multiple cores try to access the same cache.<ref>David Culler; Jaswinder Pal Singh; Anoop Gupta; 1999. Parallel Computer Architecture: A Hardware/Software Approach. Gulf Professional Publishing. p. 436. {{ISBN|978-1-55860-343-1}}.</ref>
== Recent implementation models ==
[[File:Nehalem EP.png|thumb|387x387px|Cache organization of the Intel Nehalem microarchitecture]]
=== Intel Xeon Emerald Rapids (2024) ===
Up to 64-core:
* L1 cache (instruction and data) – 80 {{abbr|kB|kilobytes}} per core
* L2 cache – 2 {{abbr|MB|megabytes}} per core
* L3 cache – up to 320 MB shared
=== AMD Epyc 9684X (2023) ===
96-core:
* L1 cache (instruction and data) – 64 {{abbr|kB|kilobytes}} per core
* L2 cache – 1 {{abbr|MB|megabytes}} per core
* L3 cache – 1152 {{abbr|MB|megabytes}} shared
=== Apple M1 Ultra (2022) ===
20-core (16 "performance" cores | 4 "efficiency" cores):
* L1 cache – 192 kB instruction & 128 kB data per performance core, 128 kB instruction & 64 kB data per efficiency core
* L2 cache – 52 {{abbr|MB|megabytes}} semi-shared
* L3 cache – 96 {{abbr|MB|megabytes}} shared
=== AMD Zen 4 (2022) ===
6- to 16-core:
* L1 cache – 32 kB data & 32 kB instruction per core
* L2 cache – 1 {{abbr|MB|megabytes}} per core
* L3 cache – 32 to 128 {{abbr|MB|megabytes}} shared
=== AMD Zen 2 (2019) ===
* L1 cache – 32 kB data & 32 kB instruction per core
* L2 cache – 512 kB per core
* L3 cache – 16 MB local per 4-core CCX, 2 CCXs per chiplet, 16-way non-inclusive. Up to 64 MB on desktop CPUs and 256 MB on server CPUs
=== AMD Zen (2017) ===
* L1 cache – 32 kB data & 64 kB instruction per core, 4-way
* L2 cache – 512 kB per core, 4-way inclusive
* L3 cache – 4 MB local & remote per 4-core CCX, 2 CCXs per chiplet, 16-way non-inclusive. Up to 16 MB on desktop CPUs and 64 MB on server CPUs
=== Intel Kaby Lake (2016) ===
* L1 cache (instruction and data) – 64 kB per core
* L2 cache – 256 kB per core
* L3 cache – 2 MB to 8 MB shared<ref name=":3">{{Cite web|url=http://ark.intel.com/|title=Intel Kaby Lake Microarchitecture}}</ref>
=== Intel Broadwell (2014) ===
* L1 cache (instruction and data) – 64 kB per core
* L2 cache – 256 kB per core
* L3 cache – 2 {{Abbr|MB|megabytes}} to 6 MB shared
* L4 cache – 128 MB of eDRAM (Iris Pro models only)<ref name=":2">{{Cite web|url=http://ark.intel.com/|title=Intel Broadwell Microarchitecture}}</ref>
=== IBM POWER7 (2010) ===
* L1 cache (instruction and data) – each 64-banked, each bank has 2rd+1wr ports, 32 kB, 8-way associative, 128B block, write through
* L2 cache – 256 kB, 8-way associative, 128B block, write back, inclusive of L1, 2 ns access latency
* L3 cache – 8 regions of 4 MB (total 32 MB), local region 6 ns, remote 30 ns, each region 8-way associative, DRAM data array, SRAM tag array<ref>{{Cite web|url=https://www-03.ibm.com/systems/power/hardware/795/specs.html|archive-url=https://web.archive.org/web/20100821102938/http://www-03.ibm.com/systems/power/hardware/795/specs.html|url-status=dead|archive-date=August 21, 2010|title=IBM Power7}}</ref>
== See also ==
* CPU microarchitectures mentioned in this article:
** [[POWER7]]
** [[Broadwell (microarchitecture)|Intel Broadwell]]
** [[Zen (microarchitecture)|AMD Zen]]
** [[Apple silicon|Apple Silicon]]
* [[CPU cache]]
* [[Memory hierarchy]]
* [[CAS latency|CAS latency (RAM)]]
* [[Cache (computing)]]
[[Category:Computer hardware]]
[[Category:Computer memory]]