Content deleted Content added
Same for cost. |
→Recent implementation models: copyediting. |
||
(43 intermediate revisions by 33 users not shown) | |||
Line 1:
{{Short description|Memory hierarchy concept applied to CPU caches with multiple levels}}
'''Cache hierarchy,''' or '''multi-level
Cache hierarchy is a form and part of [[memory hierarchy]]
[[File:Cache Organization.png|thumb|right|429x429px|Generic multi-level cache organization|alt=Process architecture diagram showing four independent processors each linked through cache systems to main memory and input-output system
== Background ==
In the history of computer and electronic chip development, there was a period when increases in CPU speed outpaced the improvements in memory access speed.<ref>Ronald D. Miller; Lars I. Eriksson; Lee A Fleisher, 2014. Miller's Anesthesia E-Book. Elsevier Health Sciences. p. 75. {{ISBN|978-0-323-28011-2}}.</ref> The gap between the speed of CPUs and memory meant that the CPU would often be idle.<ref>Albert Y. Zomaya, 2006. Handbook of Nature-Inspired and Innovative Computing: Integrating Classical Models with Emerging Technologies. Springer Science & Business Media. p. 298. {{ISBN|978-0-387-40532-2}}.</ref> CPUs were increasingly capable of running and executing larger amounts of instructions in a given time, but the time needed to access data from main memory prevented programs from fully benefiting from this capability.<ref>Richard C. Dorf, 2018. Sensors, Nanoscience, Biomedical Engineering, and Instruments: Sensors Nanoscience Biomedical Engineering. CRC Press. p. 4. {{ISBN|978-1-4200-0316-1}}.</ref> This issue motivated the creation of memory models with higher access rates in order to realize the potential of faster processors.<ref>David A. Patterson; John L. Hennessy, 2004. Computer Organization and Design: The Hardware/Software Interface, Third Edition. Elsevier. p. 552. {{ISBN|978-0-08-050257-1}}.</ref>
This resulted in the concept of [[cache memory]], first proposed by [[Maurice Wilkes]], a British computer scientist at the University of Cambridge in 1965. He called such memory models "slave memory".<ref>{{Cite news|url=https://www.britannica.com/biography/Maurice-Vincent-Wilkes|title=Sir Maurice Vincent Wilkes {{!}} British computer scientist|newspaper=Encyclopædia Britannica|access-date=2016-12-11}}</ref> Between roughly 1970 and 1990, papers and articles by [[Anant Agarwal]], [[Alan Jay Smith]], [[Mark D. Hill]],
== Multi-level cache ==
Line 14 ⟶ 15:
=== Average access time (AAT) ===
Caches, being small in size, may result in frequent misses – when a search of the cache does not provide the sought-after information – resulting in a call to main memory to fetch data. Hence, the AAT is affected by the miss rate of each structure from which it searches for the data.<ref name=":1">{{Cite book|title=Computer Architecture: A Quantitative Approach|last=Hennessey and Patterson
: <math> \text{AAT} = \text{hit time}+((\text{miss rate})\times(\text{miss penalty}))</math>
AAT for main memory is given by Hit time <sub>main memory</sub>. AAT for caches can be given by:
:
The hit time for caches is less than the hit time for the main memory, so the AAT for data retrieval is significantly lower when accessing data through the cache rather than main memory.<ref>Cetin Kaya Koc, 2008. Cryptographic Engineering. Springer Science & Business Media. pp. 479–480. {{ISBN|978-0-387-71817-0}}.</ref>
Line 28 ⟶ 29:
=== Evolution ===
[[File:Cache_Hierarchy_Updated.png|thumb|250x250px|Cache hierarchy for up to L3 level of cache and main memory with on-chip L1|alt=A series of rectangles of increasing proportions representing increasing memory from on-CPU registers and L1 cache through L2, L3, and main memory
In the case of a cache miss, the purpose of using such a structure will be rendered useless and the computer will have to go to the main memory to fetch the required data. However, with a [[
=== Performance gains ===
With the technology-scaling that allowed memory systems able to be accommodated on a single chip, most modern day processors have up to three or four cache levels.<ref>{{cite web |last1=Gayde |first1=William |title=How CPUs are Designed and Built |url=https://www.techspot.com/article/1821-how-cpus-are-designed-and-built/ |website=Techspot |
''Example'': main memory = 50 {{abbr|ns|nanoseconds}}, L1 = 1 ns with 10% miss rate, L2 = 5 ns
* No cache, AAT = 50 ns
* L1 cache, AAT = 1 ns + (0.1 × 50 ns) = 6 ns
* L1–2 caches, AAT = 1 ns + (0.1 × [5 ns + (0.01 × 50 ns)]) = 1.55 ns
* L1–3 caches, AAT = 1 ns + (0.1 × [5 ns + (0.01 × [10 ns + (0.002 × 50 ns)])]) = 1.
=== Disadvantages ===
* Cache memory comes at an increased [[marginal cost]] than main memory and thus can increase the cost of the overall system.<ref>Vojin G. Oklobdzija, 2017. Digital Design and Fabrication. CRC Press. p. 4. {{ISBN
* Cached data is stored only so long as power is provided to the cache.
* Increased on-chip area required for memory system.<ref>{{Cite web|url=https://www.bottomupcs.com/memory.xhtml|title=Memory Hierarchy
* Benefits may be minimized or eliminated in the case of a large programs with poor [[temporal locality]], which frequently access the main memory.<ref name=":4" />
== Properties ==
[[File:Separate unified.png|thumb|Cache organization with L1 as separate and L2 as unified|255x255px|alt=three squares showing separated on-CPU L1 caches for instructions and data, an off-chip L2 cache, and main memory
=== Banked versus unified ===
In a banked cache, the cache is divided into a cache dedicated to [[
Modern processors have split caches, and in systems with multilevel caches lower level caches may be unified while higher levels split.<ref name="CA:QA" /><ref>Alan Clements, 2013. Computer Organization & Architecture: Themes and Variations. Cengage Learning. p. 588. {{ISBN|1-285-41542-6}}.</ref>
=== Inclusion policies ===
[[File:Inclusivecache.png|thumb|Inclusive cache organization|413x413px|alt=a memory system diagram showing a copy of the L1 within L2 and a copy of the L2 within L3
Whether a block present in the upper cache layer can also be present in the lower cache level is governed by the memory system's [[Cache inclusion policy|inclusion policy]], which may be inclusive, exclusive or non-inclusive non-exclusive (NINE).
With an inclusive policy, all the blocks present in the upper-level cache have to be present in the lower-level cache as well. Each upper-level cache component is a subset of the lower-level cache component. In this case, since there is a duplication of blocks, there is some wastage of memory. However, checking is faster.
Under an exclusive policy, all the cache hierarchy components are completely exclusive, so that any element in the upper-level cache will not be present in any of the lower cache components. This enables complete usage of the cache memory. However, there is a high memory-access latency.<ref>{{Cite web|url=
The above policies require a set of rules to be followed in order to implement them. If none of these are forced, the resulting inclusion policy is called non-inclusive non-exclusive (NINE). This means that the upper-level cache may or may not be present in the lower-level cache.<ref name=":4">{{Cite book|title=Fundamentals of Parallel Multicore Architecture|last=Solihin|first=Yan|publisher=Chapman and Hall|year=2016|isbn=9781482211184
=== Write policies ===
There are two policies which define the way in which a modified cache block will be updated in the main memory: write through and write back.
In the case of write through policy, whenever the value of the cache block changes, it is further modified in the lower-level memory hierarchy as well.<ref>David A. Patterson; John L. Hennessy; 2017. Computer Organization and Design RISC-V Edition: The Hardware Software Interface. Elsevier Science. pp. 386–387. {{ISBN|978-0-12-812276-1}}.</ref> This policy ensures that the data is stored safely as it is written throughout the hierarchy.
However, in the case of the write back policy, the changed cache block will be updated in the lower-level hierarchy only when the cache block is evicted. A "dirty bit" is attached to each cache block and set whenever the cache block is modified.<ref>Stefan Goedecker; Adolfy Hoisie; 2001. Performance Optimization of Numerically Intensive Codes. SIAM. p. 11. {{ISBN|978-0-89871-484-5}}.</ref> During eviction, blocks with a set dirty bit will be written to the lower-level hierarchy. Under this policy, there is a risk for data-loss as the most recently changed copy of a datum is only stored in the cache and therefore some corrective techniques must be observed.
In case of a write where the byte is not present in the cache block, the byte may be brought to the cache as determined by a write allocate or write no-allocate policy.<ref name=":0">{{Cite book|title=Fundamentals of Parallel Computer Architecture|last=Solihin|first=Yan|publisher=Solihin Publishing|year=2009|isbn=9780984163007|pages=Chapter 6: Introduction to Memory Hierarchy Organization}}</ref> Write allocate policy states that in case of a write miss, the block is fetched from the main memory and placed in the cache before writing.<ref>Harvey G. Cragon, 1996. Memory Systems and Pipelined Processors. Jones & Bartlett Learning. p. 47. {{ISBN|978-0-86720-474-2}}.</ref> In the write no-allocate policy, if the block is missed in the cache it will write in the lower-level memory hierarchy without fetching the block into the cache.<ref>David A. Patterson; John L. Hennessy; 2007. Computer Organization and Design, Revised Printing, Third Edition: The Hardware/Software Interface. Elsevier. p. 484. {{ISBN|978-0-08-055033-6}}.</ref>
The common combinations of the policies are [[Cache (computing)#Writing policies|"write
=== Shared versus private ===
[[File:Shared private.png|thumb|Cache organization with L1 private and L2 and L3 shared|alt=Three CPUs each have private on-chip L1 caches but share the off-chip L2, L3, and main memory.]]
A private cache is assigned to one particular core in a processor, and cannot be accessed by any other cores. In some architectures, each core has its own private cache; this creates the risk of duplicate blocks in a system's cache architecture, which results in reduced capacity utilization. However, this type of design choice in a multi-layer cache architecture can also
A shared cache is a cache which can be accessed by multiple cores.<ref>Akanksha Jain; Calvin Lin; 2019. Cache Replacement Policies. Morgan & Claypool Publishers. p. 45. {{ISBN|978-1-68173-577-1}}.</ref> Since it is shared, each block in the cache is unique and therefore has a larger hit rate as there will be no duplicate blocks. However, data-access latency can increase as multiple cores try to access the same cache.<ref>David Culler; Jaswinder Pal Singh; Anoop Gupta; 1999. Parallel Computer Architecture: A Hardware/Software Approach. Gulf Professional Publishing. p. 436. {{ISBN|978-1-55860-343-1}}.</ref>
In [[multi-core processor]]s, the design choice to make a cache shared or private impacts the performance of the processor.<ref name="Keckler (2009)">Stephen W. Keckler; Kunle Olukotun; H. Peter Hofstee; 2009. Multicore Processors and Systems. Springer Science & Business Media. p. 182. {{ISBN|978-1-4419-0263-4}}.</ref> In practice, the upper-level cache L1 (or sometimes L2)<ref name=":2" /><ref name=":3" /> is implemented as private and lower-level caches are implemented as shared.
== Recent implementation models ==
[[File:Nehalem EP.png|thumb|387x387px|Cache organization of Intel Nehalem microarchitecture<ref>{{Cite web|url=
=== Intel
Up to 64-core:
* L1 cache (instruction and data) – 64 {{abbr|kB|kilobytes}} per core▼
*
*
* L3 cache – 5 {{abbr|MB|megabytes}} per core (i.e., up to 320 {{abbr|MB|megabytes}} total)
* L4 cache – 128 MB of eDRAM (Iris Pro models only)<ref name=":2">{{Cite web|url=http://ark.intel.com/|title=Intel Broadwell Microarchitecture|last=|first=|date=|website=|publisher=|access-date=}}</ref>▼
=== Intel
6-core (performance | efficiency):
* L1 cache (instruction and data) – 64 kB per core▼
*
* L2 cache – 2 {{abbr|MB|megabytes}} per core | 4–8 {{abbr|MB|megabytes}} semi-shared
* L3 cache – 2 MB to 8 MB shared<ref name=":3">{{Cite web|url=http://ark.intel.com/|title=Intel Kaby Lake Microrchitecture|last=|first=|date=|website=|publisher=|access-date=}}</ref>▼
* L3 cache – 20–24 {{abbr|MB|megabytes}} shared
=== AMD EPYC 9684X (Zen 4, 2023) ===
96-core:
* L1 cache – 64 {{abbr|KB|kilobytes}} per core
* L2 cache – 1 {{abbr|MB|megabytes}} per core
* L3 cache – 1152 {{abbr|MB|megabytes}} shared
=== Apple M1 Ultra (2022) ===
20-core (4:1 "performance" core | "efficiency" core):
* L1 cache – 320|192 {{abbr|KB|kilobytes}} per core
* L2 cache – 52 {{abbr|MB|megabytes}} semi-shared
* L3 cache – 96 {{abbr|MB|megabytes}} shared
=== AMD Zen 3 (2022) ===
6- to 16-core:
* L1 cache – 64 {{abbr|KB|kilobytes}} per core
* L2 cache – 1 {{abbr|MB|megabytes}} per core
* L3 cache – 32 to 128 {{abbr|MB|megabytes}} shared
=== AMD Zen 2 (2019) ===
* L1 cache – 32 KB data & 32 KB instruction per core, 8-way
* L2 cache – 512 KB per core, 8-way inclusive
* L3 cache – 16 MB local per 4-core CCX, 2 CCXs per chiplet, 16-way non-inclusive. Up to 64 MB on desktop CPUs and 256 MB on server CPUs
=== AMD Zen (2017) ===
* L1 cache – 32 KB data & 64 KB instruction per core, 4-way
* L2 cache – 512 KB per core, 4-way inclusive
* L3 cache – 4 MB local & remote per 4-core CCX, 2 CCXs per chiplet, 16-way non-inclusive. Up to 16 MB on desktop CPUs and 64 MB on server CPUs
=== Intel Kaby Lake (2016) ===
* L2 cache – 256 KB per core
▲* L3 cache – 2 MB to 8 MB shared<ref name=":3">{{Cite web|url=
=== Intel Broadwell (2014) ===
* L2 cache – 256 KK per core
* L3 cache – 2 {{Abbr|MB|megabytes}} to 6 MB shared
▲* L4 cache – 128 MB of eDRAM (Iris Pro models only)<ref name=":2">{{Cite web|url=
=== IBM
* L1 cache (instruction and data) – each 64-banked, each bank has
* L2 cache – 256
* L3 cache – 8 regions of 4 MB (total 32 MB), local region 6 ns, remote 30 ns, each region 8-way associative, DRAM data array, SRAM tag array<ref>{{Cite web|url=
== See also ==
* CPU microarchitectures mentioned in this article:
** [[POWER7
** [[Broadwell (microarchitecture)|Intel Broadwell
** [[
** [[Apple silicon|Apple Silicon]]
* [[CPU cache]]
* [[Memory hierarchy]]
* [[CAS latency|CAS latency (RAM)]]
* [[Cache (computing)]]
|