{{Short description|Hardware cache of a central processing unit}}
A '''CPU cache''' is a [[hardware cache]] used by the [[central processing unit]] (CPU) of a [[computer]] to reduce the average cost (time or energy) to access [[data (computer science)|data]] from the [[main memory]].<ref>{{cite web |author=Torres |first=Gabriel |date=September 12, 2007 |title=How The Cache Memory Works |url=https://hardwaresecrets.com/how-the-cache-memory-works/}}</ref> A cache is a smaller, faster memory, located closer to a [[processor core]], which stores copies of the data from frequently used main [[memory ___location]]s.
Cache memory is typically implemented with [[static random-access memory]] (SRAM), which requires multiple [[transistor]]s to store a single [[bit]]. This makes it expensive in terms of the area it takes up, and in modern CPUs the cache is typically the largest part by chip area. The size of the cache needs to be balanced with the general desire for smaller chips which cost less. Some modern designs implement some or all of their cache using the physically smaller [[eDRAM]], which is slower to use than SRAM but allows larger amounts of cache for any given amount of chip area.
Most CPUs have a hierarchy of multiple cache [[#MULTILEVEL|levels]] (L1, L2, often L3, and rarely even L4), with separate instruction-specific (I-cache) and data-specific (D-cache) caches at level 1.<ref>{{Cite journal |last1=Su |first1=Chao |last2=Zeng |first2=Qingkai |date=2021-06-10 |editor-last=Nicopolitidis |editor-first=Petros |title=Survey of CPU Cache-Based Side-Channel Attacks: Systematic Analysis, Security Models, and Countermeasures |journal=Security and Communication Networks |language=en |volume=2021 |pages=1–15 |doi=10.1155/2021/5559552 |issn=1939-0122|doi-access=free }}</ref> The different levels are implemented in different areas of the chip; L1 is located as close to a CPU core as possible and thus offers the highest speed due to short signal paths, but requires careful design. L2 caches are physically separate from the CPU and operate slower, but place fewer demands on the chip designer and can be made much larger without impacting the CPU design. L3 caches are generally shared among multiple CPU cores.
Other types of caches exist (that are not counted towards the "cache size" of the most important caches mentioned above), such as the [[translation lookaside buffer]] (TLB) which is part of the [[memory management unit]] (MMU) which most CPUs have. [[Input/output]] sections also often contain [[data buffer]]s that serve a similar purpose.
=={{Anchor|ICACHE|DCACHE|instruction cache|data cache}}Overview==
To access data in [[main memory]], a multi-step process is used and each step introduces a delay. For instance, to read a value from memory in a simple computer system the CPU first selects the address to be accessed by expressing it on the [[address bus]] and waiting a fixed time to allow the value to settle. The memory device with that value, normally implemented in [[DRAM]], holds that value in a very low-energy form that is not powerful enough to be read directly by the CPU. Instead, it has to copy that value from storage into a small buffer which is connected to the [[data bus]]. The CPU then waits a certain time to allow this value to settle before reading the value from the data bus.
By locating the memory physically closer to the CPU the time needed for the buses to settle is reduced, and by replacing the DRAM with SRAM, which holds the value in a form that does not require amplification to be read, the delay within the memory itself is eliminated. This makes the cache much faster both to respond and to read or write. SRAM, however, requires anywhere from four to six transistors to hold a single bit, depending on the type, whereas DRAM generally uses one transistor and one capacitor per bit, which makes it able to store much more data for any given chip area.
Implementing some memory in a faster format can lead to large performance improvements. When trying to read from or write to a ___location in the main memory, the processor checks whether the data from that ___location is already in the cache. If so, the processor will read from or write to the cache instead of the much slower main memory.
Many modern [[desktop computer|desktop]], [[server (computing)|server]], and industrial CPUs have at least three independent levels of caches (L1, L2 and L3) and different types of caches:
==History==
[[File:NeXTcube motherboard.jpg|thumb|[[Motherboard]] of a [[NeXTcube]] computer (1990). At the lower edge of the image left from the middle, there is the CPU [[Motorola 68040]] operated at 25 [[MHz]] with two separate level 1 caches of 4 KiB each on the chip, one for the instructions and one for data. The board has no external L2 cache.]]
Early examples of CPU caches include the [[Titan (1963 computer)|Atlas 2]]<ref>{{cite web|last=Landy|first=Barry|url=http://www.chilton-computing.org.uk/acl/technology/atlas50th/p005.htm|title=Atlas 2 at Cambridge Mathematical Laboratory (and Aldermaston and CAD Centre)|date=November 2012|quote=Two tunnel diode stores were developed at Cambridge; one, which worked very well, speeded up the fetching of operands, the other was intended to speed up the fetching of instructions. The idea was that most instructions are obeyed in sequence, so when an instruction was fetched that word was placed in the slave store in the ___location given by the fetch address modulo 32; the remaining bits of the fetch address were also stored. If the wanted word was in the slave it was read from there instead of main memory. This would give a major speedup to instruction loops up to 32 instructions long, and reduced effect for loops up to 64 words.}}</ref> and the [[IBM System/360 Model 85]]<ref>{{cite web|url=http://www.bitsavers.org/pdf/ibm/360/functional_characteristics/A22-6916-1_360-85_funcChar_Jun68.pdf|title=IBM System/360 Model 85 Functional Characteristics|publisher=[[IBM]]|id=A22-6916-1|date=June 1968}}</ref><ref>{{cite journal|url=https://www.andrew.cmu.edu/course/15-440/assets/READINGS/liptay1968.pdf|last=Liptay|first=John S.|title=Structural aspects of the System/360 Model 85 - Part II The cache|journal=IBM Systems Journal|date=March 1968|volume=7|issue=1|pages=15–21|doi=10.1147/sj.71.0015}}</ref> in the 1960s. The first CPUs that used a cache had only one level of cache; unlike later level 1 cache, it was not split into L1d (for data) and L1i (for instructions). 
Split L1 cache started in 1976 with the [[IBM 801]] CPU,<ref>{{cite journal|url=http://home.eng.iastate.edu/~zzhang/courses/cpre585-f03/reading/smith-csur82-cache.pdf|title=Cache Memories|last=Smith |first=Alan Jay|journal=Computing Surveys|volume=14|issue=3|date=September 1982|pages=473–530|doi=10.1145/356887.356892|s2cid=6023466}}</ref><ref>{{cite journal|title=Altering Computer Architecture is Way to Raise Throughput, Suggest IBM Researchers|journal=[[Electronics (magazine)|Electronics]]|volume=49|issue=25|date=December 1976|pages=30–31}}</ref> became mainstream in the late 1980s, and in 1997 entered the embedded CPU market with the ARMv5TE.
Caches (like RAM historically) have generally been sized in powers of two: 2, 4, 8, 16 etc. [[Kibibyte|KiB]]. Once sizes reached the [[Mebibyte|MiB]] range (i.e. for larger non-L1 caches), this pattern broke down early on, allowing larger caches without forcing the doubling-in-size paradigm; for example, the [[Intel Core 2 Duo]] had a 3 MiB L2 cache in April 2008. This happened much later for L1 caches, as their size is generally still a small number of KiB. Exceptions include the [[IBM zEC12 (microprocessor)|IBM zEC12]] from 2012, with an unusually large 96 KiB L1 data cache for its time; the [[IBM z13 (microprocessor)|IBM z13]], with a 96 KiB L1 instruction cache (and 128 KiB L1 data cache);<ref>{{cite web|last1=White|first1=Bill|last2=De Leon|first2=Cecilia A.|display-authors=etal |url=https://www.redbooks.ibm.com/redbooks/pdfs/sg248250.pdf|title=IBM z13 and IBM z13s Technical Introduction|page=20|date=March 2016|publisher=IBM}}</ref> and Intel [[Ice Lake (microprocessor)|Ice Lake]]-based processors from 2018, with a 48 KiB L1 data cache and 48 KiB L1 instruction cache.
In 2020, some [[Intel Atom]] CPUs (with up to 24 cores) have (multiple of) 4.5 MiB and 15 MiB cache sizes.<ref>{{Cite press release|url=https://www.intel.com/content/www/us/en/newsroom/news/product-fact-sheet-accelerating-5g-network-infrastructure-core-edge.html|publisher=Intel Corporation |date=25 February 2020|title=Product Fact Sheet: Accelerating 5G Network Infrastructure, from the Core to the Edge|website=Intel Newsroom|quote=L1 cache of 32KB/core, L2 cache of 4.5MB per 4-core cluster and shared LLC cache up to 15MB.|language=en-US|access-date=2024-04-18}}</ref><ref>{{Cite web|url=https://www.anandtech.com/show/15544/intel-launches-atom-p5900-a-10nm-atom-for-radio-access-networks|archive-url=https://web.archive.org/web/20200224143422/https://www.anandtech.com/show/15544/intel-launches-atom-p5900-a-10nm-atom-for-radio-access-networks|url-status=dead|archive-date=February 24, 2020|title=Intel Launches Atom P5900: A 10nm Atom for Radio Access Networks|last=Smith|first=Ryan|website=AnandTech |access-date=2020-04-12}}</ref>
==Operation==
The [[Cache placement policies|placement policy]] decides where in the cache a copy of a particular entry of main memory will go. If the placement policy is free to choose any entry in the cache to hold the copy, the cache is called ''fully associative''. At the other extreme, if each entry in the main memory can go in just one place in the cache, the cache is ''direct-mapped''. Many caches implement a compromise in which each entry in the main memory can go to any one of N places in the cache, and are described as N-way set associative.<ref>{{cite web |date=2010-12-02 |title=Cache design |url=https://cseweb.ucsd.edu/classes/fa10/cse240a/pdf/08/CSE240A-MBT-L15-Cache.ppt.pdf |access-date=2023-01-29 |website=ucsd.edu |pages=10–15}}</ref> For example, the level-1 data cache in an [[AMD Athlon]] is two-way set associative, which means that any particular ___location in main memory can be cached in either of two locations in the level-1 data cache.
Choosing the right value of associativity involves a [[trade-off]]. If there are ten places to which the placement policy could have mapped a memory ___location, then to check if that ___location is in the cache, ten cache entries must be searched. Checking more places takes more power and chip area, and potentially more time. On the other hand, caches with more associativity suffer fewer misses (see [[Cache performance measurement and metric#Conflict misses|conflict misses]]), so that the CPU wastes less time reading from the slow main memory. The general guideline is that doubling the associativity, from direct mapped to two-way, or from two-way to four-way, has about the same effect on raising the hit rate as doubling the cache size. However, increasing associativity more than four does not improve hit rate as much.
<!-- where does "pseudo-associative cache" go in this spectrum? -->
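The address mapping described above can be sketched as follows. This is an illustrative sketch only; the function name and parameter values are chosen for the example and do not describe any particular processor.

```python
# Illustrative sketch: how a set-associative cache maps a memory address
# to a (tag, set index, byte offset) triple. A direct-mapped cache is the
# special case ways=1; a fully associative cache has a single set.

def split_address(addr, cache_bytes, line_bytes, ways):
    """Return (tag, set_index, offset) for an address in a cache of the
    given total size, line size, and associativity."""
    num_sets = cache_bytes // (line_bytes * ways)
    offset = addr % line_bytes                 # byte within the cache line
    set_index = (addr // line_bytes) % num_sets  # which set the line maps to
    tag = addr // (line_bytes * num_sets)      # identifies the line within a set
    return tag, set_index, offset

# A 64 KiB two-way set-associative cache with 64-byte lines (parameters
# chosen for illustration): any address can live in either of the two
# ways of exactly one of the 512 sets.
tag, idx, off = split_address(0x12345678, 64 * 1024, 64, 2)
print(hex(tag), idx, off)
```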
Modern processors have multiple interacting on-chip caches. The operation of a particular cache can be completely specified by the cache size, the cache block size, the number of blocks in a set, the cache set replacement policy, and the cache write policy (write-through or write-back).<ref name="ccs.neu.edu" />
While all of the cache blocks in a particular cache are the same size and have the same associativity, typically the "higher-level" caches (called Level 1 cache) have a smaller number of blocks, smaller block size, and fewer blocks in a set, but have very short access times. "Lower-level" caches (i.e. Level 2 and below) have progressively larger numbers of blocks, larger block size, and more blocks in a set, with relatively longer access times, but are still much faster than main memory.
Cache entry replacement policy is determined by a [[cache algorithm]] selected to be implemented by the processor designers. In some cases, multiple algorithms are provided for different kinds of work loads.
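One widely used replacement policy is least-recently-used (LRU), which evicts the entry that has gone unaccessed the longest. The following sketch of one LRU-managed cache set is illustrative only; real hardware uses approximations rather than this exact bookkeeping.

```python
# Minimal sketch (illustrative, not any specific processor) of LRU
# replacement for a single cache set holding `ways` entries.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> data; insertion order tracks recency

    def access(self, tag):
        """Return True on a hit; on a miss, fill the line, evicting the
        least-recently-used entry if the set is full."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # evict the LRU line
        self.lines[tag] = None
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])
# tag 2 is evicted when tag 3 is filled, so the final access misses
```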
}}</ref> variant of its [[Haswell (microarchitecture)|Haswell]] processors introduced an on-package 128 MiB [[eDRAM]] Level 4 cache which serves as a victim cache to the processors' Level 3 cache.<ref name="anandtech-i74950hq">{{cite web
| url = http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3
| archive-url = https://archive.today/20130915191303/http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3
| url-status = dead
| archive-date = September 15, 2013
| title = Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested
| publisher = [[AnandTech]]
| access-date = 2013-09-16
}}</ref> In the [[Skylake (microarchitecture)|Skylake]] microarchitecture the Level 4 cache no longer works as a victim cache.<ref>{{cite web |author=Cutress |first=Ian |date=September 2, 2015 |title=The Intel Skylake Mobile and Desktop Launch, with Architecture Analysis |url=http://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis/5 |archive-url=https://web.archive.org/web/20150904211611/http://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis/5 |url-status=dead |archive-date=September 4, 2015 |publisher=AnandTech}}</ref>
===={{Anchor|TRACE-CACHE}}Trace cache====
{{Main article|Trace cache}}
One of the more extreme examples of cache specialization is the '''trace cache''' (also known as ''execution trace cache'') found in the [[Intel]] [[Pentium 4]] microprocessors. A trace cache is a mechanism for increasing the instruction fetch bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces of [[instruction (computer science)|instruction]]s that have already been fetched and decoded.<ref>{{cite web |author=Shimpi |first=Anand Lal |date=2000-11-20 |title=The Pentium 4's Cache – Intel Pentium 4 1.4 GHz & 1.5 GHz |url=http://www.anandtech.com/show/661/5 |archive-url=https://web.archive.org/web/20100526025110/http://www.anandtech.com/show/661/5 |url-status=dead |archive-date=May 26, 2010 |access-date=2015-11-30 |publisher=[[AnandTech]]}}</ref>
A trace cache stores instructions either after they have been decoded, or as they are retired. Generally, instructions are added to trace caches in groups representing either individual [[basic block]]s or dynamic instruction traces. The Pentium 4's trace cache stores [[micro-operations]] resulting from decoding x86 instructions, providing also the functionality of a micro-operation cache. Having this, the next time an instruction is needed, it does not have to be decoded into micro-ops again.<ref name="agner.org" />{{rp|63–68}}
A '''micro-operation cache''' ('''μop cache''', '''uop cache''' or '''UC''')<ref>{{cite web |author=Kanter |first=David |date=September 25, 2010 |title=Intel's Sandy Bridge Microarchitecture – Instruction Decode and uop Cache |url=http://www.realworldtech.com/sandy-bridge/4/ |website=Real World Technologies}}</ref> is a specialized cache that stores [[micro-operation]]s of decoded instructions, as received directly from the [[instruction decoder]]s or from the instruction cache. When an instruction needs to be decoded, the μop cache is checked for its decoded form which is re-used if cached; if it is not available, the instruction is decoded and then cached.
One of the early works describing μop cache as an alternative frontend for the Intel [[P6 (microarchitecture)|P6 processor family]] is the 2001 paper ''"Micro-Operation Cache: A Power Aware Frontend for Variable Instruction Length ISA"''.<ref name="uop-intel">{{cite conference |conference=2001 International Symposium on Low Power Electronics and Design (ISLPED'01), August 6-7, 2001 |___location=Huntington Beach, CA, USA |last1=Solomon |first1=Baruch |book-title=ISLPED'01: Proceedings of the 2001 International Symposium on Low Power Electronics and Design |last2=Mendelson |first2=Avi |last3=Orenstein |first3=Doron |last4=Almog |first4=Yoav |last5=Ronen |first5=Ronny |date=August 2001 |publisher=[[Association for Computing Machinery]] |isbn=978-1-58113-371-4 |pages=4–9 |title=Micro-Operation Cache: A Power Aware Frontend for Variable Instruction Length ISA |doi=10.1109/LPE.2001.945363 |access-date=2013-10-06 |url=http://cecs.uci.edu/~papers/compendium94-03/papers/2001/islped01/pdffiles/p004.pdf |s2cid=195859085}}</ref> Later, Intel included μop caches in its [[Sandy Bridge]] processors and in successive microarchitectures like [[Ivy Bridge (microarchitecture)|Ivy Bridge]] and [[Haswell (microarchitecture)|Haswell]].<ref name="agner.org">{{cite web |author=Fog |first=Agner |author-link=Agner Fog |date=2014-02-19 |title=The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers |url=http://www.agner.org/optimize/microarchitecture.pdf |access-date=2014-03-21 |website=agner.org}}</ref>{{rp|121–123}}<ref name="anandtech-haswell">{{cite web |author=Shimpi |first=Anand Lal |date=2012-10-05 |title=Intel's Haswell Architecture Analyzed |url=http://www.anandtech.com/show/6355/intels-haswell-architecture/6 |archive-url=https://archive.today/20130628103529/http://www.anandtech.com/show/6355/intels-haswell-architecture/6 |url-status=dead |archive-date=June 28, 2013 |access-date=2013-10-20 |publisher=[[AnandTech]]}}</ref> 
AMD implemented a μop cache in their [[Zen (microarchitecture)|Zen microarchitecture]].<ref>{{cite web |author=Cutress |first=Ian |date=2016-08-18 |title=AMD Zen Microarchitecture: Dual Schedulers, Micro-Op Cache and Memory Hierarchy Revealed |url=http://www.anandtech.com/show/10578/amd-zen-microarchitecture-dual-schedulers-micro-op-cache-memory-hierarchy-revealed |archive-url=https://archive.today/20160818171527/http://www.anandtech.com/show/10578/amd-zen-microarchitecture-dual-schedulers-micro-op-cache-memory-hierarchy-revealed |url-status=dead |archive-date=August 18, 2016 |access-date=2017-04-03 |publisher=AnandTech}}</ref>
Fetching complete pre-decoded instructions eliminates the need to repeatedly decode variable length complex instructions into simpler fixed-length micro-operations, and simplifies the process of predicting, fetching, rotating and aligning fetched instructions. A μop cache effectively offloads the fetch and decode hardware, thus decreasing [[power consumption]] and improving the frontend supply of decoded micro-operations. The μop cache also increases performance by more consistently delivering decoded micro-operations to the backend and eliminating various bottlenecks in the CPU's fetch and decode logic.<ref name="uop-intel" /><ref name="anandtech-haswell" />
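The lookup behavior described above is essentially memoization of the decoder. The following sketch is hypothetical: `decode_to_uops` merely stands in for the real decoder hardware, and the names are illustrative.

```python
# Hypothetical sketch of a μop-cache lookup: decoded micro-operations are
# keyed by instruction address, so a hit skips the expensive decode step.

uop_cache = {}
decode_count = 0

def decode_to_uops(addr):
    """Stand-in for the hardware instruction decoders."""
    global decode_count
    decode_count += 1
    return ("uops-for", addr)  # placeholder for real micro-operations

def fetch_uops(addr):
    if addr not in uop_cache:            # miss: decode, then cache the result
        uop_cache[addr] = decode_to_uops(addr)
    return uop_cache[addr]               # hit: reuse the cached decode

# A loop re-executing two instructions: each is decoded only once.
for addr in [0x400, 0x404, 0x400, 0x404]:
    fetch_uops(addr)
print(decode_count)
```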
* AMD [[Phenom II]] (2008) has up to 6 MiB on-die unified L3 cache.
* [[List of Intel Core i7 processors|Intel Core i7]] (2008) has an 8 MiB on-die unified L3 cache that is inclusive, shared by all cores.
* Intel [[Haswell (microarchitecture)|Haswell]] CPUs with integrated [[Intel Iris Pro Graphics]] have 128 MiB of eDRAM acting essentially as an L4 cache.<ref>{{cite web|url=http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3 |archive-url=https://archive.today/20130915191303/http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3 |url-status=dead |archive-date=September 15, 2013 |title=Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested |publisher=AnandTech |access-date=2014-02-25}}</ref>
Finally, at the other end of the memory hierarchy, the CPU [[register file]] itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software, typically by a compiler, as it allocates registers to hold values retrieved from main memory, as in [[loop nest optimization]]. However, with [[register renaming]] most compiler register assignments are reallocated dynamically by hardware at runtime into a register bank, allowing the CPU to break false data dependencies and thus easing pipeline hazards.
When considering a chip with [[Multi-core processor|multiple cores]], there is a question of whether the caches should be shared or local to each core. Implementing shared cache inevitably introduces more wiring and complexity. But then, having one cache per ''chip'', rather than ''core'', greatly reduces the amount of space needed, and thus one can include a larger cache.
Typically, sharing the L1 cache is undesirable because the resulting increase in latency would make each core run considerably slower than a single-core chip. However, for the last-level cache, the last one consulted before accessing main memory, a shared cache is desirable for several reasons, such as allowing a single core to use the whole cache, making it possible for different cores to share cached data, and simplifying the cache coherency protocol.
A shared last-level cache also avoids duplicating the same data in several per-core caches, leaving more of the total capacity free for other data.
====Separate versus unified====
===More hierarchies===
<!-- (This section should be rewritten.) -->
Other processors have other kinds of predictors (e.g., the store-to-load bypass predictor in the [[Digital Equipment Corporation|DEC]] [[Alpha 21264]]).
These predictors are caches in that they store information that is costly to compute. Some of the terminology used when discussing predictors is the same as that for caches (one speaks of a '''hit''' in a branch predictor), but predictors are not generally thought of as part of the cache hierarchy.
A more modern cache might be 16 KiB, 4-way set-associative, virtually indexed, virtually hinted, and physically tagged, with 32 B lines, 32-bit read width and 36-bit physical addresses. The read path recurrence for such a cache looks very similar to the path above. Instead of tags, virtual hints are read, and matched against a subset of the virtual address. Later on in the pipeline, the virtual address is translated into a physical address by the TLB, and the physical tag is read (just one, as the virtual hint supplies which way of the cache to read). Finally the physical address is compared to the physical tag to determine if a hit has occurred.
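The address breakdown for such a cache can be sketched as follows (an illustrative sketch; field names are assumptions). With 32 B lines and 128 sets, the offset and index together occupy the low 12 bits of the address, which equal the page offset for 4 KiB pages, so the index is available before address translation completes; this is what makes virtual indexing work here.

```python
# Sketch of the address fields for the 16 KiB, 4-way, 32 B line cache
# described above (illustrative only).

LINE = 32                        # bytes per line -> 5 offset bits
SETS = 16 * 1024 // (LINE * 4)   # 4 ways -> 128 sets -> 7 index bits

def fields(addr):
    offset = addr & (LINE - 1)         # bits 0-4: byte within the line
    index = (addr >> 5) & (SETS - 1)   # bits 5-11: which set
    tag = addr >> 12                   # remaining physical bits, up to bit 35
    return tag, index, offset

print(fields(0xABC))
```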
Some SPARC designs have improved the speed of their L1 caches by a few gate delays by collapsing the virtual address adder into the SRAM decoders. {{xref|(See [[sum-addressed decoder]].)}}
===History===
====In ARM microprocessors====
The [[Apple M1]] CPU has 128 or 192 KiB of instruction L1 cache for each core, depending on core type; this low-latency capacity is important for single-thread performance. This is an unusually large L1 cache for any CPU type, not just for a laptop. The total cache memory size, which matters more for throughput, is not unusually large for a laptop, and much larger totals (e.g. L3 or L4) are available in IBM's mainframes.
====Current research====