CPU cache: Difference between revisions

==History==
[[File:NeXTcube motherboard.jpg|thumb|[[Motherboard]] of a [[NeXTcube]] computer (1990). At the lower edge of the image left from the middle, there is the CPU [[Motorola 68040]] operated at 25 [[MHz]] with two separate level 1 caches of 4 KiB each on the chip, one for the instructions and one for data. The board has no external L2 cache.]]
Early examples of CPU caches include the [[Titan (1963 computer)|Atlas 2]]<ref>{{cite web|last=Landy|first=Barry|url=http://www.chilton-computing.org.uk/acl/technology/atlas50th/p005.htm|title=Atlas 2 at Cambridge Mathematical Laboratory (and Aldermaston and CAD Centre)|date=November 2012|quote=Two tunnel diode stores were developed at Cambridge; one, which worked very well, speeded up the fetching of operands, the other was intended to speed up the fetching of instructions. The idea was that most instructions are obeyed in sequence, so when an instruction was fetched that word was placed in the slave store in the ___location given by the fetch address modulo 32; the remaining bits of the fetch address were also stored. If the wanted word was in the slave it was read from there instead of main memory. This would give a major speedup to instruction loops up to 32 instructions long, and reduced effect for loops up to 64 words.}}</ref> and the [[IBM System/360 Model 85]]<ref>{{cite web|url=http://www.bitsavers.org/pdf/ibm/360/functional_characteristics/A22-6916-1_360-85_funcChar_Jun68.pdf|title=IBM System/360 Model 85 Functional Characteristics|publisher=[[IBM]]|id=A22-6916-1|date=June 1968}}</ref><ref>{{cite journal|url=https://www.andrew.cmu.edu/course/15-440/assets/READINGS/liptay1968.pdf|last=Liptay|first=John S.|title=Structural aspects of the System/360 Model 85 - Part II The cache|journal=IBM Systems Journal|date=March 1968|volume=7|issue=1|pages=15–21|doi=10.1147/sj.71.0015}}</ref> in the 1960s. The first CPUs that used a cache had only one level of cache; unlike later level 1 cache, it was not split into L1d (for data) and L1i (for instructions). 
The split L1 cache started in 1976 with the [[IBM 801]] CPU,<ref>{{cite journal|url=http://home.eng.iastate.edu/~zzhang/courses/cpre585-f03/reading/smith-csur82-cache.pdf|title=Cache Memories|last=Smith |first=Alan Jay|journal=Computing Surveys|volume=14|issue=3|date=September 1982|pages=473–530|doi=10.1145/356887.356892|s2cid=6023466}}</ref><ref>{{cite journal|title=Altering Computer Architecture is Way to Raise Throughput, Suggest IBM Researchers|journal=[[Electronics (magazine)|Electronics]]|volume=49|issue=25|date=December 1976|pages=30–31}}</ref> became mainstream in the late 1980s, and in 1997 entered the embedded CPU market with the ARMv5TE. By 2015, even sub-dollar [[System on a chip|SoCs]] split the L1 cache. These SoCs also have L2 caches and, for larger processors, L3 caches as well. The L2 cache is usually not split, and acts as a common repository for the already split L1 cache. Every core of a [[multi-core processor]] has a dedicated L1 cache, which is usually not shared between the cores. The L2 cache, and lower-level caches, may be shared between the cores. L4 cache is currently uncommon, and is generally [[dynamic random-access memory]] (DRAM) on a separate die or chip, rather than [[static random-access memory]] (SRAM). An exception to this is when [[eDRAM]] is used for all levels of cache, down to L1. Historically, L1 was also on a separate die; however, bigger die sizes have allowed integration of it as well as other cache levels, with the possible exception of the last level. Each extra level of cache tends to be smaller and faster than the lower levels.<ref name=":0" />
 
Cache sizes (like RAM sizes historically) have generally been powers of two: 2, 4, 8, 16, etc. [[Kibibyte|KiB]]. Once sizes reached the [[Mebibyte|MiB]] range (i.e. for larger non-L1 caches), the pattern broke down quite early, allowing larger caches without forcing the doubling-in-size paradigm; for example, the [[Intel Core 2 Duo]] had a 3&nbsp;MiB L2 cache in April 2008. This happened much later for L1 caches, as their size is generally still a small number of KiB. The [[IBM zEC12 (microprocessor)|IBM zEC12]] from 2012 is an exception, with an unusually large 96&nbsp;KiB L1 data cache for its time; likewise, the [[IBM z13 (microprocessor)|IBM z13]] has a 96&nbsp;KiB L1 instruction cache (and a 128&nbsp;KiB L1 data cache),<ref>{{cite web|last1=White|first1=Bill|last2=De Leon|first2=Cecilia A.|display-authors=etal |url=https://www.redbooks.ibm.com/redbooks/pdfs/sg248250.pdf|title=IBM z13 and IBM z13s Technical Introduction|page=20|date=March 2016|publisher=IBM}}</ref> and Intel [[Ice Lake (microprocessor)|Ice Lake]]-based processors from 2018 have a 48&nbsp;KiB L1 data cache and a 48&nbsp;KiB L1 instruction cache. In 2020, some [[Intel Atom]] CPUs (with up to 24 cores) have (multiples of) 4.5&nbsp;MiB and 15&nbsp;MiB cache sizes.<ref>{{Cite press release|url=https://www.intel.com/content/www/us/en/newsroom/news/product-fact-sheet-accelerating-5g-network-infrastructure-core-edge.html|publisher=Intel Corporation |date=25 February 2020|title=Product Fact Sheet: Accelerating 5G Network Infrastructure, from the Core to the Edge|website=Intel Newsroom|quote=L1 cache of 32KB/core, L2 cache of 4.5MB per 4-core cluster and shared LLC cache up to 15MB.|language=en-US|access-date=2024-04-18}}</ref><ref>{{Cite web|url=https://www.anandtech.com/show/15544/intel-launches-atom-p5900-a-10nm-atom-for-radio-access-networks|title=Intel Launches Atom P5900: A 10nm Atom for Radio Access Networks|last=Smith|first=Ryan|website=AnandTech |access-date=2020-04-12}}</ref>
The "size" of the cache is the amount of main memory data it can hold. This size can be calculated as the number of bytes stored in each data block times the number of blocks stored in the cache. (The tag, flag and [[ECC memory#Cache|error correction code]] bits are not included in the size,<ref>{{cite web |last1=Sadler |first1=Nathan N. |last2=Sorin |first2=Daniel L. |year=2006 |title=Choosing an Error Protection Scheme for a Microprocessor's L1 Data Cache |url=https://people.ee.duke.edu/~sorin/papers/iccd06_perc.pdf |page=4}}</ref> although they do affect the physical area of a cache.)
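As a sketch, this size calculation can be written out in code; the helper name and the example parameters (64-byte blocks, 8-way associativity, 64 sets) are hypothetical illustrations, not values from the text:

```python
def cache_size_bytes(block_size: int, associativity: int, num_sets: int) -> int:
    """Data capacity only; tag, flag and ECC bits are excluded."""
    # bytes per block * blocks per set * number of sets
    return block_size * associativity * num_sets

# Example: 64-byte blocks, 8-way set associative, 64 sets -> 32768 bytes (32 KiB)
size = cache_size_bytes(block_size=64, associativity=8, num_sets=64)
```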
 
An effective memory address that refers into a cache line (memory block) is split ([[Most significant bit|MSB]] to [[Least significant bit|LSB]]) into the tag, the index and the block offset.<ref name=":0">{{cite book |last1=Hennessy |first1=John L. |url=https://books.google.com/books?id=v3-1hVwHnHwC&q=Hennessey+%22block+offset%22&pg=PA120 |title=Computer Architecture: A Quantitative Approach |last2=Patterson |first2=David A. |publisher=Elsevier |year=2011 |isbn=978-0-12-383872-8 |page=B-9 |language=en}}</ref><ref>{{cite book |last1=Patterson |first1=David A. |url=https://books.google.com/books?id=3b63x-0P3_UC&q=Hennessey+%22block+offset%22&pg=PA484 |title=Computer Organization and Design: The Hardware/Software Interface |last2=Hennessy |first2=John L. |publisher=Morgan Kaufmann |year=2009 |isbn=978-0-12-374493-7 |page=484 |language=en}}</ref>
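This split can be illustrated with a short sketch. The bit widths below (6 offset bits for 64-byte blocks, 6 index bits for 64 sets) are assumed example parameters, not values from the text:

```python
OFFSET_BITS = 6   # log2(block size): 64-byte blocks
INDEX_BITS = 6    # log2(number of sets): 64 sets

def split_address(addr: int) -> tuple[int, int, int]:
    """Split an effective address (MSB to LSB) into tag, index, block offset."""
    offset = addr & ((1 << OFFSET_BITS) - 1)           # lowest bits: byte within block
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # middle bits: which set
    tag = addr >> (OFFSET_BITS + INDEX_BITS)           # remaining MSBs: stored tag
    return tag, index, offset

tag, index, offset = split_address(0x1234ABCD)
```

Reassembling `(tag << 12) | (index << 6) | offset` recovers the original address, since the three fields partition the address bits exactly.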
 
Modern processors have multiple interacting on-chip caches. The operation of a particular cache can be completely specified by the cache size, the cache block size, the number of blocks in a set, the cache set replacement policy, and the cache write policy (write-through or write-back).<ref name="ccs.neu.edu" />
 
While all of the cache blocks in a particular cache are the same size and have the same associativity, typically the "higher-level" caches (called Level 1 cache) have a smaller number of blocks, smaller block size, and fewer blocks in a set, but have very short access times. "Lower-level" caches (i.e. Level 2 and beyond) have progressively larger numbers of blocks, larger block size, more blocks in a set, and relatively longer access times, but are still much faster than main memory.<ref name=":0" />
 
The cache entry replacement policy is determined by a [[cache algorithm]] chosen by the processor designers. In some cases, multiple algorithms are provided for different kinds of workloads.
When considering a chip with [[Multi-core processor|multiple cores]], there is a question of whether the caches should be shared or local to each core. Implementing a shared cache inevitably introduces more wiring and complexity. On the other hand, having one cache per ''chip'', rather than per ''core'', greatly reduces the amount of space needed, and thus allows for a larger cache.
 
Typically, sharing the L1 cache is undesirable because the resulting increase in latency would make each core run considerably slower than a single-core chip. However, for the lowest-level cache, the last one called before accessing memory, having a global cache is desirable for several reasons, such as allowing a single core to use the whole cache, reducing data redundancy by making it possible for different processes or threads to share cached data, and reducing the complexity of utilized cache coherency protocols.<ref>{{cite web |last1=Tian |first1=Tian |last2=Shih |first2=Chiu-Pi |date=2012-03-08 |title=Software Techniques for Shared-Cache Multi-Core Systems |url=https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems |access-date=2015-11-24 |publisher=[[Intel]]}}</ref> For example, an eight-core chip with three levels may include an L1 cache for each core, one intermediate L2 cache for each pair of cores, and one L3 cache shared between all cores.
 
A shared lowest-level cache, which is called before accessing memory, is usually referred to as a ''last level cache'' (LLC). Additional techniques are used to increase the level of parallelism when the LLC is shared between multiple cores, including slicing it into multiple pieces, each addressing a certain range of memory addresses, which can be accessed independently.<ref name=":0" /><ref>{{cite web |last=Lempel |first=Oded |date=2013-07-28 |title=2nd Generation Intel Core Processor Family: Intel Core i7, i5 and i3 |url=http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.911-Sandy-Bridge-Lempel-Intel-Rev%207.pdf |url-status=dead |archive-url=https://web.archive.org/web/20200729000210/http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.911-Sandy-Bridge-Lempel-Intel-Rev%207.pdf |archive-date=2020-07-29 |access-date=2014-01-21 |website=hotchips.org |pages=7&ndash;10, 31&ndash;45}}</ref>
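The slicing idea can be sketched as follows. The simple modular hash, the slice count, and the block size here are illustrative assumptions; real designs (such as Intel's sliced LLC) use more complex, often undocumented hash functions to pick a slice:

```python
BLOCK_BITS = 6     # 64-byte cache lines (assumed)
NUM_SLICES = 8     # e.g. one slice per core (assumed)

def llc_slice(addr: int) -> int:
    """Select which independent LLC slice services this address."""
    # Hash the block address (address without the offset bits) onto a slice.
    return (addr >> BLOCK_BITS) % NUM_SLICES

# Consecutive cache lines map to different slices, spreading traffic
# so that multiple cores can access the LLC in parallel.
slices = [llc_slice(a) for a in range(0, 8 * 64, 64)]
```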
 
====Separate versus unified====