{{Short description|I/O-efficient algorithm regardless of cache size}}
In [[computing]], a '''cache-oblivious algorithm''' (or cache-transcendent algorithm) is an [[algorithm]] designed to take advantage of a processor [[CPU cache|cache]] without having the size of the cache (or the length of the [[cache line]]s, etc.) as an explicit parameter. An '''optimal cache-oblivious algorithm''' is a cache-oblivious algorithm that uses the cache optimally (in an [[asymptotic notation|asymptotic]] sense, ignoring constant factors).
Optimal cache-oblivious algorithms are known for [[cache-oblivious matrix multiplication|matrix multiplication]], [[matrix transposition]], [[funnelsort|sorting]], and several other problems. Some more general algorithms, such as [[Cooley–Tukey FFT algorithm|Cooley–Tukey FFT]], are optimally cache-oblivious under certain choices of parameters.
Typically, a cache-oblivious algorithm works by recursive [[divide-and-conquer algorithm|divide and conquer]]: the problem is divided into smaller and smaller subproblems, until a subproblem size is reached that fits into the cache, regardless of the cache size.
==History==
The idea (and name) for cache-oblivious algorithms was conceived by [[Charles E. Leiserson]] as early as 1996 and first published by [[Harald Prokop]] in his master's thesis at the [[Massachusetts Institute of Technology]] in 1999.<ref name="Prokop">Harald Prokop. [http://supertech.csail.mit.edu/papers/Prokop99.pdf Cache-Oblivious Algorithms] {{Webarchive|url=https://web.archive.org/web/20191124114139/http://supertech.csail.mit.edu/papers/Prokop99.pdf |date=2019-11-24 }}. Master's thesis, MIT, 1999.</ref> There were many predecessors, typically analyzing specific problems; these are discussed in detail in Frigo et al. 1999. Early examples cited include Singleton 1969 for a recursive Fast Fourier Transform, similar ideas in Aggarwal et al. 1987, Frigo 1996 for matrix multiplication and LU decomposition, and [[Todd Veldhuizen]] 1996 for matrix algorithms in the [[Blitz++]] library.
==Idealized cache model==
In general, a [[computer program|program]] can be made more cache-conscious by improving two kinds of [[locality of reference]]:<ref name="nick05"/>
*''Temporal locality'', where the algorithm fetches the same pieces of memory multiple times;
*''Spatial locality'', where the subsequent memory accesses are adjacent or nearby [[memory address]]es.
Cache-oblivious algorithms are typically analyzed using an idealized model of the cache, sometimes called the '''cache-oblivious model'''.
In particular, the cache-oblivious model is an [[abstract machine]] (i.e., a theoretical [[model of computation]]). It is similar to the [[RAM model|RAM machine model]] which replaces the [[Turing machine]]'s infinite tape with an infinite array. Each ___location within the array can be accessed in <math>O(1)</math> time, similar to the [[random-access memory]] on a real computer. Unlike the RAM machine model, it also introduces a cache: the second level of storage between the RAM and the CPU. The other differences between the two models are listed below. In the cache-oblivious model:
*A cache miss results in one block being loaded from the main memory into the cache. Namely, if the CPU tries to access word <math>w</math> and <math>x</math> is the line containing <math>w</math>, then <math>x</math> is loaded into the cache. If the cache was previously full, then a line will be evicted as well (see replacement policy below).
*The cache holds <math>M</math> objects, where <math>M = \Omega(B^2)</math>. This is also known as the ''tall cache assumption''.
*The cache is fully associative: each line can be loaded into any ___location in the cache.<ref name="Kumar">{{cite journal |last=Kumar |first=Piyush |title=Cache-Oblivious Algorithms |journal=Algorithms for Memory Hierarchies |series=LNCS 2625 |pages=193–212 |publisher=Springer Verlag |citeseerx=10.1.1.150.5426 }}</ref>
*The replacement policy is optimal. In other words, the cache is assumed to be given the entire sequence of memory accesses during algorithm execution. If it needs to evict a line at time <math>t</math>, it will look into its sequence of future requests and evict the line whose first access is furthest in the future. This can be emulated in practice with the [[Least Recently Used]] policy, which is shown to be within a small constant factor of the offline optimal replacement strategy<ref name="Frigo99">{{cite conference |first1=M. |last1=Frigo |first2=C. E. |last2=Leiserson |authorlink2=Charles Leiserson |first3=H. |last3=Prokop |authorlink3=Harald Prokop |first4=S. |last4=Ramachandran |title=Cache-oblivious algorithms |conference=Proc. IEEE Symp. on Foundations of Computer Science (FOCS) |pages=285–297 |year=1999 |url=http://supertech.csail.mit.edu/papers/FrigoLePr99.pdf}}</ref><ref name="Sleator">Daniel Sleator, Robert Tarjan. [https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.367.6317&rep=rep1&type=pdf Amortized Efficiency of List Update and Paging Rules]. In ''Communications of the ACM'', Volume 28, Number 2, pp. 202–208. Feb 1985.</ref>
To measure the complexity of an algorithm that executes within the cache-oblivious model, we measure the number of [[cache miss]]es that the algorithm experiences. Because the model captures the fact that accessing elements in the [[cache (computing)|cache]] is much faster than accessing things in [[main memory]], the [[running time]] of the algorithm is defined only by the number of memory transfers between the cache and main memory. This is similar to the [[external memory model]], which shares all of the features above, but cache-oblivious algorithms are independent of cache parameters (<math>B</math> and <math>M</math>).<ref name="Demaine02">[[Erik Demaine]]. [http://erikdemaine.org/papers/BRICS2002/ Cache-Oblivious Algorithms and Data Structures], in Lecture Notes from the EEF Summer School on Massive Data Sets, BRICS, University of Aarhus, Denmark, June 27–July 1, 2002.</ref> The benefit of such an algorithm is that what is efficient on a cache-oblivious machine is likely to be efficient across many real machines without fine-tuning for particular real machine parameters. For many problems, an optimal cache-oblivious algorithm will also be optimal for a machine with more than two [[memory hierarchy]] levels.<ref name="Frigo99"/>
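As a simple worked illustration of counting misses in this model (an example, not drawn from the cited sources): a linear scan of <math>N</math> words stored contiguously in memory touches at most <math>\lceil N/B \rceil + 1 = O(1 + N/B)</math> blocks, the extra block accounting for the scan not starting on a block boundary. The bound holds whatever the (unknown) values of <math>B</math> and <math>M</math>, so a sequential scan is already cache-oblivious and optimal.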
==Examples==
The simplest cache-oblivious algorithm presented in Frigo et al. is an out-of-place [[matrix transpose]] operation ([[in-place matrix transposition|in-place algorithms]] have also been devised for transposition, but are much more complicated for non-square matrices).
[[File:Matrix_transpose_dc.svg|thumb|right|Principle of cache-oblivious algorithm for matrix transposition using a divide and conquer-approach. The graphic shows the recursive step ('''a''' → '''b''') of dividing the matrix and transposing each part individually.]]
The cache-oblivious algorithm has optimal work complexity <math>O(mn)</math> and optimal cache complexity <math>O(1+mn/B)</math>. The basic idea is to reduce the transpose of two large matrices into the transpose of small (sub)matrices. We do this by dividing the matrices in half along their larger dimension until we just have to perform the transpose of a matrix that will fit into the cache.
(In principle, one could continue dividing the matrices until a base case of size 1×1 is reached, but in practice one uses a larger base case (e.g. 16×16) in order to [[amortized analysis|amortize]] the overhead of the recursive subroutine calls.)
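The recursion described above can be sketched in C. This is an illustrative reconstruction, not the code of Frigo et al.; the cutoff <code>BASE</code> is the kind of tuned base case mentioned above, and the row-major layout with explicit leading dimensions is an assumption of the sketch.

```c
#include <stddef.h>

/* Hypothetical cutoff: stop recursing once both dimensions are small,
 * to amortize function-call overhead (the text suggests e.g. 16x16). */
enum { BASE = 16 };

/* Transpose the m-by-n row-major matrix A (leading dimension lda)
 * into the n-by-m matrix B (leading dimension ldb), so that
 * B[j][i] = A[i][j]. The recursion always halves the larger
 * dimension, so subproblems eventually fit in any cache. */
static void transpose_rec(const double *A, double *B,
                          int lda, int ldb, int m, int n)
{
    if (m <= BASE && n <= BASE) {
        /* Base case: small enough to transpose directly. */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                B[j * ldb + i] = A[i * lda + j];
    } else if (m >= n) {
        /* Split into top and bottom halves of A
         * (left and right halves of B). */
        transpose_rec(A, B, lda, ldb, m / 2, n);
        transpose_rec(A + (m / 2) * lda, B + m / 2,
                      lda, ldb, m - m / 2, n);
    } else {
        /* Split into left and right halves of A
         * (top and bottom halves of B). */
        transpose_rec(A, B, lda, ldb, m, n / 2);
        transpose_rec(A + n / 2, B + (n / 2) * ldb,
                      lda, ldb, m, n - n / 2);
    }
}
```

Because every leaf of the recursion works on a contiguous-enough tile that fits in cache, the algorithm incurs <math>O(1+mn/B)</math> misses without ever reading <math>B</math> or <math>M</math>.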
Most cache-oblivious algorithms rely on a divide-and-conquer approach. They reduce the problem, so that it eventually fits in cache no matter how small the cache is, and end the recursion at some small size determined by the function-call overhead and similar cache-unrelated optimizations, and then use some cache-efficient access pattern to merge the results of these small, solved problems.
Like [[external sorting]] in the [[external memory model]], cache-oblivious sorting is possible in two variants: [[funnelsort]], which resembles [[mergesort]], and cache-oblivious distribution sort, which resembles [[quicksort]]. Like their external-memory counterparts, both achieve a [[running time]] of <math>O\left(\tfrac{N}{B}\log_{M/B}\tfrac{N}{B}\right)</math> memory transfers.
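Funnelsort's <math>k</math>-way "funnel" merging is involved; as a hedged illustration of only the shared divide-and-conquer structure, the sketch below is a plain binary [[mergesort]] in C, which is cache-oblivious but suboptimal: it incurs <math>O\left(\tfrac{N}{B}\log_2\tfrac{N}{M}\right)</math> misses, with base-2 rather than the optimal base-<math>M/B</math> logarithm. Funnelsort replaces the binary merge with funnels to close that gap.

```c
#include <string.h>

/* Recursive binary mergesort of a[0..n), using tmp[0..n) as scratch.
 * Once a subarray fits in cache, all further levels of the recursion
 * run without additional misses -- the cache-oblivious ingredient
 * that funnelsort shares. */
static void merge_sort(int *a, int *tmp, size_t n)
{
    if (n < 2)
        return;
    size_t mid = n / 2;
    merge_sort(a, tmp, mid);
    merge_sort(a + mid, tmp, n - mid);

    /* Merge the two sorted halves into tmp with three sequential
     * scans, each a cache-efficient access pattern. */
    size_t i = 0, j = mid, k = 0;
    while (i < mid && j < n)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid)
        tmp[k++] = a[i++];
    while (j < n)
        tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof *a);
}
```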
==Practicality==
An empirical comparison of 2 RAM-based, 1 cache-aware, and 2 cache-oblivious algorithms implementing [[priority queue]]s found that:<ref>{{cite thesis |title=Cache-Oblivious Algorithms in Practice |type=Master's |last1=Olsen |first1=Jesper Holm |last2=Skov |first2=Søren Christian |publisher=University of Copenhagen}}</ref>
* Cache-oblivious algorithms performed worse than RAM-based and cache-aware algorithms when data fits into main memory.
* The cache-aware algorithm did not seem significantly more complex to implement than the cache-oblivious algorithms, and offered the best performance in all cases tested in the study.
* Cache-oblivious algorithms outperformed RAM-based algorithms when data size exceeded the size of main memory.
Another study compared [[hash table]]s (as RAM-based or cache-unaware), [[B-tree]]s (as cache-aware), and a cache-oblivious data structure referred to as a "Bender set". For both execution time and memory usage, the hash table was best, followed by the B-tree, with the Bender set the worst in all cases. The memory usage for all tests did not exceed main memory. The hash tables were described as easy to implement, while the Bender set "required a greater amount of effort to implement correctly".<ref>{{cite web |last1=Verver |first1=Maks |title=Evaluation of a Cache-Oblivious Data Structure |date=23 June 2008 |url=https://fmt.ewi.utwente.nl/media/61.pdf |access-date=3 January 2022}}</ref>
==See also==
==References==
{{Reflist}}
[[Category:Analysis of algorithms]]
[[Category:Cache (computing)]]
[[Category:Models of computation]]