{{Short description|Technique in computer software design}}
In [[computer science]] and particularly in [[compiler]] design, '''loop nest optimization''' (LNO) is an optimization technique that applies a set of [[loop transformation]]s for the purpose of [[Locality of reference|locality]] optimization or parallelization or other reduction of loop overhead in loop nests. One classical usage is to reduce memory access latency or the cache bandwidth required by some common [[linear algebra]] algorithms.
The technique used to produce this optimization is called '''loop tiling''',<ref name="MuchnickAssociates1997">{{cite book|author1=Steven Muchnick|author2=Muchnick and Associates|title=Advanced Compiler Design Implementation|url=https://archive.org/details/advancedcompiler00much|url-access=registration|quote=tiling.|date=15 August 1997|publisher=Morgan Kaufmann|isbn=978-1-55860-320-2}}</ref> also known as '''loop blocking'''<ref name="CardosoDiniz2011">{{cite book|author1=João M.P. Cardoso|author2=Pedro C. Diniz|title=Compilation Techniques for Reconfigurable Architectures|url=https://books.google.com/books?id=4xwWNCiF9CgC&q=%22loop+tiling%22|date=2 April 2011|publisher=Springer Science & Business Media|isbn=978-0-387-09671-1}}</ref> or '''strip mine and interchange'''.
== Overview ==
Loop tiling partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the [[CPU Cache|cache]] until it is reused. The partitioning of the loop iteration space leads to partitioning of a large array into smaller blocks, fitting the accessed array elements into the cache, enhancing cache reuse and shrinking the working set so that it no longer exceeds the cache size.
An ordinary loop
<syntaxhighlight lang="c">
for (i = 0; i < N; ++i) {
    ...
}
</syntaxhighlight>
After blocking with a block size <code>B</code>, it becomes
<syntaxhighlight lang="c">
for (j = 0; j < N; j += B) {
    for (i = j; i < min(N, j + B); ++i) {
        ...
    }
}
</syntaxhighlight>
where <code>min()</code> is a function returning the minimum of its arguments.

== Example: matrix multiplication ==
Many large mathematical operations on computers end up spending much of their time doing [[matrix multiplication]]. The operation is:
: C = A × B
where A, B, and C are N×N arrays.
The basic loop is:
<syntaxhighlight lang="c">
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j) {
        C[i][j] = 0;
        for (k = 0; k < N; ++k)
            C[i][j] += A[i][k] * B[k][j];
    }
</syntaxhighlight>

There are three problems to solve:
* Floating point additions take some number of cycles to complete. In order to keep an [[adder (electronics)|adder]] with multiple cycle latency busy, the code must update multiple [[Accumulator (computing)|accumulators]] in parallel.
* Machines can typically do just one memory operation per [[multiply–add|multiply-add]], so values loaded must be reused at least twice.
* Typical PC memory systems can only sustain one 8-byte doubleword per 10–30 double-precision multiply–adds, so values loaded into the cache must be reused many times.
By computing a 2x2 block of C at a time, the following loop reuses each loaded value twice, so that the inner loop performs four loads and four multiply–adds, solving problem #2. Carrying four accumulators in parallel keeps a floating-point adder with a latency of up to four cycles busy, addressing problem #1. The code does not yet address the third problem, nor the cleanup work needed when N is odd; such details are left out of the discussion below.

<syntaxhighlight lang="c">
for (i = 0; i < N; i += 2)
    for (j = 0; j < N; j += 2) {
        acc00 = acc01 = acc10 = acc11 = 0;
        for (k = 0; k < N; k++) {
            acc00 += B[k][j + 0] * A[i + 0][k];
            acc01 += B[k][j + 1] * A[i + 0][k];
            acc10 += B[k][j + 0] * A[i + 1][k];
            acc11 += B[k][j + 1] * A[i + 1][k];
        }
        C[i + 0][j + 0] = acc00;
        C[i + 0][j + 1] = acc01;
        C[i + 1][j + 0] = acc10;
        C[i + 1][j + 1] = acc11;
    }
</syntaxhighlight>
This code has had both the <code>i</code> and <code>j</code> iterations blocked by a factor of two and had both of the resulting two-iteration inner loops completely unrolled.
This code would run quite acceptably on a Cray Y-MP (built in the late 1980s), which can sustain 0.8 multiply–adds per memory operation to main memory. A machine like a 2.8 GHz Pentium 4, built in 2003, has slightly less memory bandwidth and vastly better floating point, so that it can sustain 16.5 multiply–adds per memory operation. As a result, the code above will run slower on the 2.8 GHz Pentium 4 than on the 166 MHz Y-MP!
A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3x3 block instead of a 2x2 block, but the resulting code is not always faster. The loop requires registers to hold both the accumulators and the loaded and reused A and B values. A 2x2 block requires 7 registers (four accumulators, two values from B, and one from A). A 3x3 block requires 13 (nine accumulators, three values from B, and one from A), which will not work on a machine with just 8 floating point registers in the [[Instruction set|ISA]]. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop.
Matrix multiplication is like many other codes in that it can be limited by memory bandwidth, and that more registers can help the compiler and programmer reduce the need for memory bandwidth. This ''[[register pressure]]'' is why vendors of [[RISC]] CPUs, who intended to build machines more parallel than the general purpose [[x86]] and [[68000]] CPUs, adopted 32-entry floating-point [[register file]]s.
The code above does not use the cache very well. During the calculation of a horizontal stripe of C results, one horizontal stripe of A is loaded and the entire matrix B is loaded. For the entire calculation, C is stored once (that's good), A is loaded into the cache once (assuming a stripe of A fits in the cache together with a stripe of B), but B is loaded N/ib times, where ib is the height of the stripe of the C matrix, for a total of N<sup>3</sup>/ib doubleword loads from main memory. In the code above, ''ib'' is 2.
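As an illustrative calculation (the numbers here are chosen for concreteness, not taken from a benchmark): with N = 1000 and ib = 2, that is N<sup>3</sup>/ib = 5×10<sup>8</sup> doubleword loads, about 4 GB of traffic from main memory, even though the three matrices together occupy only 24 MB.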
The next step to reducing the memory traffic is to make ib as large as possible: larger than the machine's "balance" of multiply–adds per memory operation. The 2x2 blocked loop above cannot simply use a larger block, since that would require more accumulator registers, so instead the <code>i</code> loop is blocked again by an outer factor of ib:
<syntaxhighlight lang="c">
for (ii = 0; ii < N; ii += ib)
    for (j = 0; j < N; j += 2)
        for (i = ii; i < ii + ib; i += 2) {
            acc00 = acc01 = acc10 = acc11 = 0;
            for (k = 0; k < N; k++) {
                acc00 += B[k][j + 0] * A[i + 0][k];
                acc01 += B[k][j + 1] * A[i + 0][k];
                acc10 += B[k][j + 0] * A[i + 1][k];
                acc11 += B[k][j + 1] * A[i + 1][k];
            }
            C[i + 0][j + 0] = acc00;
            C[i + 0][j + 1] = acc01;
            C[i + 1][j + 0] = acc10;
            C[i + 1][j + 1] = acc11;
        }
</syntaxhighlight>
With this code, ib can be set to any desired value, and the number of loads of the B matrix will be reduced by that factor. This freedom has a cost: an ib × N slice of the A matrix must now be kept in the cache. So long as that fits, this code will not be limited by the memory system.
So what size matrix fits? The example system, a 2.8 GHz Pentium 4, has a 16KB primary data cache. With ib = 20, the slice of the A matrix in this code will be larger than the primary cache when N > 100. For problems larger than that, another trick is needed.
That trick is reducing the size of the stripe of the B matrix by blocking the k loop so that the stripe is of size ib × kb. Blocking the k loop means that the C array will be loaded and stored N/kb times, for a total of {{tmath|2*N^3/kb}} memory transfers. B is still transferred N/ib times, for {{tmath|N^3/ib}} transfers. So long as
: {{tmath|2*N/kb + N/ib < N/balance}}
the machine's memory system will keep up with the floating point unit and the code will run at maximum performance. The 16KB level-1 cache of the Pentium 4 is not quite big enough for that: if ib = 24 and kb = 64 are chosen instead, only 12KB of the cache is used. Not filling the cache completely is deliberate, since the C and B arrays need some room to flow through. These numbers come within 20% of the peak floating-point speed of the processor.
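To see where that 20% figure comes from (an illustrative back-of-the-envelope check, not a measurement): with kb = 64 and ib = 24, each multiply–add requires about {{tmath|2/kb + 1/ib \approx 0.073}} doubleword transfers, while a balance of 16.5 multiply–adds per memory operation supplies only {{tmath|1/16.5 \approx 0.061}}; the loop is therefore memory-limited to roughly 0.061/0.073 ≈ 83% of peak.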
Here is the code with loop <code>k</code> blocked.
<syntaxhighlight lang="c">
for (ii = 0; ii < N; ii += ib)
    for (kk = 0; kk < N; kk += kb)
        for (j = 0; j < N; j += 2)
            for (i = ii; i < ii + ib; i += 2) {
                if (kk == 0)
                    acc00 = acc01 = acc10 = acc11 = 0;
                else {
                    acc00 = C[i + 0][j + 0];
                    acc01 = C[i + 0][j + 1];
                    acc10 = C[i + 1][j + 0];
                    acc11 = C[i + 1][j + 1];
                }
                for (k = kk; k < kk + kb; k++) {
                    acc00 += B[k][j + 0] * A[i + 0][k];
                    acc01 += B[k][j + 1] * A[i + 0][k];
                    acc10 += B[k][j + 0] * A[i + 1][k];
                    acc11 += B[k][j + 1] * A[i + 1][k];
                }
                C[i + 0][j + 0] = acc00;
                C[i + 0][j + 1] = acc01;
                C[i + 1][j + 0] = acc10;
                C[i + 1][j + 1] = acc11;
            }
</syntaxhighlight>
The above code examples do not show the details of dealing with values of N which are not multiples of the blocking factors. Compilers which do loop nest optimization emit code to clean up the edges of the computation. For example, most LNO compilers would probably [[loop splitting|split]] the kk == 0 iteration off from the rest of the <code>kk</code> iterations, so that the kk == 0 test does not have to execute inside the hot loops.
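A minimal sketch of such edge cleanup for a single strip-mined loop (an illustration only; <code>body()</code> is a hypothetical stand-in for the real loop body):
<syntaxhighlight lang="c">
/* Strip-mined loop with a cleanup loop for the final partial block.
   body() stands in for the real loop body; B is the block size. */
void strip_mined(int N, int B, void (*body)(int))
{
    int Nfull = N - N % B;                   /* largest multiple of B <= N */

    for (int ii = 0; ii < Nfull; ii += B)    /* full blocks only */
        for (int i = ii; i < ii + B; ++i)
            body(i);

    for (int i = Nfull; i < N; ++i)          /* cleanup: leftover N % B iterations */
        body(i);
}
</syntaxhighlight>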
The above loop will only achieve 80% of peak flops on the example system when blocked for the 16KB L1 cache size. It will do worse on systems with even more unbalanced memory systems. Fortunately, the Pentium 4 has 256KB (or more, depending on the model) high-bandwidth level-2 cache as well as the level-1 cache, which leaves a choice:
* Adjust the block sizes for the level-2 cache instead. This stresses the processor's ability to keep many instructions in flight simultaneously, and there is a good chance it will be unable to achieve full bandwidth from the level-2 cache.
* Block the loops again, this time for the level-2 cache size, so that each level of the memory hierarchy sees only the bandwidth it can sustain (a skeleton of this structure is sketched below). The extra level of blocking incurs more loop overhead, which for some problem sizes may outweigh any shortfall in streaming data from the level-2 cache.
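A minimal skeleton of the second option (illustrative only; the loop body and the edge cleanup are omitted, and the block sizes <code>ib2</code> and <code>kb2</code> are hypothetical level-2 factors assumed to be multiples of <code>ib</code> and <code>kb</code>):
<syntaxhighlight lang="c">
/* Two levels of blocking: outer loops sized for the L2 cache (ib2, kb2),
   inner loops sized for the L1 cache (ib, kb) as in the example above. */
void two_level_blocking(int N, int ib, int kb, int ib2, int kb2)
{
    for (int ii2 = 0; ii2 < N; ii2 += ib2)              /* L2-sized i block */
        for (int kk2 = 0; kk2 < N; kk2 += kb2)          /* L2-sized k block */
            for (int ii = ii2; ii < ii2 + ib2; ii += ib)    /* L1-sized blocks */
                for (int kk = kk2; kk < kk2 + kb2; kk += kb)
                    ;  /* register-blocked j/i/k loops go here, as before */
}
</syntaxhighlight>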
Rather than specifically tune for one particular cache size, as in the first example, a [[cache-oblivious algorithm]] is designed to take advantage of any available cache, no matter what its size is. This automatically takes advantage of two or more levels of memory hierarchy, if available. Cache-oblivious algorithms for [[Cache-oblivious matrix multiplication|matrix multiplication]] are known.
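A minimal sketch of the recursive idea behind such algorithms (an illustration, not a tuned implementation; it assumes square power-of-two dimensions, row-major storage with leading dimension <code>ld</code>, an arbitrarily chosen base-case size, and a C matrix zeroed by the caller):
<syntaxhighlight lang="c">
/* Cache-oblivious matrix multiplication: recursively split the problem
   into quadrants until it fits in whatever cache exists, at every level
   of the hierarchy. Adds A*B into C, so C must be zeroed by the caller. */
void mm_rec(int n, int ld, const double *A, const double *B, double *C)
{
    if (n <= 16) {                       /* small base case: plain loops */
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    int h = n / 2;                       /* split each matrix into quadrants */
    const double *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B + h * ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C + h * ld + h;

    mm_rec(h, ld, A11, B11, C11);  mm_rec(h, ld, A12, B21, C11);
    mm_rec(h, ld, A11, B12, C12);  mm_rec(h, ld, A12, B22, C12);
    mm_rec(h, ld, A21, B11, C21);  mm_rec(h, ld, A22, B21, C21);
    mm_rec(h, ld, A21, B12, C22);  mm_rec(h, ld, A22, B22, C22);
}
</syntaxhighlight>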
== See also ==
== References ==
{{Reflist}}
== Further reading ==
# Wolf, M. E. and Lam, M. '' [https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15745-s05/www/lectures/wolf-lam-pldi91.pdf A Data Locality Optimizing Algorithm]''. [[PLDI]]'91, pages 30–44, 1991.
# Irigoin, F. and Triolet, R. ''[http://www.cri.ensmp.fr/classement/doc/E-090.pdf Supernode Partitioning]''. [[POPL]]'88, pages 319–329, 1988.
# Xue, J. ''[https://books.google.com/books?id=fp_xBwAAQBAJ Loop Tiling for Parallelism]''. Kluwer Academic Publishers. 2000.
# M. S. Lam, E. E. Rothberg, and M. E. Wolf. [http://people.eecs.berkeley.edu/~mme/cs267-2016/hw1/papers/p63-lam.pdf The cache performance and optimizations of blocked algorithms]. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63–74, April 1991.
{{Compiler optimizations}}
{{Authority control}}
[[Category:Compiler optimizations]]