Loop nest optimization

{{Short description|Technique in computer software design}}
{{More citations needed|date=January 2021}}
 
In [[computer science]] and particularly in [[compiler]] design, '''loop nest optimization''' (LNO) is an optimization technique that applies a set of [[loop transformation]]s to a loop nest ([[nested loop]]s, i.e. one loop inside another) for [[Locality of reference|locality]] optimization, parallelization, or other reduction of loop overhead. One classic use is to reduce the memory access latency, or the cache bandwidth, required by some common [[linear algebra]] [[algorithm]]s, by improving cache reuse.
 
The technique used to produce this optimization is called '''loop tiling''',<ref name="MuchnickAssociates1997">{{cite book|author1=Steven Muchnick|author2=Muchnick and Associates|title=Advanced Compiler Design Implementation|url=https://archive.org/details/advancedcompiler00much|url-access=registration|quote=tiling.|date=15 August 1997|publisher=Morgan Kaufmann|isbn=978-1-55860-320-2}}</ref> also known as '''loop blocking'''<ref name="CardosoDiniz2011">{{cite book|author1=João M.P. Cardoso|author2=Pedro C. Diniz|title=Compilation Techniques for Reconfigurable Architectures|url=https://books.google.com/books?id=4xwWNCiF9CgC&q=%22loop+tiling%22|date=2 April 2011|publisher=Springer Science & Business Media|isbn=978-0-387-09671-1}}</ref> or '''strip mine and interchange'''.
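As a minimal sketch of the transformation (in C, with an array size and tile size chosen purely for illustration, not tuned to any real cache), the two loops of an out-of-place transpose can be strip-mined by a block factor and interchanged:

```c
#include <assert.h>

#define N 8
#define B 4  /* tile ("block") size; a real compiler or tuner picks this to fit cache */

/* Untiled reference: out-of-place transpose. */
static void transpose(double a[N][N], double t[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            t[j][i] = a[i][j];
}

/* Tiled version: the i and j loops are strip-mined by B and interchanged,
   so the loop nest walks the arrays one B-by-B tile at a time. */
static void transpose_tiled(double a[N][N], double t[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    t[j][i] = a[i][j];
}
```

Each B×B tile of the source and destination is touched while it is still cache-resident, which is the locality the transformation is after; both versions compute the same result.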
This code has had both the <code>i</code> and <code>j</code> iterations blocked by a factor of two and had both the resulting two-iteration inner loops completely unrolled.
 
This code would run quite acceptably on a Cray Y-MP (built in the late 1980s), which can sustain 0.8&nbsp;multiply–adds per memory operation to main memory. A machine like a 2.8&nbsp;GHz Pentium&nbsp;4, built in 2003, has slightly less memory bandwidth and vastly better floating point, so that it can sustain 16.5&nbsp;multiply–adds per memory operation. As a result, the code above will run slower on the 2.8&nbsp;GHz Pentium&nbsp;4 than on the 166&nbsp;MHz Y-MP!
 
A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3×3 block instead of a 2×2 block, but the resulting code is not always faster. The loop requires registers to hold both the accumulators and the loaded and reused A and B values. A 2×2 block requires 7 registers. A 3×3 block requires 13, which will not work on a machine with just 8 floating-point registers in the [[Instruction set|ISA]]. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop.
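For concreteness, the kind of 2×2 blocking and unrolling discussed above can be sketched as follows. This is an illustrative C reconstruction, not the original listing; the matrix size and variable names are assumptions. The four accumulators, together with the reused A and B values, account for the register demand counted above:

```c
#include <assert.h>

#define N 4  /* assumed even so the 2x2 blocking divides evenly */

/* C[i][j] += sum over k of A[i][k] * B[k][j], with the i and j loops
   blocked by 2 and the resulting two-iteration inner loops fully
   unrolled into four scalar accumulators. */
static void matmul_2x2_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; j += 2) {
            double acc00 = 0, acc01 = 0, acc10 = 0, acc11 = 0;
            for (int k = 0; k < N; k++) {
                double a0 = A[i][k],   a1 = A[i + 1][k];   /* reused A values */
                double b0 = B[k][j],   b1 = B[k][j + 1];   /* reused B values */
                acc00 += a0 * b0;
                acc01 += a0 * b1;
                acc10 += a1 * b0;
                acc11 += a1 * b1;
            }
            C[i][j]         += acc00;
            C[i][j + 1]     += acc01;
            C[i + 1][j]     += acc10;
            C[i + 1][j + 1] += acc11;
        }
}
```

Each loaded element of A and B feeds two multiply–adds before it leaves a register, which is how the blocked loop raises the multiply–adds-per-memory-operation ratio.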
# Wolf, M. E. and Lam, M. '' [https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15745-s05/www/lectures/wolf-lam-pldi91.pdf A Data Locality Optimizing Algorithm]''. [[PLDI]]'91, pages 30–44, 1991.
# Irigoin, F. and Triolet, R. ''[http://www.cri.ensmp.fr/classement/doc/E-090.pdf Supernode Partitioning]''. [[POPL]]'88, pages 319–329, 1988.
# Xue, J. ''[https://books.google.com/books?id=fp_xBwAAQBAJ&printsec=frontcover&dqq=%22Loop+Tiling+for+Parallelism%22&hl=en&sa=X&ved=0ahUKEwi715iZ6ofjAhWRPM0KHQQdBsYQ6AEIKjAA#v=onepage&q=%22Loop%20Tiling%20for%20Parallelism%22&f=false Loop Tiling for Parallelism]''. Kluwer Academic Publishers. 2000.
# M. S. Lam, E. E. Rothberg, and M. E. Wolf. [http://people.eecs.berkeley.edu/~mme/cs267-2016/hw1/papers/p63-lam.pdf The cache performance and optimizations of blocked algorithms]. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63–74, April 1991.