Loop nest optimization

{{Short description|Technique in computer software design}}
{{More citations needed|date=January 2021}}
 
In [[computer science]] and particularly in [[compiler]] design, '''loop nest optimization''' (LNO) is an optimization technique that applies a set of [[loop transformation]]s to a loop nest ([[nested loop]]s, i.e. one loop inside another) for [[Locality of reference|locality]] optimization, parallelization, or other reduction of loop overhead. One classic use is to reduce the memory access latency, or the cache bandwidth, required by some common [[linear algebra]] [[algorithm]]s, by improving cache reuse.
 
The technique used to produce this optimization is called '''loop tiling''',<ref name="MuchnickAssociates1997">{{cite book|author1=Steven Muchnick|author2=Muchnick and Associates|title=Advanced Compiler Design Implementation|url=https://archive.org/details/advancedcompiler00much|url-access=registration|quote=tiling.|date=15 August 1997|publisher=Morgan Kaufmann|isbn=978-1-55860-320-2}}</ref> also known as '''loop blocking'''<ref name="CardosoDiniz2011">{{cite book|author1=João M.P. Cardoso|author2=Pedro C. Diniz|title=Compilation Techniques for Reconfigurable Architectures|url=https://books.google.com/books?id=4xwWNCiF9CgC&q=%22loop+tiling%22|date=2 April 2011|publisher=Springer Science & Business Media|isbn=978-0-387-09671-1}}</ref> or '''strip mine and interchange'''.
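As a minimal sketch of the transformation (in C, with an array size and tile size chosen purely for illustration, not tuned to any real cache), the two loops of an out-of-place transpose can be strip-mined by a block factor and interchanged:

```c
#include <assert.h>

#define N 8
#define B 4  /* tile ("block") size; a real compiler or tuner picks this to fit cache */

/* Untiled reference: out-of-place transpose. */
static void transpose(double a[N][N], double t[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            t[j][i] = a[i][j];
}

/* Tiled version: the i and j loops are strip-mined by B and interchanged,
   so the loop nest walks the arrays one B-by-B tile at a time. */
static void transpose_tiled(double a[N][N], double t[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    t[j][i] = a[i][j];
}
```

Each B×B tile of the source and destination is touched while it is still cache-resident, which is the locality the transformation is after; both versions compute the same result.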
This code has had both the <code>i</code> and <code>j</code> iterations blocked by a factor of two and had both the resulting two-iteration inner loops completely unrolled.
 
This code would run quite acceptably on a Cray Y-MP (built in the late 1980s), which can sustain 0.8&nbsp;multiply–adds per memory operation to main memory. A machine like a 2.8&nbsp;GHz Pentium&nbsp;4, built in 2003, has slightly less memory bandwidth and vastly better floating point, so that it can sustain 16.5&nbsp;multiply–adds per memory operation. As a result, the code above will run slower on the 2.8&nbsp;GHz Pentium&nbsp;4 than on the 166&nbsp;MHz Y-MP!
 
A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3×3 block instead of a 2×2 block, but the resulting code is not always faster. The loop requires registers to hold both the accumulators and the loaded and reused A and B values. A 2×2 block requires 7 registers. A 3×3 block requires 13, which will not work on a machine with just 8 floating-point registers in the [[Instruction set|ISA]]. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop.
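For concreteness, the kind of 2×2 blocking and unrolling discussed above can be sketched as follows. This is an illustrative C reconstruction, not the original listing; the matrix size and variable names are assumptions. The four accumulators, together with the reused A and B values, account for the register demand counted above:

```c
#include <assert.h>

#define N 4  /* assumed even so the 2x2 blocking divides evenly */

/* C[i][j] += sum over k of A[i][k] * B[k][j], with the i and j loops
   blocked by 2 and the resulting two-iteration inner loops fully
   unrolled into four scalar accumulators. */
static void matmul_2x2_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; j += 2) {
            double acc00 = 0, acc01 = 0, acc10 = 0, acc11 = 0;
            for (int k = 0; k < N; k++) {
                double a0 = A[i][k],   a1 = A[i + 1][k];   /* reused A values */
                double b0 = B[k][j],   b1 = B[k][j + 1];   /* reused B values */
                acc00 += a0 * b0;
                acc01 += a0 * b1;
                acc10 += a1 * b0;
                acc11 += a1 * b1;
            }
            C[i][j]         += acc00;
            C[i][j + 1]     += acc01;
            C[i + 1][j]     += acc10;
            C[i + 1][j + 1] += acc11;
        }
}
```

Each loaded element of A and B feeds two multiply–adds before it leaves a register, which is how the blocked loop raises the multiply–adds-per-memory-operation ratio.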
# Wolf, M. E. and Lam, M. '' [https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15745-s05/www/lectures/wolf-lam-pldi91.pdf A Data Locality Optimizing Algorithm]''. [[PLDI]]'91, pages 30–44, 1991.
# Irigoin, F. and Triolet, R. ''[http://www.cri.ensmp.fr/classement/doc/E-090.pdf Supernode Partitioning]''. [[POPL]]'88, pages 319–329, 1988.
# Xue, J. ''[https://books.google.com/books?id=fp_xBwAAQBAJ&printsec=frontcover&dqq=%22Loop+Tiling+for+Parallelism%22&hl=en&sa=X&ved=0ahUKEwi715iZ6ofjAhWRPM0KHQQdBsYQ6AEIKjAA#v=onepage&q=%22Loop%20Tiling%20for%20Parallelism%22&f=false Loop Tiling for Parallelism]''. Kluwer Academic Publishers. 2000.
# M. S. Lam, E. E. Rothberg, and M. E. Wolf. [http://people.eecs.berkeley.edu/~mme/cs267-2016/hw1/papers/p63-lam.pdf The cache performance and optimizations of blocked algorithms]. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63–74, April 1991.