Revision as of 07:52, 12 November 2023 by GhostInTheMachine: Changing short description from "loop blocking" to "Technique in computer software design" (Tag: Shortdesc helper)
Latest revision as of 17:19, 29 August 2024 by Max sang: m typo
Line 118: This code has had both the <code>i</code> and <code>j</code> iterations blocked by a factor of two, and both of the resulting two-iteration inner loops have been completely unrolled. This code would run quite acceptably on a Cray Y-MP (built in the late 1980s), which can sustain 0.8 multiply–adds per memory operation to main memory. A machine like a 2.8 GHz Pentium 4, built in 2003, has slightly less memory bandwidth and vastly better floating-point performance, so that it can sustain 16.5 multiply–adds per memory operation. As a result, the code above will run slower on the 2.8 GHz Pentium 4 than on the 166 MHz Y-MP! A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3x3 block instead of a 2x2 block, but the resulting code is not always faster. The loop requires registers to hold both the accumulators and the loaded and reused A and B values. A 2x2 block requires 7 registers. A 3x3 block requires 13, which will not work on a machine with just 8 floating-point registers in the [[Instruction set\|ISA]]. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a smaller blocked loop.
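The register accounting above can be illustrated with a minimal C sketch of a 2x2 register-blocked multiply. It is not necessarily the exact code the paragraph refers to: the names <code>matmul_2x2</code> and <code>block_registers</code> are hypothetical, the matrices are assumed to be square, row-major, and of even dimension, and the formula n*n + n + 1 is an assumption chosen only because it reproduces the two figures quoted above (7 and 13).

```c
#include <stddef.h>

/* Hypothetical helper: floating-point registers needed for an n x n
 * register block, assuming n*n accumulators, n values of one operand
 * held for reuse, and 1 register for the operand that is streamed.
 * This matches the 7 (2x2) and 13 (3x3) quoted in the text. */
static int block_registers(int n)
{
    return n * n + n + 1;
}

/* C = A * B for row-major n x n matrices; n is assumed to be even.
 * The i and j loops are blocked by two and the two-iteration inner
 * loops are fully unrolled, so each k step does 4 multiply-adds. */
static void matmul_2x2(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i += 2) {
        for (size_t j = 0; j < n; j += 2) {
            /* Four accumulators: one per element of the 2x2 block of C. */
            double acc00 = 0.0, acc01 = 0.0, acc10 = 0.0, acc11 = 0.0;
            for (size_t k = 0; k < n; k++) {
                /* Two B values are loaded once and reused for both rows. */
                double b0 = B[k * n + j];
                double b1 = B[k * n + j + 1];
                /* One register is reused for the two A values in turn. */
                double a = A[i * n + k];
                acc00 += a * b0;
                acc01 += a * b1;
                a = A[(i + 1) * n + k];
                acc10 += a * b0;
                acc11 += a * b1;
            }
            C[i * n + j]           = acc00;
            C[i * n + j + 1]       = acc01;
            C[(i + 1) * n + j]     = acc10;
            C[(i + 1) * n + j + 1] = acc11;
        }
    }
}
```

In this sketch the live floating-point values in the inner loop are the four accumulators, <code>b0</code>, <code>b1</code>, and <code>a</code>, which is where the count of 7 comes from. Whether a compiler actually keeps all of them in registers depends on the target and the optimization settings.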

Loop nest optimization: Difference between revisions