Loop nest optimization: Difference between revisions

Content deleted Content added
Changing short description from "loop blocking" to "Technique in computer software design"
Max sang (talk | contribs)
m typo
 
Line 118:
This code has had both the <code>i</code> and <code>j</code> iterations blocked by a factor of two and had both the resulting two-iteration inner loops completely unrolled.
 
This code would run quite acceptably on a Cray Y-MP (built in the late 1980s), which can sustain 0.8&nbsp;multiply–adds per memory operation to main memory. A machine like a 2.8&nbsp;GHz Pentium&nbsp;4, built in 2003, has slightly less memory bandwidth and vastly better floating-point performance, so that it can sustain 16.5&nbsp;multiply–adds per memory operation. As a result, the code above will run slower on the 2.8&nbsp;GHz Pentium&nbsp;4 than on the 166&nbsp;MHz Y-MP!
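The comparison above can be sketched as a small calculation. This is only an illustration: it assumes the blocked inner loop loads one value per row and one per column of the block on each <code>k</code> iteration, and the helper names <code>madds_per_memop</code> and <code>is_memory_bound</code> are hypothetical, not taken from the article.

```c
#include <assert.h>

/* Multiply-adds per memory operation in the inner k loop of an
 * r-by-c register-blocked matrix multiply.
 * Assumption: each k iteration loads r values of one operand and
 * c values of the other, and performs r*c multiply-adds. */
double madds_per_memop(int r, int c)
{
    assert(r > 0 && c > 0);
    return (double)(r * c) / (double)(r + c);
}

/* Returns nonzero when the loop supplies fewer multiply-adds per
 * memory operation than the machine can sustain, i.e. when the
 * floating-point units would wait on memory. */
int is_memory_bound(double loop_ratio, double machine_ratio)
{
    return loop_ratio < machine_ratio;
}
```

Under that assumption a 2x2 block gives a ratio of 1.0, which exceeds the Y-MP's 0.8 but falls far short of the Pentium&nbsp;4's 16.5, so <code>is_memory_bound(madds_per_memop(2, 2), 16.5)</code> is true while the same call with 0.8 is false.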
 
A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel. It is easy to change the loop above to compute a 3x3 block instead of a 2x2 block, but the resulting code is not always faster. The loop requires registers to hold both the accumulators and the loaded and reused A and B values. A 2x2 block requires 7 registers (four accumulators, two reused values from one matrix and one value from the other). A 3x3 block requires 13 (nine accumulators, three reused values and one more), which will not work on a machine with just 8 floating-point registers in the [[Instruction set|ISA]]. If the CPU does not have enough registers, the compiler will schedule extra loads and stores to spill the registers into stack slots, which will make the loop run slower than a loop with a smaller block.
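A 3x3 variant might look like the following minimal sketch. It is not the article's code: the function name <code>matmul_3x3</code>, the row-major flat-array layout and the requirement that <code>n</code> be a multiple of 3 are assumptions made for illustration. The inner loop keeps nine accumulators, three A values and one B value live, matching the count of 13 registers.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for n-by-n row-major matrices, with the i and j loops
 * blocked by three and the 3x3 inner block fully unrolled.
 * Assumption: n is a multiple of 3 (no cleanup loops are provided). */
void matmul_3x3(size_t n, const double *A, const double *B, double *C)
{
    assert(n % 3 == 0);
    for (size_t i = 0; i < n; i += 3) {
        for (size_t j = 0; j < n; j += 3) {
            /* Nine accumulators, one per element of the C block. */
            double c00 = 0, c01 = 0, c02 = 0;
            double c10 = 0, c11 = 0, c12 = 0;
            double c20 = 0, c21 = 0, c22 = 0;
            for (size_t k = 0; k < n; k++) {
                /* Three A values are loaded once and reused. */
                double a0 = A[(i + 0) * n + k];
                double a1 = A[(i + 1) * n + k];
                double a2 = A[(i + 2) * n + k];
                /* One B value at a time: the 13th live value. */
                double b = B[k * n + j + 0];
                c00 += a0 * b; c10 += a1 * b; c20 += a2 * b;
                b = B[k * n + j + 1];
                c01 += a0 * b; c11 += a1 * b; c21 += a2 * b;
                b = B[k * n + j + 2];
                c02 += a0 * b; c12 += a1 * b; c22 += a2 * b;
            }
            C[(i + 0) * n + j + 0] = c00;
            C[(i + 0) * n + j + 1] = c01;
            C[(i + 0) * n + j + 2] = c02;
            C[(i + 1) * n + j + 0] = c10;
            C[(i + 1) * n + j + 1] = c11;
            C[(i + 1) * n + j + 2] = c12;
            C[(i + 2) * n + j + 0] = c20;
            C[(i + 2) * n + j + 1] = c21;
            C[(i + 2) * n + j + 2] = c22;
        }
    }
}
```

Whether the compiler actually keeps all 13 values in registers depends on the target; on a machine with only 8 floating-point registers some of them would be spilled to the stack, as described above.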