Loop nest optimization: Difference between revisions

Content deleted Content added
BG19bot (talk | contribs)
m Remove blank line(s) between list items per WP:LISTGAP to fix an accessibility issue for users of screen readers. Do WP:GENFIXES and cleanup if needed. Discuss this at Wikipedia talk:WikiProject Accessibility#LISTGAP
Example: Matrix multiply: Wording cleanup (could use a lot more work)
Line 7:
In [[computer science]] and particularly in [[compiler]] design, '''loop nest optimization''' (LNO) is an optimization technique that applies a set of [[loop transformation]]s for the purpose of [[Locality of reference|locality]] optimization or parallelization or other loop overhead reduction of the loop nests. One classical usage is to reduce memory access latency or the cache bandwidth necessary due to cache reuse for some common linear algebra [[algorithm]]s.
 
==Example: Matrixmatrix multiplymultiplication==
Many large mathematical operations on computers end up spending much of their time doing [[matrix multiplication]]. The operation is:
 
Line 72:
The code above does not use the cache very well. During the calculation of a horizontal stripe of C results, one horizontal stripe of B is loaded and the entire matrix A is loaded. For the entire calculation, C is stored once (that's good), B is loaded into the cache once (assuming a stripe of B fits in the cache with a stripe of A), but A is loaded N/ib times, where ib is the size of the strip in the C matrix, for a total of N<sup>3</sup>/ib doubleword loads from main memory. In the code above, ''ib'' is 2.
 
The next step to reducing the memory traffic is to make ib as large as possible. It needs to be larger than the "balance" number reported by streams. In the case of one particular 2.8&nbsp;GHz Pentium&nbsp;4 system used for this example, the balance number is 16.5. The second code example above cannot be extended directly, since that would require many more accumulator registers. Instead, the loop is ''blocked'' over i. (Technically, this is actually the second time i is blocked, as the first time was the factor of 2.)
The next step to reducing the memory traffic is to make ib as large as
possible. We want it to be larger than the "balance" number reported
by streams. In the case of one particular 2.8&nbsp;GHz Pentium-4 system
used for this example, the balance number is 16.5. The
second code example above can't be extended directly, since
that would require many more accumulator registers. Instead, we ''block''
the loop over i. (Technically, this is actually the second time we've
blocked i, as the first time was the factor of 2.)
 
<source lang="c">