Revision as of 18:50, 10 March 2016 edit BG19bot (talk \| contribs) 1,005,055 edits m Remove blank line(s) between list items per WP:LISTGAP to fix an accessibility issue for users of screen readers. Do WP:GENFIXES and cleanup if needed. Discuss this at Wikipedia talk:WikiProject Accessibility#LISTGAP ← Previous edit		Revision as of 15:06, 5 April 2016 edit undo Dsimic (talk \| contribs) Extended confirmed users, Pending changes reviewers, Rollbackers 39,664 edits →Example: Matrix multiply: Wording cleanup (could use a lot more work) Next edit →
Line 7: In [[computer science]] and particularly in [[compiler]] design, '''loop nest optimization''' (LNO) is an optimization technique that applies a set of [[loop transformation]]s for the purpose of [[Locality of reference\|locality]] optimization or parallelization or other loop overhead reduction of the loop nests. One classical usage is to reduce memory access latency or the cache bandwidth necessary due to cache reuse for some common linear algebra [[algorithm]]s. ==Example: ~~Matrix~~matrix ~~multiply~~multiplication== Many large mathematical operations on computers end up spending much of their time doing [[matrix multiplication]]. The operation is: Line 72: The code above does not use the cache very well. During the calculation of a horizontal stripe of C results, one horizontal stripe of B is loaded and the entire matrix A is loaded. For the entire calculation, C is stored once (that's good), B is loaded into the cache once (assuming a stripe of B fits in the cache with a stripe of A), but A is loaded N/ib times, where ib is the size of the strip in the C matrix, for a total of N<sup>3</sup>/ib doubleword loads from main memory. In the code above, ''ib'' is 2. The next step to reducing the memory traffic is to make ib as large as possible. It needs to be larger than the "balance" number reported by streams. In the case of one particular 2.8 GHz Pentium 4 system used for this example, the balance number is 16.5. The second code example above cannot be extended directly, since that would require many more accumulator registers. Instead, the loop is ''blocked'' over i. (Technically, this is actually the second time i is blocked, as the first time was the factor of 2.) ~~The next step to reducing the memory traffic is to make ib as large as~~ ~~possible. We want it to be larger than the "balance" number reported~~ ~~by streams. In the case of one particular 2.8 GHz Pentium-4 system~~ ~~used for this example, the balance number is 16.5. The~~ ~~second code example above can't be extended directly, since~~ ~~that would require many more accumulator registers. Instead, we ''block''~~ ~~the loop over i. (Technically, this is actually the second time we've~~ ~~blocked i, as the first time was the factor of 2.)~~ <source lang="c">

Loop nest optimization: Difference between revisions