</syntaxhighlight>
The reason for this speedup is that in the first case, the reads of <code>A[i][k]</code> are in cache (since the <code>k</code> index is the contiguous, last dimension), but the reads of <code>B[k][j]</code> are not (there the <code>k</code> index is the first dimension, so consecutive iterations jump a whole row apart), and so <code>B[k][j]</code> incurs a cache miss penalty. <code>C[i][j]</code> is irrelevant, because it can be hoisted out of the inner loop, since the inner-loop variable is <code>k</code>:
<syntaxhighlight lang="pascal" line="1">
for i in 0..n
for j in 0..m
temp = C[i][j]
for k in 0..p
      temp = temp + A[i][k] * B[k][j]
C[i][j] = temp
</syntaxhighlight>
In the second case, the inner loop index is <code>j</code>, so the reads and writes of <code>C[i][j]</code> and the reads of <code>B[k][j]</code> are all sequential and in cache, and the read of <code>A[i][k]</code> can be hoisted out of the inner loop:
<syntaxhighlight lang="pascal" line="1">
for i in 0..n
for k in 0..p
temp = A[i][k]
for j in 0..m
      C[i][j] = C[i][j] + temp * B[k][j]
</syntaxhighlight>
Thus, the second example incurs no cache miss penalty in the inner loop, while the first example does.
On a typical 2014-era processor, the second case is approximately five times faster than the first, when written in [[C (programming language)|C]] and compiled with <code>gcc -O3</code>. (A careful examination of the disassembled code shows that [[GNU Compiler Collection|GCC]] uses [[SIMD]] instructions in the first case but not in the second; even so, the cache penalty far outweighs the SIMD gain.){{Citation needed|date=September 2014}}
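The two loop orders above can be written as a complete C program. This is a minimal sketch, not the article's benchmark: the matrix size <code>N</code> and the array names are illustrative, and no timing is performed, but it shows that both orders produce the same product (they even perform the same additions in the same sequence, so the floating-point results are bitwise identical).
<syntaxhighlight lang="c">
#include <stdio.h>

#define N 64  /* illustrative size; large enough to show the loop structure */

static double A[N][N], B[N][N], C1[N][N], C2[N][N];

int main(void) {
    /* Fill A and B with simple deterministic values. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)(i + j);
            B[i][j] = (double)(i - j);
        }

    /* First case: i-j-k order. The inner loop strides down a
       column of B, touching a different cache line each step. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double temp = 0.0;
            for (int k = 0; k < N; k++)
                temp = temp + A[i][k] * B[k][j];
            C1[i][j] = temp;
        }

    /* Second case: i-k-j order. A[i][k] is hoisted into temp, and the
       inner loop walks rows of C2 and B contiguously in memory. */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double temp = A[i][k];
            for (int j = 0; j < N; j++)
                C2[i][j] = C2[i][j] + temp * B[k][j];
        }

    /* Verify that both loop orders computed the same matrix. */
    int same = 1;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (C1[i][j] != C2[i][j])
                same = 0;
    printf("results match: %s\n", same ? "yes" : "no");
    return 0;
}
</syntaxhighlight>
Compiled with <code>gcc -O3</code> and timed on larger matrices, the second version illustrates the cache effect described above.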