Knuth–Morris–Pratt algorithm: Difference between revisions

Content deleted Content added
undid myself, not sure I really understand if it's O(n) or O(n+k)
m clean up, typo(s) fixed: Therefore → Therefore, (4) using AWB
Line 173:
 
===Worked example of the table-building algorithm===
We consider the example of <code>W = "ABCDABD"</code> first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set <code>T[0] = -1</code>. To find <code>T[1]</code>, we must discover a [[Substring#Suffix|proper suffix]] of <code>"A"</code> which is also a prefix of pattern <code>W</code>. But there are no proper suffixes of <code>"A"</code>, so we set <code>T[1] = 0</code>. To find <code>T[2]</code>, we see that the substring <code>W[0]</code> - <code>W[1]</code> (<code>"AB"</code>) has a proper suffix <code>"B"</code>. However "B" is not a prefix of the pattern <code>W</code>. Therefore, we set <code>T[2] = 0</code>.
 
Continuing to <code>T[3]</code>, we first check the proper suffix of length 1, and as in the previous case it fails. Should we also check longer suffixes? No, we now note that there is a shortcut to checking ''all'' suffixes: let us say that we discovered a [[Substring#Suffix|proper suffix]] which is a [[Substring#Prefix|proper prefix]] (A proper prefix of a string is not equal to the string itself) and ending at <code>W[2]</code> with length 2 (the maximum possible); then its first character is also a proper prefix of <code>W</code>, hence a proper prefix itself, and it ends at <code>W[1]</code>, which we already determined did not occur as <code>T[2] = 0</code> and not <code>T[2] = 1</code>. Hence at each stage, the shortcut rule is that one needs to consider checking suffixes of a given size m+1 only if a valid suffix of size m was found at the previous stage (i.e. <code>T[x] = m</code>) and should not bother to check m+2, m+3, etc.
 
Therefore, we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so <code>T[3] = 0</code>.
 
We pass to the subsequent <code>W[4]</code>, <code>'A'</code>. The same logic shows that the longest substring we need consider has length 1 but in this case <code>'A'</code> ''does'' work (corresponding to <code>W[4]</code> <code>S</code> character not <code>'A'</code>); hence <code>T[4] = -1</code>. This means that we may shift pattern <code>W</code> by match length plus one character.
Line 185:
Finally, we see that the next character in the ongoing segment starting at <code>W[4] = 'A'</code> would be <code>'B'</code>, and indeed this is also <code>W[5]</code>. Furthermore, the same argument as above shows that we need not look before <code>W[4]</code> to find a segment for <code>W[6]</code>, so that this is it, and we take <code>T[6] = 2</code>.
 
Therefore, we compile the following table:
 
{| class="wikitable" style="background-color:white; font-family:monospace; text-align:right"
Line 367:
 
===Efficiency of the table-building algorithm===
The complexity of the table algorithm is <code>O(k)</code>, where <code>k</code> is the length of <code>W</code>. As except for some initialization all the work is done in the <code>'''while'''</code> loop, it is sufficient to show that this loop executes in <code>O(k)</code> time, which will be done by simultaneously examining the quantities <code>pos</code> and <code>pos - cnd</code>. In the first branch, <code>pos - cnd</code> is preserved, as both <code>pos</code> and <code>cnd</code> are incremented simultaneously, but naturally, <code>pos</code> is increased. In the second branch, <code>cnd</code> is replaced by <code>T[cnd]</code>, which we saw [[#Efficiency of the search algorithm|above]] is always strictly less than <code>cnd</code>, thus increasing <code>pos - cnd</code>. Since <code>pos &ge; pos - cnd</code>, this means that at each stage either <code>pos</code> or a lower bound for <code>pos</code> increases; therefore since the algorithm terminates once <code>pos = k</code>, it must terminate after at most <code>2k</code> iterations of the loop, since <code>pos - cnd</code> begins at <code>1</code>. Therefore, the complexity of the table algorithm is <code>O(k)</code>.
 
==Efficiency of the KMP algorithm==