Knuth–Morris–Pratt algorithm: Difference between revisions

Content deleted Content added
Tag: section blanking
Tag: section blanking
Line 67:
</code>
This time we are able to complete the match, whose first character is <code>S[15]</code>.
 
===Efficiency of the search algorithm===
Assuming the prior existence of the table <code>T</code>, the search portion of the Knuth–Morris–Pratt algorithm has [[Computational complexity theory|complexity]] [[Linear_time#Linear_time|O(n)]], where <code>n</code> is the length of <code>S</code> and the <code>O</code> is [[big-O notation]]. Except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the <code>'''while'''</code> loop. To bound the number of iterations of this loop; observe that <code>T</code> is constructed so that if a match which had begun at <code>S[m]</code> fails while comparing <code>S[m + i]</code> to <code>W[i]</code>, then the next possible match must begin at <code>S[m + (i - T[i])]</code>. In particular the next possible match must occur at a higher index than <code>m</code>, so that <code>T[i] < i</code>.
 
This fact implies that the loop can execute at most <code>2n</code> times. For, in each iteration, it executes one of the two branches in the loop. The first branch invariably increases <code>i</code> and does not change <code>m</code>, so that the index <code>m + i</code> of the currently scrutinized character of <code>S</code> is increased. The second branch adds <code>i - T[i]</code> to <code>m</code>, and as we have seen, this is always a positive number. Thus the ___location <code>m</code> of the beginning of the current potential match is increased. Now, the loop ends if <code>m + i = n</code>; therefore each branch of the loop can be reached at most <code>k</code> times, since they respectively increase either <code>m + i</code> or <code>m</code>, and <code>m &le; m + i</code>: if <code>m = n</code>, then certainly <code>m + i &ge; n</code>, so that since it increases by unit increments at most, we must have had <code>m + i = n</code> at some point in the past, and therefore either way we would be done.
 
Thus the loop executes at most <code>2n</code> times, showing that the time complexity of the search algorithm is <code>O(n)</code>.
 
Here is another way to think about the runtime:
Let us say we begin to match W and S at position i and p, if W exists as a substring of S at p, then W[0 through m] == S[p through p+m].
Upon success, that is, the word and the text matched at the positions(W[i] == S[p+i]), we increase i by 1 (i++).
Upon failure, that is, the word and the text does not match at the positions(W[i] != S[p+i]), the text pointer is kept still, while the word pointer roll-back a certain amount(i = T[i], where T is the jump table) And we attempt to match <code>W[T[i]]</code> with <code>S[p+i]</code>.
The maximum number of roll-back of i is bounded by i, that is to say, for any failure, we can only roll-back as much as we have progressed up to the failure.
Then it is clear the runtime is 2n.
 
=="Partial match" table (also known as "failure function")==