Knuth–Morris–Pratt algorithm: Difference between revisions

Content deleted Content added
Ryan Reich (talk | contribs)
m Fix typesetting changed in previous modification
Ryan Reich (talk | contribs)
Shorten the verbal algorithm, reorder the efficiency calculation and the code example, and rewrite the calculation.
Line 33:
i: 0123456
 
Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set <math>m = 1715</math>, to start at the two-character string 'AB' leading up to the current position, set <math>i = 2</math>, and continue matching from the current position.
 
m: 01234567890123456789012
Line 40:
i: 0123456
 
This time we are able to complete the match, and return the position 1715 as its origin.
 
===The search algorithm===
Line 47:
 
# Let <math>i = m = 0</math>; say <math>P</math> has <math>n</math> characters, and <math>S</math>, <math>l</math> characters.
# If <math>m + i = l</math>, then we exit; there is no match. Otherwise, compare <math>P[i]</math> versus <math>S[m + i]</math>. At this point we have already matched <math>i</math> previous characters.
#* If they are equal, set <math>i = i + 1</math>. If <math>i = n</math> then we have found a full match; terminate the algorithm and return <math>m</math> as the start of the match.
#* If they are unequal, let <math>e = T[i - 1]</math> be the entry in the "partial match" table described below at position <math>i - 1</math>. Set <math>m = m + i - e</math> and if <math>i > 0</math>, set <math>i = e</math>.
# Return to step 2.
 
This clearly implements the algorithm performed in the example; at each mismatch, we consult the table for the next known possible beginning of a new match and update our counters accordingly. Thus we need never backtrack; in particular, each character is matched only once (though potentially ''mis''matched many times; an analysis of the efficiency of this algorithm is given below).
 
===The efficiency of the search algorithm===
 
Assuming the prior existence of the table <math>T</math>, the search portion of the Knuth-Morris-Pratt algorithm has [[complexity]] [[Big-O notation|O]]<math>(l)</math>, where <math>l</math> is the length of <math>S</math>. We compute this by counting the number of times each step can execute. The crucial stages are the two branches of step 2; the total number of times they execute is the number of times step 2 itself executes. The important fact about the second branch is that the value of <math>m</math> is guaranteed to increase by at least 1. Since we set <math>m = m + (i - e)</math>, <math>m</math> increases if and only if <math>e = T[i - 1] < i</math>. By definition, <math>T[i - 1]</math> is the length of a ''proper'' terminal substring of the <math>i</math> characters <math>P[0], \dots, P[i - 1]</math>; hence <math>T[i - 1]</math> is at most <math>i - 1 < i</math>. This argument holds when <math>i > 0</math>; when <math>i = 0</math> we have <math>T[i - 1] = -1 < 0 = i</math>, so the inequality again holds. Therefore <math>m</math> does indeed increase in the second branch, so this branch is executed at most <math>l</math> times, as the algorithm terminates when <math>m + i = l</math>.
 
The second branch of step 2 executes once each time a comparison is not a match; the first branch, likewise, executes once for each match. It is clear from the operations performed in the second branch that <math>m + i</math> is nondecreasing with successive iterations of step 2; therefore, since <math>i</math> actually increases each time the first branch executes, the index <math>m + i</math> of the character currently under scrutiny increases between successive iterations of the first branch. Hence a character which has once been matched is never subjected to comparison again. It follows that the first branch also executes at most once for each character of <math>S</math>, hence no more than <math>l</math> times. So step 2 executes at most <math>2l</math> times. As step 1 executes merely once, and step 3 executes each time step 2 does except when the algorithm exits, the number of steps performed by the algorithm in searching <math>S</math> is at most <math>4l</math>; hence the search algorithm exits in <math>O(l)</math> time.
 
===A code example of the search algorithm===
Line 84 ⟶ 78:
}
}
 
===The efficiency of the search algorithm===
 
Assuming the prior existence of the table <math>T</math>, the search portion of the Knuth-Morris-Pratt algorithm has [[complexity]] [[Big-O notation|O]]<math>(l)</math>, where <math>l</math> is the length of <math>S</math>. As except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the central loop, we will calculate a bound on the number of iterations of this loop; in order to do this we first make a key observation about the nature of <math>T</math>. By definition it is constructed so that if a match which had begun at <math>S[m]</math> fails while comparing <math>S[m + i]</math> to <math>P[i]</math>, then the next possible match must begin at <math>S[m + (i - T[i])]</math>. In particular the next possible match must occur at a higher index than <math>m</math>, so that <math>T[i] < i</math>.
 
Using this fact, we will show that the loop can execute at most <math>l</math> times. For in each iteration, it executes one of the two branches of the <tt>if</tt> statement. The first branch invariably increases <math>i</math> and does not change <math>m</math>, so that the index <math>m + i</math> of the currently scrutinized character of <math>S</math> is increased. The second branch adds <math>i - T[i]</math> to <math>m</math>, and as we have seen, this is always a positive number. Thus the ___location <math>m</math> of the beginning of the current potential match is increased. Now, the loop ends if <math>S[m + i] = '\backslash 0'</math>, which following the C convention that a null character denotes the end-of-string, means that <math>m + i = l</math>. Therefore each branch of the <tt>if</tt> statement can be reached at most <math>l</math> times, since they respectively increase either <math>m + i</math> or <math>m</math>, and <math>m \leq m + i</math> so that if <math>m = l</math>, then certainly <math>m + i \geq l</math>, so that since it increases by unit increments at most, we must have had <math>m + i = l</math> at some point in the past.
 
Thus the loop executes at most <math>2l</math> times, showing that the time complexity of the search algorithm is <math>O(l)</math>.
 
==The "partial match" table==