Content deleted Content added
Ryan Reich (talk | contribs) m Fix typesetting changed in previous modification |
Ryan Reich (talk | contribs) Shorten the verbal algorithm, reorder the efficiency calculation and the code example, and rewrite the calculation. |
||
Line 33:
i: 0123456
Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set <math>m =
m: 01234567890123456789012
Line 40:
i: 0123456
This time we are able to complete the match, and return the position
===The search algorithm===
Line 47:
# Let <math>i = m = 0</math>; say <math>P</math> has <math>n</math> characters, and <math>S</math>, <math>l</math> characters.
# If <math>m + i = l</math>, then we exit; there is no match. Otherwise, compare <math>P[i]</math> versus <math>S[m + i]</math>
#* If they are equal, set <math>i = i + 1</math>. If <math>i = n</math> then we have found a full match; terminate the algorithm and return <math>m</math> as the start of the match.
#* If they are unequal, let <math>e = T[i - 1]
# Return to step 2.
This clearly implements the algorithm performed in the example; at each mismatch, we consult the table for the next known possible beginning of a new match and update our counters accordingly. Thus we need never backtrack; in particular, each character is matched only once (though potentially ''mis''matched many times; an analysis of the efficiency of this algorithm is given below).
===The efficiency of the search algorithm===▼
===A code example of the search algorithm===
Line 84 ⟶ 78:
}
}
▲===The efficiency of the search algorithm===
Assuming the prior existence of the table <math>T</math>, the search portion of the Knuth-Morris-Pratt algorithm has [[complexity]] [[Big-O notation|O]]<math>(l)</math>, where <math>l</math> is the length of <math>S</math>. As except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the central loop, we will calculate a bound on the number of iterations of this loop; in order to do this we first make a key observation about the nature of <math>T</math>. By definition it is constructed so that if a match which had begun at <math>S[m]</math> fails while comparing <math>S[m + i]</math> to <math>P[i]</math>, then the next possible match must begin at <math>S[m + (i - T[i])]</math>. In particular the next possible match must occur at a higher index than <math>m</math>, so that <math>T[i] < i</math>.
Using this fact, we will show that the loop can execute at most <math>l</math> times. For in each iteration, it executes one of the two branches of the <tt>if</tt> statement. The first branch invariably increases <math>i</math> and does not change <math>m</math>, so that the index <math>m + i</math> of the currently scrutinized character of <math>S</math> is increased. The second branch adds <math>i - T[i]</math> to <math>m</math>, and as we have seen, this is always a positive number. Thus the ___location <math>m</math> of the beginning of the current potential match is increased. Now, the loop ends if <math>S[m + i] = '\backslash 0'</math>, which following the C convention that a null character denotes the end-of-string, means that <math>m + i = l</math>. Therefore each branch of the <tt>if</tt> statement can be reached at most <math>l</math> times, since they respectively increase either <math>m + i</math> or <math>m</math>, and <math>m \leq m + i</math> so that if <math>m = l</math>, then certainly <math>m + i \geq l</math>, so that since it increases by unit increments at most, we must have had <math>m + i = l</math> at some point in the past.
Thus the loop executes at most <math>2l</math> times, showing that the time complexity of the search algorithm is <math>O(l)</math>.
==The "partial match" table==
|