Content deleted Content added
m open/close tags switched |
Ryan Reich (talk | contribs) Tighten up the search algorithm and add a section on the time complexity. |
||
Line 5:
==An example of the algorithm==
To motivate the technical details, we consider a specific instance. In this article, we will be using [[Array#Indices into Arrays|zero-based arrays]] for our strings; thus the 'C' in <math>P</math> will be denoted <math>P[2]</math>. Let <math>m</math> be the start of the currently matched substring within <math>S</math> and <math>i</math> the position within <math>P</math>.
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
i: 0123456
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
i: 0123456
We quickly obtain a nearly complete match 'ABCDAB' when, at <math>i = 6
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
i: 0123456
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we increment <math>m = 11</math>, reset <math>i = 0</math>, and begin searching anew.
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
i: 0123456
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
i: 0123456
This time we are able to complete the match, and return the position 17 as its origin.
▲This search fails immediately, however, as the pattern still does not contain a ' ', so as in the first trial, we increment <math>m = 11</math> and reset <math>i = 0</math>, <math>j = 11</math>, and begin searching anew. Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set <math>m = 17</math>, to start at the two-character string 'AB' leading up to the current position, set <math>i = 2</math>, and continue matching from the current position. This time we are able to complete the match, and return the position 17 as its origin.
==The searching algorithm==
Line 22 ⟶ 46:
The above example is completely instructive in this regard. Once the table is compiled the search is trivial and can be done without ever repeatedly matching any character of the main string. It proceeds as follows:
# Let <math>i
#
#* If they are equal, set <math>i = i
#* If they are unequal, let <math>e = T[i - 1]</math> be
# Return to step 2.
Line 88 ⟶ 112:
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. Here follows the algorithm.
# Set <math>T[-1] = T[0] = -1</math>. This takes care of special conditions; we cannot compute <math>T[-1]</math> by searching for substrings, and our algorithm would give the incorrect answer 0 for <math>T[0]</math>
# Set <math>i = 1</math>, <math>j = -1</math>.
# If <math>i = n</math> then we are done; terminate the algorithm. Otherwise,
#* If they are equal, set <math>T[i] = j + 1</math>, <math>j = j + 1</math>, and <math>i = i + 1</math>.
#* If not, and if <math>j =
#* If not, and if <math>j \geq 0</math>, set <math>j = T[j]</math>.
# Return to step 3.
==Algorithmic complexity==
Let <math>n</math> be the length of <math>P</math> and <math>l</math> be the length of <math>S</math>. Then the [[complexity]] of the search algorithm is <math>O(l)</math> and the complexity of the table algorithm is <math>O(n)</math>.
We compute the complexity of the search algorithm by counting the number of times each step can execute. The crucial stages are the two branches of step 2; the total number of times they execute is the number of times step 2 itself executes. The important fact about the second branch is that the value of <math>m</math> is guaranteed to increase by at least 1. This is clear if <math>T[i - 1] = -1</math>. In the case that <math>T[i - 1] \geq 0</math>, <math>m</math> increases if and only if <math>T[i - 1] < i</math>. By definition, <math>T[i - 1]</math> is the length of a ''proper'' terminal substring of the substring <math>P[0], \dots, P[i - 1]</math>, which has length <math>i</math>; hence any proper substring must have length at most <math>i - 1 < i</math>. Therefore <math>m</math> does indeed increase in the second branch, so this branch is executed at most <math>l</math> times, as the algorithm terminates when <math>m + i = l</math>.
The second branch of step 2 executes once each time a comparison is not a match; the first branch, likewise, executes once for each match. It is clear from the operations performed in the second branch that <math>m + i</math> is nondecreasing with successive iterations of step 2; therefore, since <math>i</math> actually increases each time the first branch executes, the index <math>m + i</math> of the character currently under scrutiny increases between successive iterations of the first branch. Hence a character which has once been matched is never subjected to comparison again. It follows that the first branch also executes at most once for each character of <math>S</math>, hence no more than <math>l</math> times. So step 2 executes at most <math>2l</math> times. As step 1 executes merely once, and step 3 executes each time step 2 does except when the algorithm exits, the number of steps performed by the algorithm in searching <math>S</math> is at most <math>4l</math>; hence the search algorithm exits in <math>O(l)</math> time.
The complexity of the table algorithm is <math>O(n)</math>. As above, it is sufficient to show that step 3 executes in <math>O(n)</math> time, which will be done by examining the quantity <math>i - j - 1</math>. In the first branch, this quantity is preserved, as both <math>i</math> and <math>j</math> are incremented simultaneously. In the second branch, <math>i</math> is incremented and <math>j</math> is not, so <math>i - j - 1</math> increases. In the third branch, <math>j</math> is replaced by <math>T[j]</math>, which we saw above is always strictly less than <math>j</math>. Therefore <math>i - j - 1</math> always increases. Since <math>i \geq i - j - 1</math> (because <math>j \geq -1</math>), and since the algorithm terminates once <math>i = n</math>, this implies that it terminates after fewer than <math>n</math> iterations of step 3, since <math>i - j - 1</math> begins at <math>1</math>. Therefore the complexity of the table algorithm is <math>O(n)</math>.
The two complexities combined give an overall complexity of <math>O(n + l)</math> for the Knuth-Morris-Pratt algorithm.
==External links==
|