Knuth–Morris–Pratt algorithm: Difference between revisions

Content deleted Content added
m open/close tags switched
Ryan Reich (talk | contribs)
Tighten up the search algorithm and add a section on the time complexity.
Line 5:
==An example of the algorithm==
 
To motivate the technical details, we consider a specific instance. In this article, we will be using [[Array#Indices into Arrays|zero-based arrays]] for our strings; thus the 'C' in <math>P</math> will be denoted <math>P[2]</math>. Let <math>m</math> be the start of the currently matched substring within <math>S</math> and <math>i</math> the position within <math>P</math>.
To motivate the technical details, we consider a specific instance.
 
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
i: 0123456
 
InWe thisbegin article,by wematching willthe befirst usingthree [[Array#Indicescharacters intoone Arrays|zero-basedafter arrays]]another. for ourThus strings; thusin the 'C'fourth instep, <math>Pm = 0</math> willand be<math>i denoted= 3</math>P. But <math>S[23]</math>. is Leta space and <math>mP[3] = 'D'</math>, beso thewe starthave ofa themismatch. currently matchedRather substringthan withinbeginning again <math>Sm = 1</math>, <math>i</math>we thenote positionthat withinno 'A' occurs between positions 0 and 3 in <math>P</math> except at 0; hence, andhaving checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, setting <math>jm = 4</math> the absolute position withinand <math>Si = 0</math>.
 
m: 01234567890123456789012
We begin by matching the first three characters one after another. Thus in the fourth step, <math>m = 0</math>, <math>i = 3</math>, <math>j = 3</math>. But <math>S[3]</math> is a space and <math>P[3] = 'D'</math>, so we have a mismatch. Rather than beginning again <math>m = 1</math>, we note that no 'A' occurs between positions 0 and 3 in <math>P</math> except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, setting <math>m = 4</math>, <math>i = 0</math>, <math>j = 4</math>.
S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
i: 0123456
 
We quickly obtain a nearly complete match 'ABCDAB' when, at <math>i = 6</math>, <math>j = 10</math> we again have a discrepancy. However, just prior to the end of the current partial match we passed an 'AB', which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset <math>m = 8</math>, <math>i = 2</math> and continue matching the current character.
 
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
i: 0123456
 
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we increment <math>m = 11</math>, reset <math>i = 0</math>, and begin searching anew.
 
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
i: 0123456
 
This search fails immediately, however, as the pattern still does not contain a ' ', so as in the first trial, we increment <math>m = 11</math> and reset <math>i = 0</math>, <math>j = 11</math>, and begin searching anew. Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set <math>m = 17</math>, to start at the two-character string 'AB' leading up to the current position, set <math>i = 2</math>, and continue matching from the current position. This time we are able to complete the match, and return the position 17 as its origin.
 
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
i: 0123456
 
This time we are able to complete the match, and return the position 17 as its origin.
This search fails immediately, however, as the pattern still does not contain a ' ', so as in the first trial, we increment <math>m = 11</math> and reset <math>i = 0</math>, <math>j = 11</math>, and begin searching anew. Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set <math>m = 17</math>, to start at the two-character string 'AB' leading up to the current position, set <math>i = 2</math>, and continue matching from the current position. This time we are able to complete the match, and return the position 17 as its origin.
 
==The searching algorithm==
Line 22 ⟶ 46:
The above example is completely instructive in this regard. Once the table is compiled the search is trivial and can be done without ever repeatedly matching any character of the main string. It proceeds as follows:
 
# Let <math>i = j = m = 0</math>,; and letsay <math>P</math> havehas <math>n</math> characters, and <math>S</math>, <math>l</math> characters.
# CompareIf <math>m + i = l</math>, then we exit; there is no match. Otherwise, compare <math>P[i]</math> versus <math>S[jm + i]</math>. At this point we have already matched <math>i</math> previous characters.
#* If they are equal, set <math>i = i + 1</math>, <math>j = j + 1</math>. If <math>i = n</math> then we have found a full match; terminate the algorithm and return <math>m</math> as the start of the match.
#* If they are unequal, let <math>e = T[i - 1]</math> be anthe entry in the "partial match" table described below, at position <math>i - 1</math>. SetIf <math>e \neq -1</math>, set <math>m = jm + (i - e)</math>, and <math>i = e</math>; if <math>e \neq= -1</math>, orset <math>m = jm + 1</math>, <math>j = ji + 1</math>, and <math>i = 0</math> if <math>e = -1</math>.
# Return to step 2.
 
Line 88 ⟶ 112:
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. Here follows the algorithm.
 
# Set <math>T[-1] = T[0] = -1</math>. This takes care of special conditions; we cannot compute <math>T[-1]</math> by searching for substrings, and our algorithm would give the incorrect answer 0 for <math>T[0]</math> would give the incorrect answer 0 as this is the unique case in which a single character is ''not'' a proper initial substring. Again, let <math>P</math> have <math>n</math> characters.
# Set <math>i = 1</math>, <math>j = -1</math>.
# If <math>i = n</math> then we are done; terminate the algorithm. Otherwise, set <math>j = T[i - 1]</math> and compare <math>P[i]</math> with <math>P[j + 1]</math>.
#* If they are equal, set <math>T[i] = j + 1</math>, <math>j = j + 1</math>, and <math>i = i + 1</math>.
#* If not, and if <math>j = 0-1</math>, set <math>T[i] = -1</math>, <math>i = i + 1</math>.
#* If not, and if <math>j \geq 0</math>, set <math>j = T[j]</math>.
# Return to step 3.
 
==Algorithmic complexity==
==Notes on the Algorithm==
 
Let <math>n</math> be the length of <math>P</math> and <math>l</math> be the length of <math>S</math>. Then the [[complexity]] of the search algorithm is <math>O(l)</math> and the complexity of the table algorithm is <math>O(n)</math>.
 
We compute the complexity of the search algorithm by counting the number of times each step can execute. The crucial stages are the two branches of step 2; the total number of times they execute is the number of times step 2 itself executes. The important fact about the second branch is that the value of <math>m</math> is guaranteed to increase by at least 1. This is clear if <math>T[i - 1] = -1</math>. In the case that <math>T[i - 1] \geq 0</math>, <math>m</math> increases if and only if <math>T[i - 1] < i</math>. By definition, <math>T[i - 1]</math> is the length of a ''proper'' terminal substring of the substring <math>P[0], \dots, P[i - 1]</math>, which has length <math>i</math>; hence any proper substring must have length at most <math>i - 1 < i</math>. Therefore <math>m</math> does indeed increase in the second branch, so this branch is executed at most <math>l</math> times, as the algorithm terminates when <math>m + i = l</math>.
 
The second branch of step 2 executes once each time a comparison is not a match; the first branch, likewise, executes once for each match. It is clear from the operations performed in the second branch that <math>m + i</math> is nondecreasing with successive iterations of step 2; therefore, since <math>i</math> actually increases each time the first branch executes, the index <math>m + i</math> of the character currently under scrutiny increases between successive iterations of the first branch. Hence a character which has once been matched is never subjected to comparison again. It follows that the first branch also executes at most once for each character of <math>S</math>, hence no more than <math>l</math> times. So step 2 executes at most <math>2l</math> times. As step 1 executes merely once, and step 3 executes each time step 2 does except when the algorithm exits, the number of steps performed by the algorithm in searching <math>S</math> is at most <math>4l</math>; hence the search algorithm exits in <math>O(l)</math> time.
This algorithm builds the next array as a [[finite state machine]] and uses it to dictate to the algorithm what to do at each point.
 
The complexity of the table algorithm is <math>O(n)</math>. As above, it is sufficient to show that step 3 executes in <math>O(n)</math> time, which will be done by examining the quantity <math>i - j - 1</math>. In the first branch, this quantity is preserved, as both <math>i</math> and <math>j</math> are incremented simultaneously. In the second branch, <math>i</math> is incremented and <math>j</math> is not, so <math>i - j - 1</math> increases. In the third branch, <math>j</math> is replaced by <math>T[j]</math>, which we saw above is always strictly less than <math>j</math>. Therefore <math>i - j - 1</math> always increases. Since <math>i \geq i - j - 1</math> (because <math>j \geq -1</math>), and since the algorithm terminates once <math>i = n</math>, this implies that it terminates after fewer than <math>n</math> iterations of step 3, since <math>i - j - 1</math> begins at <math>1</math>. Therefore the complexity of the table algorithm is <math>O(n)</math>.
The [[complexity]] of processing the pattern is O(''m''), where ''m'' is the length of the pattern.
 
The two complexities combined give an overall complexity of <math>O(n + l)</math> for the Knuth-Morris-Pratt algorithm.
The [[complexity]] of searching the string for the given pattern is O(''n'') where ''n'' is the length of the main string.
 
==External links==