Knuth–Morris–Pratt algorithm: Difference between revisions

Ryan Reich (talk | contribs)
Tighten up the search algorithm and add a section on the time complexity.
Ryan Reich (talk | contribs)
Yet another revision of the algorithm. Never underestimate off-by-one errors.
The algorithm was invented by [[Donald Knuth|Knuth]] and [[Vaughan Pratt|Pratt]], and independently by [[J. H. Morris]]; they published it jointly in [[1977]].
 
==The KMP Algorithm==
===An example of the algorithm===
 
To motivate the technical details, we consider a specific instance. In this article, we will be using [[Array#Indices into Arrays|zero-based arrays]] for our strings; thus the 'C' in <math>P</math> will be denoted <math>P[2]</math>. Let <math>m</math> be the start of the currently matched substring within <math>S</math> and <math>i</math> the position within <math>P</math>.
 m: 01234567890123456789012
 S: ABC ABCDAB ABCDABCDABDE
 P: ABCDABD
 i: 0123456
 
We begin by matching the first three characters one after another. Thus in the fourth step, <math>m = 0</math> and <math>i = 3</math>. But <math>S[3]</math> is a space and <math>P[3] = 'D'</math>, so we have a mismatch. Rather than beginning again at <math>m = 1</math>, we note that no 'A' occurs between positions 0 and 3 in <math>S</math> except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, setting <math>m = 4</math> and <math>i = 0</math>.
 
 m: 01234567890123456789012
 S: ABC ABCDAB ABCDABCDABDE
 P:     ABCDABD
 i:     0123456
 
We quickly obtain a nearly complete match 'ABCDAB' when, at <math>i = 6</math>, we again have a discrepancy. However, just prior to the end of the current partial match, we passed an 'AB', which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset <math>m = 8</math>, <math>i = 2</math> and continue matching the current character.
 
This new attempt fails at once, since at <math>i = 2</math> the comparison again falls on the space; a few more applications of the same reasoning eventually bring us to a match:

 m: 01234567890123456789012
 S: ABC ABCDAB ABCDABCDABDE
 P:                ABCDABD
 i:                0123456
This time we are able to complete the match, and return the position 15 as its origin.
 
===The searching algorithm===
 
The above example is completely instructive in this regard. We assume the existence of a "partial match" table, described below, which indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. For the moment the table, <math>T</math>, should be taken as a [[black box]] with the property that if we have a match starting at <math>S[m]</math> that fails when comparing <math>S[m + i]</math> to <math>P[i]</math>, then the next possible match will start at <math>m + i - T[i - 1]</math>. In particular, <math>T[-1]</math> is defined and equals <math>-1</math>. Knowing this, the algorithm is very simple:
 
# Let <math>i = m = 0</math>; say <math>P</math> has <math>n</math> characters, and <math>S</math>, <math>l</math> characters.
# If <math>m + i = l</math>, then we exit; there is no match. Otherwise, compare <math>P[i]</math> versus <math>S[m + i]</math>. At this point we have already matched <math>i</math> previous characters.
#* If they are equal, set <math>i = i + 1</math>. If <math>i = n</math> then we have found a full match; terminate the algorithm and return <math>m</math> as the start of the match.
#* If they are unequal, let <math>e = T[i - 1]</math> be the entry in the "partial match" table described below at position <math>i - 1</math>. If <math>e \neq -1</math>, set <math>m = m + (i - e)</math> and <math>i = e</math>; if <math>e = -1</math>, set <math>m = m + i + 1</math> and <math>i = 0</math>.
# Return to step 2.
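For concreteness, the search loop can be sketched in Python. This is a sketch, not part of the original description: the table is represented here as a dict so that the entry at index <math>-1</math> can be written naturally, and the sample values shown are those compiled for 'ABCDABD' below.

```python
def kmp_search(s, p, t):
    """Table-driven search for pattern p in string s.

    t maps positions -1 .. len(p)-1 to the "partial match" values; in
    particular t[-1] == -1, which is why a dict (not a list) is used.
    Returns the start of the first full match, or -1 if there is none.
    """
    n, l = len(p), len(s)
    m = i = 0                      # m: start of current attempt, i: position in p
    while m + i < l:
        if p[i] == s[m + i]:       # first branch: the characters match
            i += 1
            if i == n:
                return m           # full match found, beginning at s[m]
        else:                      # second branch: consult the table
            e = t[i - 1]
            if e != -1:
                m, i = m + (i - e), e
            else:                  # mismatch at the very start of the pattern
                m, i = m + i + 1, 0
    return -1                      # ran off the end of s: no match

# Partial match table for 'ABCDABD', as compiled below.
table = {-1: -1, 0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0}
```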
 
This clearly implements the algorithm performed in the example; at each mismatch, we consult the table for the next known possible beginning of a new match and update our counters accordingly. Thus we need never backtrack; in particular, each character is matched only once (though potentially ''mis''matched many times; an analysis of the efficiency of this algorithm is given below).
 
===The efficiency of the search algorithm===
 
Assuming the prior existence of the table <math>T</math>, the search portion of the Knuth-Morris-Pratt algorithm has [[complexity]] <math>O(l)</math>, where <math>l</math> is the length of <math>S</math>. We compute this by counting the number of times each step can execute. The crucial stages are the two branches of step 2; the total number of times they execute is the number of times step 2 itself executes. The important fact about the second branch is that the value of <math>m</math> is guaranteed to increase by at least 1. Since we set <math>m = m + (i - e)</math>, <math>m</math> increases if and only if <math>e = T[i - 1] < i</math>. By definition, <math>T[i - 1]</math> is the length of a ''proper'' terminal substring of the <math>i</math> characters <math>P[0], \dots, P[i - 1]</math>; hence <math>T[i - 1]</math> is at most <math>i - 1 < i</math>. This argument holds when <math>i > 0</math>; when <math>i = 0</math> we have <math>T[i - 1] = -1 < 0 = i</math>, so the inequality again holds. Therefore <math>m</math> does indeed increase in the second branch, so this branch is executed at most <math>l</math> times, as the algorithm terminates when <math>m + i = l</math>.
 
The second branch of step 2 executes once each time a comparison is not a match; the first branch, likewise, executes once for each match. It is clear from the operations performed in the second branch that <math>m + i</math> is nondecreasing with successive iterations of step 2; therefore, since <math>i</math> actually increases each time the first branch executes, the index <math>m + i</math> of the character currently under scrutiny increases between successive iterations of the first branch. Hence a character which has once been matched is never subjected to comparison again. It follows that the first branch also executes at most once for each character of <math>S</math>, hence no more than <math>l</math> times. So step 2 executes at most <math>2l</math> times. As step 1 executes merely once, and step 3 executes each time step 2 does except when the algorithm exits, the number of steps performed by the algorithm in searching <math>S</math> is at most <math>4l</math>; hence the search algorithm exits in <math>O(l)</math> time.
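The bound on the number of comparisons can also be observed empirically. The following sketch (an illustration under the same conventions as above, not part of the original analysis) instruments a table-driven search with a counter:

```python
def count_comparisons(s, p, t):
    """Run the table-driven search and count character comparisons.

    Returns (position of first match or -1, number of comparisons);
    the analysis above bounds the count by 2 * len(s).
    """
    n, l = len(p), len(s)
    m = i = comparisons = 0
    while m + i < l:
        comparisons += 1           # one comparison per execution of step 2
        if p[i] == s[m + i]:
            i += 1
            if i == n:
                return m, comparisons
        else:
            e = t[i - 1]
            m, i = (m + (i - e), e) if e != -1 else (m + i + 1, 0)
    return -1, comparisons

# Table for 'ABCDABD'; searching the 23-character example string makes
# far fewer than 2 * 23 = 46 comparisons.
t = {-1: -1, 0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0}
pos, count = count_comparisons("ABC ABCDAB ABCDABCDABDE", "ABCDABD", t)
```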
 
==The "partial match" table==
Line 58 ⟶ 64:
The goal of the table is to allow the algorithm not to check any character of the main string more than once. The key observation about the nature of a linear search that allows this to happen is that, having checked some segment of the main string against an ''initial segment'' of the pattern, we know exactly at which places prior to the current position a new potential match which could continue to the current position could begin. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.
 
We want to be able to look up, for each pattern position, the length of the longest possible initial match that ends in the current position other than the full match that probably just failed. Hence entry <math>T[i]</math> of the table is exactly the length of the longest ''proper'' initial segment of <math>P</math> which is also a terminal segment of the substring ending at <math>P[i]</math>. We use the convention that the empty string has length 0. Since a mismatch at the very start of the pattern is a special case (there is no possibility of backtracking), we set <math>T[-1] = -1</math>, as discussed above.
 
We consider the example of 'ABCDABD' first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set <math>T[-1] = -1</math>. Since the only terminal segment of 'A' which is also a ''proper'' initial segment of the pattern is the empty string, we set <math>T[0] = 0</math>. To find <math>T[1]</math>, we must discover a proper terminal substring of 'AB' which is also an initial substring of the pattern. But the only proper terminal substring of 'AB' is 'B', which is not an initial substring of the pattern, so we set <math>T[1] = 0</math>.
 
Continuing to 'C', we note that there is a shortcut to checking ''all'' terminal substrings: let us say that we discovered a terminal substring ending at 'C' with length 2; then its first character is an initial substring of an initial substring of the pattern, hence an initial substring itself...and it ends at 'B', which we already determined cannot occur. Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so <math>T[2] = 0</math>.
 
Similarly for 'D', giving <math>T[3] = 0</math>, so we pass to the subsequent 'A'. The same logic shows that the longest substring we need consider has length 1, and in this case 'A' ''does'' work; thus <math>T[4] = 1</math>. Considering now the following 'B', we exercise the following logic: if we were to find a subpattern beginning before the previous 'A' yet continuing to the current 'B', then in particular it would itself have a proper initial segment ending at 'A' yet beginning before 'A', which contradicts the fact that we already found that 'A' itself is the earliest occurrence of a proper subpattern ending at it. Therefore we need not look before that 'A' to find a subpattern for 'B'. In fact, checking it, we find that it continues to 'B' and that 'B' is the second entry of the substring of which 'A' is the first. Therefore the entry for 'B' in <math>T</math> is one more than the entry for 'A', namely <math>T[5] = 2</math>.
 
Finally, the pattern does ''not'' continue from 'B' to 'D'. Similar reasoning as above shows that if we were to find a subpattern longer than 1 at 'D' then it must comprise a subpattern ending at 'B'; since the current one does not work, this one would have to be shorter. But the current one is an initial segment of the pattern ending at the second position, so this potential new subpattern would be as well, and we already decided there were none of those. Since 'D' itself is not a subpattern, <math>T[6] = 0</math>.
 
Therefore we compile the following table:
 
{| border="1" cellpadding="2"
!<math>i</math>
| -1
|0
|1
|2
|3
|4
|5
|6
|-
!<math>P[i]</math>
| -
|A
|B
|C
|D
|A
|B
|D
|-
!<math>T[i]</math>
| -1
|0
|0
|0
|0
|1
|2
|0
|}
 
===The table algorithm===
 
Although the algorithm makes the eventual search as efficient as may reasonably be expected (each character of the main string is matched exactly once), it may seem that the work has been hidden in the compilation of the table itself. This is not so, as this is ''also'' efficient.
 
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. Here follows the algorithm; in it, <math>j</math> plays the role that <math>T[i - 1]</math> plays in the search: it is the length of the initial segment of <math>P</math> currently being extended.

# Set <math>T[-1] = -1</math> and <math>T[0] = 0</math>; the only proper initial segment ending at <math>P[0]</math> is the empty string, of length 0. As before, let <math>P</math> have <math>n</math> characters.
# Set <math>i = 1</math>, <math>j = 0</math>.
# If <math>i = n</math> then we are done; terminate the algorithm. Otherwise, compare <math>P[i]</math> with <math>P[j]</math>.
#* If they are equal, set <math>T[i] = j + 1</math>, <math>j = j + 1</math>, and <math>i = i + 1</math>.
#* If not, and if <math>j > 0</math>, set <math>j = T[j - 1]</math>.
#* Otherwise, set <math>T[i] = 0</math> and <math>i = i + 1</math>.
# Return to step 3.
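The table-building procedure can be sketched in Python as follows (a sketch, using a dict so that the conventional entry at index <math>-1</math> reads naturally):

```python
def build_table(p):
    """Compile the "partial match" table for pattern p.

    t[-1] = -1 by convention; for i >= 0, t[i] is the length of the
    longest proper initial segment of p which is also a terminal
    segment of p[0..i].
    """
    n = len(p)
    t = {-1: -1, 0: 0}
    i, j = 1, 0        # j: length of the initial segment currently being extended
    while i < n:
        if p[i] == p[j]:           # the segment extends by one character
            t[i] = j + 1
            j += 1
            i += 1
        elif j > 0:                # fall back to the next shorter candidate
            j = t[j - 1]
        else:                      # no candidate survives: record length 0
            t[i] = 0
            i += 1
    return t
```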
 
===Efficiency of the table-building algorithm===
 
The complexity of the table algorithm is <math>O(n)</math>, where <math>n</math> is the length of <math>P</math>. As, except for some initialization, all the work is done in step 3, it is sufficient to show that step 3 executes in <math>O(n)</math> time; this will be done by simultaneously examining the quantities <math>i</math> and <math>i - j</math>. In the first branch, <math>i - j</math> is preserved, as both <math>i</math> and <math>j</math> are incremented simultaneously, but naturally, <math>i</math> is increased. In the second branch, <math>j</math> is replaced by <math>T[j - 1]</math>, which we saw above is always strictly less than <math>j</math>, thus increasing <math>i - j</math>. In the third branch, <math>i</math> is incremented and <math>j</math> is not, so both <math>i</math> and <math>i - j</math> increase. Since <math>i \geq i - j</math>, this means that at each stage either <math>i</math> or a lower bound for <math>i</math> increases; therefore since the algorithm terminates once <math>i = n</math>, it must terminate after at most <math>2n</math> iterations of step 3. Therefore the complexity of the table algorithm is <math>O(n)</math>.
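As with the search, the bound is easy to observe by instrumenting a table-building loop with a counter (a sketch; each pass of the loop body corresponds to one execution of step 3):

```python
def build_table_counting(p):
    """Compile the partial match table, counting iterations of the
    main loop; the argument above bounds the count by 2 * len(p)."""
    n = len(p)
    t = {-1: -1, 0: 0}
    i, j = 1, 0
    iterations = 0
    while i < n:                   # each pass is one execution of step 3
        iterations += 1
        if p[i] == p[j]:
            t[i], j, i = j + 1, j + 1, i + 1
        elif j > 0:
            j = t[j - 1]
        else:
            t[i], i = 0, i + 1
    return t, iterations

t, iterations = build_table_counting("ABCDABD")
```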
 
==Efficiency of the KMP algorithm==
 
Since the two portions of the algorithm have, respectively, complexities of <math>O(l)</math> and <math>O(n)</math>, the complexity of the overall algorithm is <math>O(n + l)</math>.
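Putting the two portions together gives a self-contained sketch of the whole procedure: an <math>O(n)</math> table construction followed by an <math>O(l)</math> scan of the main string (again an illustrative sketch under the conventions used in this article):

```python
def kmp(s, p):
    """Complete Knuth-Morris-Pratt search. Returns the position of the
    first occurrence of pattern p in string s, or -1 if there is none."""
    if not p:
        return 0                   # the empty pattern matches immediately
    n, l = len(p), len(s)
    # Compile the partial match table (a dict, so the -1 entry reads naturally).
    t = {-1: -1, 0: 0}
    i, j = 1, 0
    while i < n:
        if p[i] == p[j]:
            t[i], j, i = j + 1, j + 1, i + 1
        elif j > 0:
            j = t[j - 1]
        else:
            t[i], i = 0, i + 1
    # Scan the main string, falling back through the table on mismatches.
    m = i = 0
    while m + i < l:
        if p[i] == s[m + i]:
            i += 1
            if i == n:
                return m
        else:
            e = t[i - 1]
            m, i = (m + (i - e), e) if e != -1 else (m + i + 1, 0)
    return -1
```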
 
==External links==