Knuth–Morris–Pratt algorithm: Difference between revisions

Content deleted Content added
Ryan Reich (talk | contribs)
Completely rewritten for clarity and to provide a more explicit formulation of the algorithm.
Line 1:
{{cleanup}}
 
The '''Knuth-Morris-Pratt [[string searching algorithm]]''' locates a pattern of characters within a string. It uses the fact that after finding the first mismatch between pattern and string we already know the characters compared before to improve the efficiency of the search. We therefore will have to compare the pattern to itself to determine how many positions we have to move to the left.
 
The algorithm was invented by [[Donald Knuth|Knuth]] and [[Vaughan Pratt|Pratt]] and independently by [[J. H. Morris]] in [[1976]].
 
==An example of the algorithm==
==Overview of Knuth-Morris-Pratt Algorithm==
 
To motivate the technical details, we consider a specific instance. Take:
* Pre-compute table of next possible matching positions
* Try to match the characters
** If there is a mismatch, use the look-up table to find where to start matching next
*** If the relevant part of the table is -1, increment the position on the main string
*** If any other number, begin comparison at this ___location on the pattern but at the current position on the main string
** If there is a complete match, you can find the start position by taking the pattern index of the last matched character away from the current position
** After a complete match, check the last position in the next array: this will tell you if you can attempt another match starting in this string again
* Return matches
 
ABCmain ABCstring: ABC ABDABABCDAB ABCDABCDABDE
==Example==
pattern: ABCDABD
 
We begin by matching the first three characters one after another. Thus in the fourth step, the main position is 0, the current position is 3, and the pattern position is 3. But the current character is a space ' ' and the pattern character a 'D', so we have a mismatch. Rather than beginning again from main position 1, we note that no 'A' occurs between pattern positions 0 and 3 except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character and set the main position to 4, and start checking again from pattern position 0.
String: ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD
 
We quickly obtain a nearly complete match 'ABCDAB' when, at pattern position 6 and main position 10, we again have a discrepancy. However, just prior to the end of the current partial match we passed an 'AB', which could be the beginning of a new match, so we must take this into consideration. However, we already know that these characters match the two characters prior to the current position, so we need not check them again; we simply reset the main position to 8 and continue searching from pattern position 2 at the current position.
===Building 'next' table===
 
This search fails immediately, however, as the pattern still does not contain a ' ', so as in the first trial, we increment the main position to 11, reset the pattern position to 0, and begin searching anew. Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set the main position to start at the two-character string 'AB' leading up to the current position (hence main position 17), set the pattern position to 2, and continue matching from the current position. This time we are able to complete the match, and return the index 17 as its origin.
 
==The searching algorithm==
 
The above example is completely instructive in this regard. Once the table is compiled the search is trivial and can be done without ever repeatedly matching any character of the main string. It proceeds as follows:
 
# Let <math>i = m = 0</math>. ''i'' will be the pattern position and ''m'' the main position.
# Assume inductively that we have matched characters <math>0, \dots, i - 1</math> of the pattern against characters <math>m, \dots, m + i - 1</math> of the main string, and are currently considering the next one. If ''i'' is the length of the pattern we have found a match; we return ''m'' as its origin and terminate. Otherwise, we compare pattern character ''i'' with main character ''m + i''.
#* If they are equal, set <math>i = i + 1</math>.
#* If they are unequal, consult the "partial match" table described below, at position ''i - 1'', and obtain the index ''e'' stored there. Set <math>m = m + i - e, i = e</math>.
*# Return matchesto step 2.
 
This clearly implements the algorithm performed in the example; at each mismatch, we consult the table for the next known possible beginning of a new match and update our counters accordingly. As promised, once any character is matched it is never considered again.
 
==The "partial match" table==
 
The goal of the table is to allow the algorithm not to check any character of the main string more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an ''initial segment'' of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.
 
We want to be able to look up, for each pattern position, the length of the longest possible initial match that ends in the current position other than the full match that probably just failed. Hence position ''n'' of the table is exactly the length of the longest possible initial segment of the pattern which is also a terminal segment of the substring ending at position ''n'' of the pattern. For those positions which ''could not'' be part of a subpattern, we set the entry to 0, reflecting the fact that there are no fallbacks should the match fail there. To cover the possiblity of a mismatch in position 0, we also include an entry for index -1, which is also 0.
 
The table for 'ABCDABD' is as follows:
 
{| border="1" cellpadding="2"
!index
| -1
|0
|1
Line 32 ⟶ 46:
|5
|6
|7
|-
!char
| -
|A
|B
Line 42 ⟶ 56:
|B
|D
| -
|-
!pos
| -10
|0
|0
Line 55 ⟶ 68:
|}
 
The first four entries are 0, since the only subpattern they appear in is the entire pattern. The number under the 'A' in position 4 is 1, because 'A' itself is a proper initial substring of the pattern ending at position 4, and has length 1. The number under the following 'B' is 2 as 'AB' is a proper initial substring ending at position 5, with length 2. As 'D' is not in any subpattern, it receives a 0.
This is the precomputed table of offsets generated by the algorithm. If we look at the table, we can see that these numbers are the maximum number of characters matchable before the ''n''th position on the string (without matching the entire prefix of ''n'' characters).
 
For example take position 6
 
ABCDABD
ABCDABD
^
 
The pointer is at position 6. As seen above, the maximum number of characters that can be matched before position 6 is 2. Thus, the array entry of 6 is 2.
 
===Processing text===
 
We now have the table so we can use the algorithm to check the string for matches:
 
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
 
We find that the first three match, but the fourth does not, so we look up the fourth character (or index 3) in the table, this gives us zero. We should now try matching position zero on the pattern with the fourth character on the string:
 
===BuildingCompiling 'next'the table===
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
 
Although the algorithm makes the eventual search as efficient as may reasonably be expected (each character of the main string is matched exactly once), it may seem that the work has been hidden in the compilation of the table itself. This is not so, as this is ''also'' efficient.
This time the match fails on the first character (or index 0) on the pattern, which means we go to the lookup table, the table is -1 therefore we now try the fifth character on the main string:
 
We consider the example of 'ABCDABD' first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. Let <math>T(n)</math> denote the table entry in position ''n'', where <math>T(0) = 0</math>. We therefore begin by calculating <math>T(1)</math>, meaning that we must discover a proper terminal substring of 'AB' which is also an initial substring of the pattern. But the only proper terminal substring of 'AB' is 'B', which is not an initial substring of the pattern, so we set <math>T(1) = 0</math>.
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
 
Continuing to 'C', we note that there is a shortcut to checking ''all'' terminal substrings: let us say that we discovered a terminal substring ending at 'C' with length 2; then its first character is an initial substring of an initial substring of the pattern, hence an initial substring itself...and it ends at 'B', which we already determined cannot occur. Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so <math>T(2) = 0</math>.
We continue to increment in this way, sometimes matching, sometimes not, until we reach the end of 'ABCDAB'.
 
Similarly for 'D', giving <math>T(3) = 0</math>, so we pass to the subsequent 'A'. The same logic shows that the longest substring we need consider has length 1, and in this case 'A' ''does'' work; thus <math>T(4) = 1</math>. Considering now the following 'B', we exercise the following logic: if we were to find a subpattern beginning before the previous 'A' yet continuing to the current 'B', then in particular it would itself have a proper initial segment ending at 'A' yet beginning before 'A', which contradicts the fact that we already found that 'A' itself is the earliest occurrence of a proper subpattern ending at it. Therefore we need not look before that 'A' to find a subpattern for 'B'. In fact, checking it, we find that it continues to 'B' and that 'B' is the second entry of the substring of which 'A' is the first. Therefore the entry for 'B' in 'T' is one more than the entry for 'A', namely <math>T(5) = 2</math>.
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
 
Finally, the pattern does ''not'' continue from 'B' to 'D'. Similar reasoning as above shows that if we were to find a subpattern longer than 1 at 'D' then it must comprise a subpattern ending at 'B'; since the current one does not work, this one would have to be shorter. But the current one is an initial segment of the pattern ending at the second position, so this potential new subpattern would be as well, and we already decided there were none of those. Since 'D' itself is not a subpattern, <math>T(6) = 0</math>.
This last 'D' does not match, so we look up in the table, find 2 so we start from position 2:
 
==The table algorithm==
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
 
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. Here follows the algorithm.
So we match the remaining characters and when we reach the end of the pattern we know it is a match.
 
# Set <math>T(-1) = 0</math>.
We check the last piece of the array (index 7) since we have had a match, it contains position zero. So we check the last 'E' against 'A', it is not a match and we have finished searching the string.
# Assume inductively that we have calculated <math>T(-1), \dots, T(n - 1)</math> and are currently calculating <math>T(n)</math>. Furthermore, we have chosen each of the previous values such that the longest proper initial segment of the pattern which is also a substring terminating in character ''k'' of the pattern has length <math>T(k)</math> (and the empty string has length 0).
# Set <math>i = n - 1</math>.
# Set <math>i = T(i)</math>. At this point, we know that the longest initial segment of the pattern that could possibly terminate at character ''n'' must start no earlier than ''n - i''. In particular, character ''n'' will be character ''i'' in the continued pattern, which would be character ''i'' of the entire pattern.
#* If character ''n'' equals character ''i'' then the subpattern terminating at ''n - 1'' continues to ''n'', and we set <math>T(n) = i + 1, n = n + 1</math>, and return to step 2.
#* If not, then the subpattern does ''not'' continue, and we must find the next candidate. If <math>i > 0</math> then as in the example above at the final 'D', we may begin searching for strings of length ''T(i)'' ending at character ''i - 1''. Return to step 4 if so. If <math>i = 0</math>, then there is nothing left to search, so set <math>T(n) = 0, n = n + 1</math>, and return to step 2.
 
==Notes on the Algorithm==
Line 105 ⟶ 97:
This algorithm builds the next array as a [[finite state machine]] and uses it to dictate to the algorithm what to do at each point.
 
The [[complexity]] of processing the pattern is oO(''m''), where ''m'' is the numberlength of characters in the stringpattern.
 
The [[complexity]] of searching the string for the given pattern is oO(''n'') where ''n'' is the length of the main string.
 
==External links==