Knuth–Morris–Pratt algorithm

This is an old revision of this page, as edited by Ryan Reich (talk | contribs) at 16:20, 4 February 2005 (Just a few typos and spacing corrected.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

The Knuth-Morris-Pratt string searching algorithm searches for occurrences of a "pattern" string within a main string by employing the simple observation that when a mismatch occurs, we have enough knowledge simply by possessing the pattern to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

The algorithm was invented by Knuth and Pratt and independently by J. H. Morris in 1976

An example of the algorithm

To motivate the technical details, we consider a specific instance.

S: ABC ABCDAB ABCDABCDABDE
P: ABCDABD

In this article, we will be using zero-based arrays for our strings; thus the 'C' in   will be denoted  . Let   be the start of the currently matched substring within  ,   the position within  , and   the absolute position within  .

We begin by matching the first three characters one after another. Thus in the fourth step,  ,  ,  . But   is a space and  , so we have a mismatch. Rather than beginning again  , we note that no 'A' occurs between positions 0 and 3 in   except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, setting  ,  ,  .

We quickly obtain a nearly complete match 'ABCDAB' when, at  j = 10</math> we again have a discrepancy. However, just prior to the end of the current partial match we passed an 'AB', which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset  ,   and continue matching the current character.

This search fails immediately, however, as the pattern still does not contain a ' ', so as in the first trial, we increment   and reset  ,  , and begin searching anew. Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set  , to start at the two-character string 'AB' leading up to the current position, set  , and continue matching from the current position. This time we are able to complete the match, and return the position 17 as its origin.

The searching algorithm

The above example is completely instructive in this regard. Once the table is compiled the search is trivial and can be done without ever repeatedly matching any character of the main string. It proceeds as follows:

  1. Let  , and let   have   characters.
  2. Compare   versus  . At this point we have already matched   previous characters.
    • If they are equal, set  ,  . If   then we have found a full match; terminate the algorithm and return   as the start of the match.
    • If they are unequal, let   be an entry in the "partial match" table described below, at position  . Set  , and   if  , or  ,  , and   if  .
  3. Return to step 2.

This clearly implements the algorithm performed in the example; at each mismatch, we consult the table for the next known possible beginning of a new match and update our counters accordingly. As promised, once any character is matched it is never considered again.

The "partial match" table

The goal of the table is to allow the algorithm not to check any character of the main string more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an initial segment of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.

We want to be able to look up, for each pattern position, the length of the longest possible initial match that ends in the current position other than the full match that probably just failed. Hence position   of the table is exactly the position of   within the longest possible initial segment of   which is also a terminal segment of the substring ending at  . For those positions which could not be part of a subpattern, we  , reflecting the fact that there are no fallbacks should the match fail there. To cover the possiblity of a mismatch in position 0, we also include an entry for index -1, which is also -1.

The table for 'ABCDABD' is as follows:

index -1 0 1 2 3 4 5 6
char - A B C D A B D
pos -1 -1 -1 -1 -1 0 1 -1

The first four entries are -1, since the only subpattern they appear in is the entire pattern. The number under the 'A' in position 4 is 0, because 'A' itself is a proper initial substring of the pattern ending at position 4, and   occurs at position 0 of it. The number under the following 'B' is 1 as 'AB' is a proper initial substring ending at position 5, and   occurs at position 1 of it. As 'D' is not in any subpattern, it receives a -1.

Compiling the table

Although the algorithm makes the eventual search as efficient as may reasonably be expected (each character of the main string is matched exactly once), it may seem that the work has been hidden in the compilation of the table itself. This is not so, as this is also efficient.

We consider the example of 'ABCDABD' first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set  . Since   is at the end of only the complete initial segment, we set   as well. To find  , we must discover a proper terminal substring of 'AB' which is also an initial substring of the pattern. But the only proper terminal substring of 'AB' is 'B', which is not an initial substring of the pattern, so we set  .

Continuing to 'C', we note that there is a shortcut to checking all terminal substrings: let us say that we discovered a terminal substring ending at 'C' with length 2; then its first character is an initial substring of an initial substring of the pattern, hence an initial substring itself...and it ends at 'B', which we already determined cannot occur. Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so  .

Similarly for 'D', giving  , so we pass to the subsequent 'A'. The same logic shows that the longest substring we need consider has length 1, and in this case 'A' does work; thus  . Considering now the following 'B', we exercise the following logic: if we were to find a subpattern beginning before the previous 'A' yet continuing to the current 'B', then in particular it would itself have a proper initial segment ending at 'A' yet beginning before 'A', which contradicts the fact that we already found that 'A' itself is the earliest occurrence of a proper subpattern ending at it. Therefore we need not look before that 'A' to find a subpattern for 'B'. In fact, checking it, we find that it continues to 'B' and that 'B' is the second entry of the substring of which 'A' is the first. Therefore the entry for 'B' in 'T' is one more than the entry for 'A', namely  .

Finally, the pattern does not continue from 'B' to 'D'. Similar reasoning as above shows that if we were to find a subpattern longer than 1 at 'D' then it must comprise a subpattern ending at 'B'; since the current one does not work, this one would have to be shorter. But the current one is an initial segment of the pattern ending at the second position, so this potential new subpattern would be as well, and we already decided there were none of those. Since 'D' itself is not a subpattern,  .

The table algorithm

The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. Here follows the algorithm.

  1. Set  . This takes care of special conditions; we cannot compute   by searching for substrings, and our algorithm for   would give the incorrect answer 0 as this is the unique case in which a single character is not a proper initial substring. Again, let   have   characters.
  2. Set  .
  3. If   then we are done; terminate the algorithm. Otherwise, set   and compare   with  .
    • If they are equal, set  ,  .
    • If not, and if  , set  ,  .
  4. Return to step 3.

Notes on the Algorithm

This algorithm builds the next array as a finite state machine and uses it to dictate to the algorithm what to do at each point.

The complexity of processing the pattern is O(m), where m is the length of the pattern.

The complexity of searching the string for the given pattern is O(n) where n is the length of the main string.