Knuth–Morris–Pratt algorithm

The Knuth-Morris-Pratt string searching algorithm locates a pattern of characters within a string. It uses the fact that after finding the first mismatch between pattern and string we already know the characters compared before to improve the efficiency of the search. We therefore will have to compare the pattern to itself to determine how many positions we have to move to the left.

The algorithm was invented by Knuth and Pratt and independently by J. H. Morris in 1976.

Overview of Knuth-Morris-Pratt Algorithm

Pre-compute table of next possible matching positions
Try to match the characters
- If there is a mismatch, use the look-up table to find where to start matching next
  - If the relevant part of the table is -1, increment the position on the main string
  - If any other number, begin comparison at this ___location on the pattern but at the current position on the main string
- If there is a complete match, you can find the start position by taking the pattern index of the last matched character away from the current position
- After a complete match, check the last position in the next array: this will tell you if you can attempt another match starting in this string again
Return matches

Example

String:    ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD

Building 'next' table

index	0	1	2	3	4	5	6	7
char	A	B	C	D	A	B	D	-
pos	-1	0	0	0	0	1	2	0

This is the precomputed table of offsets generated by the algorithm. If we look at the table, we can see that these numbers are the maximum number of characters matchable before the nth position on the string (without matching the entire prefix of n characters).

For example take position 6

ABCDABD
    ABCDABD
      ^

The pointer is at position 6. As seen above, the maximum number of characters that can matched before position 6 is 2. Thus, the array entry of 6 is 2.

Processing text

We now have the table so we can use the algorithm to check the string for matches:

ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
   ^

We find that the first three match, but the fourth does not, so we look up the fourth character (or index 3) in the table, this gives us zero. We should now try matching position zero on the pattern with the fourth character on the string:

ABC ABC ABC ABDAB ABCDABCDABDE
   ABCDABD
   ^

This time the match fails on the first character (or index 0) on the pattern, which means we go to the lookup table, the table is -1 therefore we now try the fifth character on the main string:

ABC ABC ABC ABDAB ABCDABCDABDE
    ABCDABD
    ^

We continue to increment in this way, sometimes matching, sometimes not, until we reach the end of 'ABCDAB'.

ABC ABC ABC ABDAB ABCDABCDABDE
                  ABCDABD
                        ^

This last 'D' does not match, so we look up in the table, find 2 so we start from position 2:

ABC ABC ABC ABDAB ABCDABCDABDE
                      ABCDABD
                        ^

So we match the remaining characters and when we reach the end of the pattern we know it is a match.

We check the last piece of the array (index 7) since we have had a match, it contains position zero. So we check the last 'E' against 'A', it is not a match and we have finished searching the string.

Notes on the Algorithm

This algorithm builds the next array as a finite state machine and uses it to dictate to the algorithm what to do at each point.

The complexity of processing the pattern is o(m), where m is the number of characters in the string.

The complexity of searching the string for the given pattern is o(n) where n is the length of the main string.