Content deleted Content added
Ryan Reich (talk | contribs) Completely rewritten for clarity and to provide a more explicit formulation of the algorithm. |
Ryan Reich (talk | contribs) A less catastrophic post-catastrophe update to clarify the clarified explanation by replacing many English words by a few mathematical symbols, and also to fix the algorithm. |
||
Line 1:
The '''Knuth-Morris-Pratt [[string searching algorithm]]'''
The algorithm was invented by [[Donald Knuth|Knuth]] and [[Vaughan Pratt|Pratt]] and independently by [[J. H. Morris]] in [[1976]]
==An example of the algorithm==
To motivate the technical details, we consider a specific instance.
In this article, we will be using [[Array#Indices into Arrays|zero-based arrays]] for our strings; thus the 'C' in <math>P</math> will be denoted <math>P[2]</math>. Let <math>m</math> be the start of the currently matched substring within <math>S</math>, <math>i</math> the position within <math>P</math>, and <math>j</math> the absolute position within <math>S</math>.
We begin by matching the first three characters one after another. Thus in the fourth step, the main position is 0, the current position is 3, and the pattern position is 3. But the current character is a space ' ' and the pattern character a 'D', so we have a mismatch. Rather than beginning again from main position 1, we note that no 'A' occurs between pattern positions 0 and 3 except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character and set the main position to 4, and start checking again from pattern position 0.▼
▲We begin by matching the first three characters one after another. Thus in the fourth step,
We quickly obtain a nearly complete match 'ABCDAB' when, at pattern position 6 and main position 10, we again have a discrepancy. However, just prior to the end of the current partial match we passed an 'AB', which could be the beginning of a new match, so we must take this into consideration. However, we already know that these characters match the two characters prior to the current position, so we need not check them again; we simply reset the main position to 8 and continue searching from pattern position 2 at the current position.▼
▲We quickly obtain a nearly complete match 'ABCDAB' when, at
This search fails immediately, however, as the pattern still does not contain a ' ', so as in the first trial, we increment the main position to 11, reset the pattern position to 0, and begin searching anew. Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set the main position to start at the two-character string 'AB' leading up to the current position (hence main position 17), set the pattern position to 2, and continue matching from the current position. This time we are able to complete the match, and return the index 17 as its origin.▼
▲This search fails immediately, however, as the pattern still does not contain a ' ', so as in the first trial, we increment
==The searching algorithm==
Line 20 ⟶ 22:
The above example is completely instructive in this regard. Once the table is compiled the search is trivial and can be done without ever repeatedly matching any character of the main string. It proceeds as follows:
# Let <math>i = j = m = 0</math>
# Compare <math>P[i]</math> versus <math>S[j]</math>. At this point we have already matched <math>i</math> previous characters.
#* If they are equal, set <math>i = i + 1</math>, <math>j = j + 1</math>. If <math>i = n</math> then we have found a full match; terminate the algorithm and return <math>m</math> as the start of the match.
#* If they are unequal,
# Return to step 2.
Line 32 ⟶ 34:
The goal of the table is to allow the algorithm not to check any character of the main string more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an ''initial segment'' of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.
We want to be able to look up, for each pattern position, the length of the longest possible initial match that ends in the current position other than the full match that probably just failed. Hence position
The table for 'ABCDABD' is as follows:
Line 58 ⟶ 60:
|-
!pos
|
|
|
|
|
|
|
|
|}
The first four entries are
==Compiling the table==
Line 74 ⟶ 76:
Although the algorithm makes the eventual search as efficient as may reasonably be expected (each character of the main string is matched exactly once), it may seem that the work has been hidden in the compilation of the table itself. This is not so, as this is ''also'' efficient.
We consider the example of 'ABCDABD' first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons.
Continuing to 'C', we note that there is a shortcut to checking ''all'' terminal substrings: let us say that we discovered a terminal substring ending at 'C' with length 2; then its first character is an initial substring of an initial substring of the pattern, hence an initial substring itself...and it ends at 'B', which we already determined cannot occur. Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so <math>T
Similarly for 'D', giving <math>T
Finally, the pattern does ''not'' continue from 'B' to 'D'. Similar reasoning as above shows that if we were to find a subpattern longer than 1 at 'D' then it must comprise a subpattern ending at 'B'; since the current one does not work, this one would have to be shorter. But the current one is an initial segment of the pattern ending at the second position, so this potential new subpattern would be as well, and we already decided there were none of those. Since 'D' itself is not a subpattern, <math>T
==The table algorithm==
Line 86 ⟶ 88:
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. Here follows the algorithm.
# Set <math>T[-1] = T[0] = -1</math>. This takes care of special conditions; we cannot compute <math>T[-1]</math> by searching for substrings, and our algorithm for <math>T[0]</math> would give the incorrect answer 0 as this is the unique case in which a single character is ''not'' a proper initial substring. Again, let <math>P</math> have <math>n</math> characters.
# Set <math>
# If <math>i = n</math> then we are done; terminate the algorithm. Otherwise, set <math>j = T[i - 1]</math> and compare <math>P[i]</math> with <math>P[j + 1]</math>.
#* If they are equal, set <math>T[i] = j + 1</math>, <math>i = i + 1</math>.
#* If
# Return to step 3.
==Notes on the Algorithm==
Line 105 ⟶ 107:
[[Category:Algorithms on strings]]
|