Content deleted Content added
Shell Kinney (talk | contribs) missing period (you can help!) |
Quuxplusone (talk | contribs) m minor fmting, add wikisourcepar |
||
Line 1:
{{wikisource}}
The '''Knuth-Morris-Pratt [[string searching algorithm]]''' searches for occurrences of a "pattern" string <math>P</math> within a main string <math>S</math> by employing the simple observation that when a mismatch occurs, we have enough knowledge simply by possessing the pattern to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
Line 4 ⟶ 5:
==The KMP algorithm==
To motivate the technical details, we consider a specific instance. In this article, we will be using [[Array#Indices into Arrays|zero-based arrays]] for our strings; thus the 'C' in <math>P</math> will be denoted <math>P[2]</math>. Let <math>m</math> be the start of the currently matched substring within <math>S</math> and <math>i</math> the position within <math>P</math>.
Line 43:
===The search algorithm===
The above example is completely instructive in this regard. We assume the existence of a "partial match" table, described below, which indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. For the moment the table, <math>T</math>, should be taken as a [[black box]] with the property that if we have a match starting at <math>S[m]</math> that fails when comparing <math>S[m + i]</math> to <math>P[i]</math>, then the next possible match will start at <math>m + i - T[i - 1]</math>. In particular, <math>T[-1]</math> is defined and equals <math>-1</math>. Knowing this, the algorithm is very simple:
Line 80 ⟶ 79:
===The efficiency of the search algorithm===
Assuming the prior existence of the table <math>T</math>, the search portion of the Knuth-Morris-Pratt algorithm has [[complexity]] [[Big-O notation|O]]<math>(l)</math>, where <math>l</math> is the length of <math>S</math>. As except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the central loop, we will calculate a bound on the number of iterations of this loop; in order to do this we first make a key observation about the nature of <math>T</math>. By definition it is constructed so that if a match which had begun at <math>S[m]</math> fails while comparing <math>S[m + i]</math> to <math>P[i]</math>, then the next possible match must begin at <math>S[m + (i - T[i])]</math>. In particular the next possible match must occur at a higher index than <math>m</math>, so that <math>T[i] < i</math>.
Line 88 ⟶ 86:
==The "partial match" table==
The goal of the table is to allow the algorithm not to check any character of the main string more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an ''initial segment'' of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.
Line 136 ⟶ 133:
===The table-building algorithm===
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. Here follows the algorithm; to eliminate special cases we will use the convention that <math>P[-1]</math> is defined and its value is unequal to any possible character in <math>P</math>.
Line 148 ⟶ 144:
===A code example of the table-building algorithm===
A [[C programming language|C]] code example for the table-building algorithm is given here. As for the search algorithm, the bounds of <math>T</math> have been incremented by 1 to make the C code more natural. The additional variable <math>c</math> is employed to simulate the existence of <math>P[-1]</math>. It is also assumed that both this routine and the search routine are [[call]]ed as [[subroutine]]s of a [[wrapper]] which correctly [[Memory allocation|allocates]] [[Computer storage|memory]] for <math>T</math>.
Line 178 ⟶ 173:
===The efficiency of the table-building algorithm===
The complexity of the table algorithm is <math>O(n)</math>, where <math>n</math> is the length of <math>P</math>. As except for some initialization all the work is done in step 3, it is sufficient to show that step 3 executes in <math>O(n)</math> time, which will be done by simultaneously examining the quantities <math>i</math> and <math>i - j</math>. In the first branch, <math>i - j</math> is preserved, as both <math>i</math> and <math>j</math> are incremented simultaneously, but naturally, <math>i</math> is increased. In the second branch, <math>j</math> is replaced by <math>T[j - 1]</math>, which we saw above is always strictly less than <math>j</math>, thus increasing <math>i - j</math>. In the third branch, <math>i</math> is incremented and <math>j</math> is not, so both <math>i</math> and <math>i - j</math> increase. Since <math>i \geq i - j</math>, this means that at each stage either <math>i</math> or a lower bound for <math>i</math> increases; therefore since the algorithm terminates once <math>i = n</math>, it must terminate after at most <math>n</math> iterations of step 3, since <math>i - j</math> begins at <math>1</math>. Therefore the complexity of the table algorithm is <math>O(n)</math>.
==The efficiency of the KMP algorithm==
Since the two portions of the algorithm have, respectively, complexities of <math>O(l)</math> and <math>O(n)</math>, the complexity of the overall algorithm is [[Big-O notation|O]]<math>(n + l)</math>.
==External
*[http://www.ics.uci.edu/~eppstein/161/960227.html An explanation of the algorithm]
|