Knuth–Morris–Pratt algorithm: Difference between revisions

Content deleted Content added
Ryan Reich (talk | contribs)
m Reorder the efficiency calculation and the code example for the table-building algorithm for consistency with the last change.
OAbot (talk | contribs)
m Open access bot: url-access=subscription updated in citation with #oabot.
 
(661 intermediate revisions by more than 100 users not shown)
Line 1:
{{Short description|Algorithm for finding sub-text ___location(s) inside a given sentence in Big O(n) time}}<!--If you are thinking of adding an implementation of this algorithm in a particular language, think again. See the talk page.-->{{Infobox algorithm|name=Knuth–Morris–Pratt algorithm|data=[[String (computer science)|String]]|class=[[String-searching algorithm|String search]]|time=<math>\Theta(m)</math> preprocessing + <math>\Theta(n)</math> matching<ref group="note"><math>m</math> is the length of the pattern, which is the string we are searching for in the text which is of length <math>n</math></ref>|space=<math>\Theta(m)</math>}}
The '''Knuth-Morris-Pratt [[string searching algorithm]]''' searches for occurrences of a "pattern" string <math>P</math> within a main string <math>S</math> by employing the simple observation that when a mismatch occurs, we have enough knowledge simply by possessing the pattern to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
 
In [[computer science]], the '''Knuth–Morris–Pratt algorithm''' (or '''KMP algorithm''') is a [[string-searching algorithm]] that searches for occurrences of a "word" <code>W</code> within a main "text string" <code>S</code> by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
The algorithm was invented by [[Donald Knuth|Knuth]] and [[Vaughan Pratt|Pratt]] and independently by [[J. H. Morris]] in [[1976]]
 
The [[algorithm]] was conceived by [[James H. Morris]] and independently discovered by [[Donald Knuth]] "a few weeks later" from [[automata theory]].<ref name=knuth1977>
==The KMP Algorithm==
{{cite journal|last1=Knuth|first1=Donald|last2=Morris|first2=James H.|last3=Pratt|first3=Vaughan|title=Fast pattern matching in strings|journal=SIAM Journal on Computing|date=1977|volume=6|issue=2|pages=323–350|doi=10.1137/0206024|citeseerx=10.1.1.93.8147}}</ref><ref>
{{cite journal | last1=Knuth | first1=Donald E. | title=The Dangers of Computer-Science Theory | journal=Studies in Logic and the Foundations of Mathematics | volume=74 | year=1973 | pages=189–195 | doi=10.1016/S0049-237X(09)70357-X| isbn=978-0-444-10491-5 }}</ref>
Morris and [[Vaughan Pratt]] published a technical report in 1970.<ref>
{{cite tech report | last1=Morris | first1=J.H. Jr | last2=Pratt | first2=V. | title=A linear pattern-matching algorithm | number=TR-40 | year=1970 | institution=University of California, Berkeley, Computation Center}}</ref>
The three also published the algorithm jointly in 1977.<ref name=knuth1977></ref> Independently, in 1969, [[Yuri Matiyasevich|Matiyasevich]]<ref>{{cite journal
| language = ru
| last1 = Матиясевич
| first1 = Юрий
| title = О распознавании в реальное время отношения вхождения
| journal = Записки научных семинаров Ленинградского отделения Математического института им. В.А.Стеклова
| volume = 20
| year = 1971
| pages = 104–114
| url = http://gdz.sub.uni-goettingen.de/pdfcache/PPN502141972/PPN502141972___LOG_0019.pdf
}}, translated into English as {{cite journal
| language = en
| last1 = Matiyasevich
| first1 = Yuri
| title = Real-time recognition of the inclusion relation
| journal = Journal of Soviet Mathematics
| volume = 1
| year = 1973
| pages = 64–70
| doi = 10.1007/BF01117471
| s2cid = 121919479
| url = http://logic.pdmi.ras.ru/~yumat/Journal/inclusion/inclusion.pdf.gz
| access-date = 2017-07-04
| archive-date = 2021-04-30
| archive-url = https://web.archive.org/web/20210430124227/https://logic.pdmi.ras.ru/~yumat/Journal/inclusion/inclusion.pdf.gz
| url-status = live
}}</ref><ref>Knuth mentions this fact in the errata of his book ''Selected Papers on Design of Algorithms '' : {{quotation|I learned in 2012 that Yuri Matiyasevich had anticipated the linear-time pattern matching and pattern preprocessing algorithms of this paper, in the special case of a binary alphabet, already in 1969. He presented them as constructions for a Turing machine with a two-dimensional working memory.}}</ref> discovered a similar algorithm, coded by a two-dimensional Turing machine, while studying a string-pattern-matching recognition problem over a binary alphabet. This was the first linear-time algorithm for string matching.<ref>{{cite journal
| last1 = Amir | first1 = Amihood
| last2 = Landau | first2 = Gad M.
| last3 = Lewenstein | first3 = Moshe
| last4 = Sokol | first4 = Dina
| doi = 10.1145/1240233.1240242
| issue = 2
| journal = ACM Trans. Algorithms
| page = 19
| title = Dynamic text and static pattern matching
| volume = 3
| year = 2007| s2cid = 8409826
}}</ref>
 
==Background==
To motivate the technical details, we consider a specific instance. In this article, we will be using [[Array#Indices into Arrays|zero-based arrays]] for our strings; thus the 'C' in <math>P</math> will be denoted <math>P[2]</math>. Let <math>m</math> be the start of the currently matched substring within <math>S</math> and <math>i</math> the position within <math>P</math>.
A string-matching algorithm wants to find the starting index <code>m</code> in string <code>S[]</code> that matches the search word <code>W[]</code>.
 
The most straightforward algorithm, known as the "[[Brute-force search|brute-force]]" or "naive" algorithm, is to look for a word match at each index <code>m</code>, i.e. the position in the string being searched that corresponds to the character <code>S[m]</code>. At each position <code>m</code> the algorithm first checks for equality of the first character in the word being searched, i.e. <code>S[m] =? W[0]</code>. If a match is found, the algorithm tests the other characters in the word being searched by checking successive values of the word position index, <code>i</code>. The algorithm retrieves the character <code>W[i]</code> in the word being searched and checks for equality of the expression <code>S[m+i] =? W[i]</code>. If all successive characters match in <code>W</code> at position <code>m</code>, then a match is found at that position in the search string. If the index <code>m</code> reaches the end of the string then there is no match, in which case the search is said to "fail".
 
Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 26 (1 in 26^2 chances of a match over 26 possible letters). So if the characters are random, then the expected complexity of searching string <code>S[]</code> of length ''n'' is on the order of ''n'' comparisons or ''Θ''(''n''). The expected performance is very good. If <code>S[]</code> is 1 million characters and <code>W[]</code> is 1000 characters, then the string search should complete after about 1.04 million character comparisons.
 
That expected performance is not guaranteed. If the strings are not random, then checking a trial <code>m</code> may take many character comparisons. The worst case is if the two strings match in all but the last letter. Imagine that the string <code>S[]</code> consists of 1 million characters that are all ''A'', and that the word <code>W[]</code> is 999 ''A'' characters terminating in a final ''B'' character. The simple string-matching algorithm will now examine 1000 characters at each trial position before rejecting the match and advancing the trial position. The simple string search example would now take about 1000 character comparisons times 1 million positions for 1 billion character comparisons. If the length of <code>W[]</code> is ''k'', then the worst-case performance is ''O''(''k''&sdot;''n'').
 
The KMP algorithm has a better worst-case performance than the straightforward algorithm. KMP spends a little time precomputing a table (on the order of the size of <code>W[]</code>, ''O''(''k'')), and then it uses that table to do an efficient search of the string in ''O''(''n'').
 
The difference is that KMP makes use of previous match information that the straightforward algorithm does not. In the example above, when KMP sees a trial match fail on the 1000th character (<code>i</code> = 999) because <code>S[m+999] ≠ W[999]</code>, it will increment <code>m</code> by 1, but it will know that the first 998 characters at the new position already match. KMP matched 999 ''A'' characters before discovering a mismatch at the 1000th character (position 999). Advancing the trial match position <code>m</code> by one throws away the first ''A'', so KMP knows there are 998 ''A'' characters that match <code>W[]</code> and does not retest them; that is, KMP sets <code>i</code> to 998. KMP maintains its knowledge in the precomputed table and two state variables. When KMP discovers a mismatch, the table determines how much KMP will increase (variable <code>m</code>) and where it will resume testing (variable <code>i</code>).
 
==KMP algorithm==
 
===Example of the search algorithm===
To illustrate the algorithm's details, consider a (relatively artificial) run of the algorithm, where <code>W</code> = "ABCDABD" and <code>S</code> = "ABC ABCDAB ABCDABCDABDE". At any given time, the algorithm is in a state determined by two integers:
 
* <code>m</code>, denoting the position within <code>S</code> where the prospective match for <code>W</code> begins,
* <code>i</code>, denoting the index of the currently considered character in <code>W</code>.
 
In each step the algorithm compares <code>S[m+i]</code> with <code>W[i]</code> and increments <code>i</code> if they are equal. This is depicted, at the start of the run, like
 
<div style="font-family:monospace;">
1 2
m: 01234567890123456789012
S: {{color|blue|ABC}}{{color|red| }}{{color|gray|ABCDAB ABCDABCDABDE}}
W: {{color|blue|ABC}}{{color|red|D}}{{color|gray|ABD}}
P: ABCDABD
i: {{color|blue|012}}{{color|red|3}}{{color|gray|456}}
i: 0123456
</div>
 
WeThe beginalgorithm acompares stepsuccessive bycharacters matchingof each<code>W</code> characterto one"parallel" aftercharacters another.of <code>S</code>, Thusmoving infrom theone fourthto step,the <math>mnext =by 0</math> andincrementing <mathcode>i = 3</mathcode> if they match. ButHowever, in the fourth step <mathcode>S[3]&nbsp;=&nbsp;'&nbsp;'</mathcode> isdoes anot space andmatch <mathcode>PW[3] = 'D'</mathcode>, so we have a mismatch. Rather than beginning to search again at <mathcode>m = S[1]</mathcode>, we note that no <code>'A'</code> occurs between positions 01 and 32 in <mathcode>PS</mathcode> except at 0; hence, having checked all those characters previously, we(and knowknowing they matched the corresponding characters in <code>W</code>), there is no chance of finding the beginning of a match if we check them again. Therefore we move on to, the nextalgorithm character, settingsets <mathcode>m = 43</mathcode> and <mathcode>i = 0</mathcode>.
 
<div style="font-family:monospace;">
1 2
m: 01234567890123456789012
S: ABC{{color|red| }}{{color|gray|ABCDAB ABCDABCDABDE}}
W: {{color|red|A}}{{color|gray|BCDABD}}
P: ABCDABD
i: {{color|red|0}}{{color|gray|123456}}
i: 0123456
</div>
 
This match fails at the initial character, so the algorithm sets <code>m = 4</code> and <code>i = 0</code>
We quickly obtain a nearly complete match 'ABCDAB' when, at <math>i = 6</math>, we again have a discrepancy. However, just prior to the end of the current partial match, we passed an 'AB' which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset <math>m = 8</math>, <math>i = 2</math> and continue matching the current character.
 
<div style="font-family:monospace;">
1 2
m: 01234567890123456789012
S: ABC {{color|blue|ABCDAB}}{{color|red| }}{{color|gray|ABCDABCDABDE}}
W: {{color|blue|ABCDAB}}{{color|red|D}}
P: ABCDABD
i: {{color|blue|012345}}{{color|red|6}}
i: 0123456
</div>
 
Here, <code>i</code> increments through a nearly complete match <code>"ABCDAB"</code> until <code>i = 6</code> giving a mismatch at <code>W[6]</code> and <code>S[10]</code>. However, just prior to the end of the current partial match, there was that substring <code>"AB"</code> that could be the beginning of a new match, so the algorithm must take this into consideration. As these characters match the two characters prior to the current position, those characters need not be checked again; the algorithm sets <code>m = 8</code> (the start of the initial prefix) and <code>i = 2</code> (signaling the first two characters match) and continues matching. Thus the algorithm not only omits previously matched characters of <code>S</code> (the <code>"AB"</code>), but also previously matched characters of <code>W</code> (the prefix <code>"AB"</code>).
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we increment <math>m = 11</math>, reset <math>i = 0</math>, and begin searching anew.
 
<div style="font-family:monospace;">
1 2
m: 01234567890123456789012
S: ABC ABCDABABCD{{color|gray|AB}}{{color|red| }}{{color|gray|ABCDABCDABDE}}
W: {{color|gray|AB}}{{color|red|C}}{{color|gray|DABD}}
P: ABCDABD
i: {{color|gray|01}}{{color|red|2}}{{color|gray|3456}}
i: 0123456
</div>
 
This search at the new position fails immediately because <code>W[2]</code> (a <code>'C'</code>) does not match <code>S[10]</code> (a <code>'&nbsp;'</code>). As in the first trial, the mismatch causes the algorithm to return to the beginning of <code>W</code> and begins searching at the mismatched character position of <code>S</code>: <code>m = 10</code>, reset <code>i = 0</code>.
Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set <math>m = 15</math>, to start at the two-character string 'AB' leading up to the current position, set <math>i = 2</math>, and continue matching from the current position.
 
<div style="font-family:monospace;">
1 2
m: 01234567890123456789012
S: ABC ABCDAB{{color|red| }}{{color|gray|ABCDABCDABDE}}
PW: ABCDABD{{color|red|A}}{{color|gray|BCDABD}}
i: 0123456{{color|red|0}}{{color|gray|123456}}
</div>
 
The match at <code>m=10</code> fails immediately, so the algorithm next tries <code>m = 11</code> and <code>i = 0</code>.
This time we are able to complete the match, and return the position 15 as its origin.
 
<div style="font-family:monospace;">
===The search algorithm===
1 2
m: 01234567890123456789012
S: ABC ABCDAB {{color|blue|ABCDAB}}{{color|red|C}}{{color|gray|DABDE}}
W: {{color|blue|ABCDAB}}{{color|red|D}}
i: {{color|blue|012345}}{{color|red|6}}
</div>
 
Once again, the algorithm matches <code>"ABCDAB"</code>, but the next character, <code>'C'</code>, does not match the final character <code>'D'</code> of the word <code>W</code>. Reasoning as before, the algorithm sets <code>m = 15</code>, to start at the two-character string <code>"AB"</code> leading up to the current position, set <code>i = 2</code>, and continue matching from the current position.
The above example is completely instructive in this regard. We assume the existence of a "partial match" table, described below, which indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. For the moment the table, <math>T</math>, should be taken as a [[black box]] with the property that if we have a match starting at <math>S[m]</math> that fails when comparing <math>S[m + i]</math> to <math>P[i]</math>, then the next possible match will start at <math>m + i - T[i - 1]</math>. In particular, <math>T[-1]</math> is defined and equals <math>-1</math>. Knowing this, the algorithm is very simple:
 
<div style="font-family:monospace;">
# Let <math>i = m = 0</math>; say <math>P</math> has <math>n</math> characters, and <math>S</math>, <math>l</math> characters.
1 2
# If <math>m + i = l</math>, then we exit; there is no match. Otherwise, compare <math>P[i]</math> versus <math>S[m + i]</math>.
m: 01234567890123456789012
#* If they are equal, set <math>i = i + 1</math>. If <math>i = n</math> then we have found a full match; terminate the algorithm and return <math>m</math> as the start of the match.
S: ABC ABCDAB ABCD{{color|blue|ABCDABD}}{{color|gray|E}}
#* If they are unequal, let <math>e = T[i - 1]</math>. Set <math>m = m + i - e</math> and if <math>i > 0</math>, set <math>i = e</math>.
W: {{color|blue|ABCDABD}}
# Return to step 2.
i: {{color|blue|0123456}}
</div>
 
This time the match is complete, and the first character of the match is <code>S[15]</code>.
This clearly implements the algorithm performed in the example; at each mismatch, we consult the table for the next known possible beginning of a new match and update our counters accordingly. Thus we need never backtrack; in particular, each character is matched only once (though potentially ''mis''matched many times; an analysis of the efficiency of this algorithm is given below).
 
===ADescription codeof examplepseudocode offor the search algorithm===
<!--Note to future editors: please do not replace or even supplement the pseudocode with language-specific code. Following the WikiProject Computer Science manual of style, pseudocode is preferred over real code unless the real code illustrates a feature of the language or an implementation detail. This algorithm is so simple that neither of these can ever be the case-->
Sample [[C programming language|C]] code for an implementation of this algorithm is given below. In order to satisfy the natural limitations on the bounds of C arrays, <math>T[i]</math> in the code is equivalent to <math>T[i + 1]</math> above.
The above example contains all the elements of the algorithm. For the moment, we assume the existence of a "partial match" table <code>T</code>, described [[#"Partial match" table (also known as "failure function")|below]], which indicates where we need to look for the start of a new match when a mismatch is found. The entries of <code>T</code> are constructed so that if we have a match starting at <code>S[m]</code> that fails when comparing <code>S[m + i]</code> to <code>W[i]</code>, then the next possible match will start at index <code>m + i - T[i]</code> in <code>S</code> (that is, <code>T[i]</code> is the amount of "backtracking" we need to do after a mismatch). This has two implications: first, <code>T[0] = -1</code>, which indicates that if <code>W[0]</code> is a mismatch, we cannot backtrack and must simply check the next character; and second, although the next possible match will ''begin'' at index <code>m + i - T[i]</code>, as in the example above, we need not actually check any of the <code>T[i]</code> characters after that, so that we continue searching from <code>W[T[i]]</code>. The following is a sample [[pseudocode]] implementation of the KMP search algorithm.
'''algorithm''' ''kmp_search'':
'''input''':
an array of characters, S (the text to be searched)
an array of characters, W (the word sought)
'''output''':
an array of integers, P (positions in S at which W is found)
an integer, nP (number of positions)
'''define variables''':
an integer, j ← 0 (the position of the current character in S)
an integer, k ← 0 (the position of the current character in W)
an array of integers, T (the table, computed elsewhere)
'''let''' nP ← 0
'''while''' j < length(S) '''do'''
'''if''' W[k] = S[j] '''then'''
'''let''' j ← j + 1
'''let''' k ← k + 1
'''if''' k = length(W) '''then'''
(occurrence found, if only first occurrence is needed, m ← j - k may be returned here)
'''let''' P[nP] ← j - k, nP ← nP + 1
'''let''' k ← T[k] (T[length(W)] can't be -1)
'''else'''
'''let''' k ← T[k]
'''if''' k < 0 '''then'''
'''let''' j ← j + 1
'''let''' k ← k + 1
 
===Efficiency of the search algorithm===
int kmp_search(char *P, char *S)
Assuming the prior existence of the table <code>T</code>, the search portion of the Knuth–Morris–Pratt algorithm has [[Computational complexity theory|complexity]] [[Linear time#Linear time|''O''(''n'')]], where ''n'' is the length of <code>S</code> and the ''O'' is [[big-O notation]]. Except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the <code>'''while'''</code> loop. To bound the number of iterations of this loop; observe that <code>T</code> is constructed so that if a match which had begun at <code>S[m]</code> fails while comparing <code>S[m + i]</code> to <code>W[i]</code>, then the next possible match must begin at <code>S[m + (i - T[i])]</code>. In particular, the next possible match must occur at a higher index than <code>m</code>, so that <code>T[i] < i</code>.
{
extern int T[];
int m = 0;
int i = 0;
while (S[m + i] != '\0' && P[i] != '\0') {
if (S[m + i] == P[i]) {
++i;
} else {
m += i - T[i];
if (i > 0) i = T[i];
}
}
if (P[i] == '\0') {
return m;
} else {
return m + i;
}
}
 
This fact implies that the loop can execute at most 2''n'' times, since at each iteration it executes one of the two branches in the loop. The first branch invariably increases <code>i</code> and does not change <code>m</code>, so that the index <code>m + i</code> of the currently scrutinized character of <code>S</code> is increased. The second branch adds <code>i - T[i]</code> to <code>m</code>, and as we have seen, this is always a positive number. Thus the ___location <code>m</code> of the beginning of the current potential match is increased. At the same time, the second branch leaves <code>m + i</code> unchanged, for <code>m</code> gets <code>i - T[i]</code> added to it, and immediately after <code>T[i]</code> gets assigned as the new value of <code>i</code>, hence <code>new_m + new_i = old_m + old_i - T[old_i] + T[old_i] = old_m + old_i</code>. Now, the loop ends if <code>m + i</code> = ''n''; therefore, each branch of the loop can be reached at most ''n'' times, since they respectively increase either <code>m + i</code> or <code>m</code>, and <code>m ≤ m + i</code>: if <code>m</code> = ''n'', then certainly <code>m + i</code> ≥ ''n'', so that since it increases by unit increments at most, we must have had <code>m + i</code> = ''n'' at some point in the past, and therefore either way we would be done.
===The efficiency of the search algorithm===
 
Thus the loop executes at most 2''n'' times, showing that the time complexity of the search algorithm is ''O''(''n'').
Assuming the prior existence of the table <math>T</math>, the search portion of the Knuth-Morris-Pratt algorithm has [[complexity]] [[Big-O notation|O]]<math>(l)</math>, where <math>l</math> is the length of <math>S</math>. As except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the central loop, we will calculate a bound on the number of iterations of this loop; in order to do this we first make a key observation about the nature of <math>T</math>. By definition it is constructed so that if a match which had begun at <math>S[m]</math> fails while comparing <math>S[m + i]</math> to <math>P[i]</math>, then the next possible match must begin at <math>S[m + (i - T[i])]</math>. In particular the next possible match must occur at a higher index than <math>m</math>, so that <math>T[i] < i</math>.
 
Here is another way to think about the runtime:
Using this fact, we will show that the loop can execute at most <math>l</math> times. For in each iteration, it executes one of the two branches of the <tt>if</tt> statement. The first branch invariably increases <math>i</math> and does not change <math>m</math>, so that the index <math>m + i</math> of the currently scrutinized character of <math>S</math> is increased. The second branch adds <math>i - T[i]</math> to <math>m</math>, and as we have seen, this is always a positive number. Thus the ___location <math>m</math> of the beginning of the current potential match is increased. Now, the loop ends if <math>S[m + i] = '\backslash 0'</math>, which following the C convention that a null character denotes the end-of-string, means that <math>m + i = l</math>. Therefore each branch of the <tt>if</tt> statement can be reached at most <math>l</math> times, since they respectively increase either <math>m + i</math> or <math>m</math>, and <math>m \leq m + i</math> so that if <math>m = l</math>, then certainly <math>m + i \geq l</math>, so that since it increases by unit increments at most, we must have had <math>m + i = l</math> at some point in the past.
Let us say we begin to match <code>W</code> and <code>S</code> at position <code>i</code> and <code>p</code>. If <code>W</code> exists as a substring of <code>S</code> at p, then <code>W[0..m] = S[p..p+m]</code>.
Upon success, that is, the word and the text matched at the positions (<code>W[i] = S[p+i]</code>), we increase <code>i</code> by 1.
Upon failure, that is, the word and the text do not match at the positions (<code>W[i] ≠ S[p+i]</code>), the text pointer is kept still, while the word pointer is rolled back a certain amount (<code>i = T[i]</code>, where <code>T</code> is the jump table), and we attempt to match <code>W[T[i]]</code> with <code>S[p+i]</code>.
The maximum number of roll-back of <code>i</code> is bounded by <code>i</code>, that is to say, for any failure, we can only roll back as much as we have progressed up to the failure.
Then it is clear the runtime is 2''n''.
 
=="Partial match" table (also known as "failure function")==
Thus the loop executes at most <math>2l</math> times, showing that the time complexity of the search algorithm is <math>O(l)</math>.
The goal of the table is to allow the algorithm not to match any character of <code>S</code> more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an ''initial segment'' of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.
 
We want to be able to look up, for each position in <code>W</code>, the length of the longest possible initial segment of <code>W</code> leading up to (but not including) that position, other than the full segment starting at <code>W[0]</code> that just failed to match; this is how far we have to backtrack in finding the next match. Hence <code>T[i]</code> is exactly the length of the longest possible ''proper'' initial segment of <code>W</code> which is also a segment of the substring ending at <code>W[i - 1]</code>. We use the convention that the empty string has length 0. Since a mismatch at the very start of the pattern is a special case (there is no possibility of backtracking), we set <code>T[0] = -1</code>, as discussed [[#Description of pseudocode for the table-building algorithm|below]].
==The "partial match" table==
 
===Working example of the table-building algorithm===
The goal of the table is to allow the algorithm not to check any character of the main string more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an ''initial segment'' of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.
We consider the example of <code>W = "ABCDABD"</code> first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set <code>T[0] = -1</code>. To find <code>T[1]</code>, we must discover a [[Substring#Suffix|proper suffix]] of <code>"A"</code> which is also a prefix of pattern <code>W</code>. But there are no proper suffixes of <code>"A"</code>, so we set <code>T[1] = 0</code>. To find <code>T[2]</code>, we see that the substring <code>W[0]</code> - <code>W[1]</code> (<code>"AB"</code>) has a proper suffix <code>"B"</code>. However "B" is not a prefix of the pattern <code>W</code>. Therefore, we set <code>T[2] = 0</code>.
 
Continuing to <code>T[3]</code>, we first check the proper suffix of length 1, and as in the previous case it fails. Should we also check longer suffixes? No, we now note that there is a shortcut to checking ''all'' suffixes: let us say that we discovered a [[Substring#Suffix|proper suffix]] which is a [[Substring#Prefix|proper prefix]] (a proper prefix of a string is not equal to the string itself) and ending at <code>W[2]</code> with length 2 (the maximum possible); then its first character is also a proper prefix of <code>W</code>, hence a proper prefix itself, and it ends at <code>W[1]</code>, which we already determined did not occur as <code>T[2] = 0</code> and not <code>T[2] = 1</code>. Hence at each stage, the shortcut rule is that one needs to consider checking suffixes of a given size m+1 only if a valid suffix of size m was found at the previous stage (i.e. <code>T[x] = m</code>) and should not bother to check m+2, m+3, etc.
We want to be able to look up, for each pattern position, the length of the longest possible initial match that ends in the current position other than the full match that probably just failed. Hence <math>T[i]</math> is exactly the length of the longest possible initial segment of <math>P</math> which is also a terminal segment of the substring ending at <math>P[i]</math>. We use the convention that the empty string has length 0. Since a mismatch at the very start of the pattern is a special case (there is no possibility of backtracking), we set <math>T[-1] = -1</math>, as discussed above.
 
Therefore, we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so <code>T[3] = 0</code>.
We consider the example of 'ABCDABD' first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set <math>T[-1] = 0</math>. Since <math>P[0]</math> is at the end of only the complete initial segment, we set <math>T[0] = 0</math> as well. To find <math>T[1]</math>, we must discover a proper terminal substring of 'AB' which is also an initial substring of the pattern. But the only proper terminal substring of 'AB' is 'B', which is not an initial substring of the pattern, so we set <math>T[1] = 0</math>.
 
We pass to the subsequent <code>W[4]</code>, <code>'A'</code>. The same logic shows that the longest substring we need to consider has length 1, and as in the previous case it fails since "D" is not a prefix of <code>W</code>. But instead of setting <code>T[4] = 0</code>, we can do better by noting that <code>W[4] = W[0]</code>, and also that a look-up of <code>T[4]</code> implies the corresponding <code>S</code> character, <code>S[m+4]</code>, was a mismatch and therefore <code>S[m+4] ≠ 'A'</code>. Thus there is no point in restarting the search at <code>S[m+4]</code>; we should begin at 1 position ahead. This means that we may shift pattern <code>W</code> by match length plus one character, so <code>T[4] = -1</code>.
Continuing to 'C', we note that there is a shortcut to checking ''all'' terminal substrings: let us say that we discovered a terminal substring ending at 'C' with length 2; then its first character is an initial substring of an initial substring of the pattern, hence an initial substring itself...and it ends at 'B', which we already determined cannot occur. Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so <math>T[2] = 0</math>.
 
Considering now the next character, <code>W[5]</code>, which is <code>'B'</code>: though by inspection the longest substring would appear to be <code>'A'</code>, we still set <code>T[5] = 0</code>. The reasoning is similar to why <code>T[4] = -1</code>. <code>W[5]</code> itself extends the prefix match begun with <code>W[4]</code>, and we can assume that the corresponding character in <code>S</code>, <code>S[m+5] ≠ 'B'</code>. So backtracking before <code>W[5]</code> is pointless, but <code>S[m+5]</code> may be <code>'A'</code>, hence <code>T[5] = 0</code>.
Similarly for 'D', giving <math>T[3] = 0</math>, so we pass to the subsequent 'A'. The same logic shows that the longest substring we need consider has length 1, and in this case 'A' ''does'' work; thus <math>T[4] = 1</math>. Considering now the following 'B', we exercise the following logic: if we were to find a subpattern beginning before the previous 'A' yet continuing to the current 'B', then in particular it would itself have a proper initial segment ending at 'A' yet beginning before 'A', which contradicts the fact that we already found that 'A' itself is the earliest occurrence of a proper subpattern ending at it. Therefore we need not look before that 'A' to find a subpattern for 'B'. In fact, checking it, we find that it continues to 'B' and that 'B' is the second entry of the substring of which 'A' is the first. Therefore the entry for 'B' in 'T' is one more than the entry for 'A', namely <math>T[5] = 2</math>.
 
Finally, we see that the next character in the ongoing segment starting at <code>W[4] = 'A'</code> would be <code>'B'</code>, and indeed this is also <code>W[5]</code>. Furthermore, the same argument as above shows that we need not look before <code>W[4]</code> to find a segment for <code>W[6]</code>, so that this is it, and we take <code>T[6] = 2</code>.
Finally, the pattern does ''not'' continue from 'B' to 'D'. Similar reasoning as above shows that if we were to find a subpattern longer than 1 at 'D' then it must comprise a subpattern ending at 'B'; since the current one does not work, this one would have to be shorter. But the current one is an initial segment of the pattern ending at the second position, so this potential new subpattern would be as well, and we already decided there were none of those. Since 'D' itself is not a subpattern, <math>T[6] = 0</math>.
 
Therefore, we compile the following table:
 
{| class="wikitable" style="background-color:white; font-family:monospace; text-align:right"
{| border="1" cellpadding="2"
!<mathcode>i</mathcode>
| -1
| 0
| 1
Line 113 ⟶ 216:
| 5
| 6
| 7
|-
!<mathcode>PW[i]</mathcode>
|
| A
| B
Line 123 ⟶ 226:
| B
| D
|
|-
!<mathcode>T[i]</mathcode>
| -1
| 0
| 0
| 0
| -1
| 0
| 2
| 0
|}
 
Another example:
{| class="wikitable" style="background-color:white; font-family:monospace; text-align:right"
!<code>i</code>
| 0
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
|-
!<code>W[i]</code>
| A
| B
| A
| C
| A
| B
| A
| B
| C
|
|-
!<code>T[i]</code>
| -1
| 0
| -1
| 1
| -1
| 0
| -1
| 3
| 2
| 0
|}
 
Another example (slightly changed from the previous example):
===The table-building algorithm===
{| class="wikitable" style="background-color:white; font-family:monospace; text-align:right"
!<code>i</code>
| 0
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
|-
!<code>W[i]</code>
| A
| B
| A
| C
| A
| B
| A
| B
| A
|
|-
!<code>T[i]</code>
| -1
| 0
| -1
| 1
| -1
| 0
| -1
| 3
| -1
| 3
|}
 
Another more complicated example:
{| class="wikitable" style="background-color:white; font-family:monospace; text-align:right"
!<code>i</code>
| 00
| 01
| 02
| 03
| 04
| 05
| 06
| 07
| 08
| 09
| 10
| 11
| 12
| 13
| 14
| 15
| 16
| 17
| 18
| 19
| 20
| 21
| 22
| 23
| 24
|-
!<code>W[i]</code>
| P
| A
| R
| T
| I
| C
| I
| P
| A
| T
| E
|
| I
| N
|
| P
| A
| R
| A
| C
| H
| U
| T
| E
|
|-
!<code>T[i]</code>
| -1
| 0
| 0
| 0
| 0
| 0
| 0
| -1
| 0
| 2
| 0
| 0
| 0
| 0
| 0
| -1
| 0
| 0
| 3
| 0
| 0
| 0
| 0
| 0
| 0
|}
 
===Description of pseudocode for the table-building algorithm===
<!--Note to future editors: please do not replace or even supplement the pseudocode with language-specific code. Following the WikiProject Computer Science manual of style, pseudocode is preferred over real code unless the real code illustrates a feature of the language or an implementation detail. This algorithm is so simple that neither of these can ever be the case-->
<!--This code appears to implement the table for the Morris-Pratt algorithm, not for the Knuth-Morris-Pratt Algorithm. The KMP table has bigger shifts. See http://www-igm.univ-mlv.fr/~lecroq/string/node8.html-->
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. The only minor complication is that the logic which is correct late in the string erroneously gives non-proper substrings at the beginning. This necessitates some initialization code.
'''algorithm''' ''kmp_table'':
'''input''':
an array of characters, W (the word to be analyzed)
'''output''':
an array of integers, T (the table to be filled)
'''define variables''':
an integer, pos ← 1 (the current position we are computing in T)
an integer, cnd ← 0 (the zero-based index in W of the next character of the current candidate substring)
'''let''' T[0] ← -1
'''while''' pos < length(W) '''do'''
'''if''' W[pos] = W[cnd] '''then'''
'''let''' T[pos] ← T[cnd]
'''else'''
'''let''' T[pos] ← cnd
'''while''' cnd ≥ 0 '''and''' W[pos] ≠ W[cnd] '''do'''
'''let''' cnd ← T[cnd]
'''let''' pos ← pos + 1, cnd ← cnd + 1
'''let''' T[pos] ← cnd (only needed when all word occurrences are searched)
 
===Efficiency of the table-building algorithm===
 
The time (and space) complexity of the table algorithm is <math>O(k)</math>, where <math>k</math> is the length of <code>W</code>.
 
* The outer loop: <code>pos</code> is initialized to 1, the loop condition is <code>pos < k</code>, and <code>pos</code> is increased by 1 in every iteration of the loop. Thus the loop will take <math>k - 1</math> iterations.
 
* The inner loop: <code>cnd</code> is initialized to <code>0</code> and gets increased by at most 1 in each outer loop iteration. <code>T[cnd]</code> is always less than <code>cnd</code>, so <code>cnd</code> gets decreased by at least 1 in each inner loop iteration; the inner loop condition is <code>cnd ≥ 0</code>. This means that the inner loop can execute at most as many times in total, as the outer loop has executed – each decrease of <code>cnd</code> by 1 in the inner loop needs to have a corresponding increase by 1 in the outer loop. Since the outer loop takes <math>k - 1</math> iterations, the inner loop can take no more than <math>k - 1</math> iterations in total.
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. Here follows the algorithm; to eliminate special cases we will use the convention that <math>P[-1]</math> is defined and its value is unequal to any possible character in <math>P</math>.
 
#Combined, Setthe <math>T[-1]outer =and -1</math>,inner andloops lettake at most <math>P2k-2</math> haveiterations. This corresponds to <math>nO(k)</math> characterstime complexity using the [[Big O notation]].
# Set <math>i = 0</math>, <math>j = T[i - 1]</math>.
# If <math>i = n</math> then we are done; terminate the algorithm. Otherwise, compare <math>P[i]</math> with <math>P[j]</math>.
#* If they are equal, set <math>T[i] = j + 1</math>, <math>j = j + 1</math>, and <math>i = i + 1</math>.
#* If not, and if <math>j > 0</math>, set <math>j = T[j - 1]</math>.
#* Otherwise, set <math>T[i] = 0</math>, <math>i = i + 1</math>, <math>j = 0</math>.
# Return to step 3.
 
===A code exampleEfficiency of the table-buildingKMP algorithm===
Since the two portions of the algorithm have, respectively, complexities of <code>O(k)</code> and <code>O(n)</code>, the complexity of the overall algorithm is <code>O(n + k)</code>.
 
These complexities are independent of how many repetitive patterns are in <code>W</code> or <code>S</code>.
A [[C programming language|C]] code example for the table-building algorithm is given here. As for the search algorithm, the bounds of <math>T</math> have been incremented by 1 to make the C code more natural. The additional variable <math>c</math> is employed to simulate the existence of <math>P[-1]</math>. It is also assumed that both this routine and the search routine are [[call]]ed as [[subroutine]]s of a [[wrapper]] which correctly [[Memory allocation|allocates]] [[Computer storage|memory]] for <math>T</math>.
 
It is known that the delay, that is the number of times a symbol of the text is compared to symbols of the pattern, is less than
void kmp_table(char *P)
<math>\lfloor \log_\Phi(k+1)\rfloor</math>, where Φ is the golden ration <math>(1+\sqrt 5)/2</math>. In 1993, an algorithm was given
{
that has a delay bounded by <math>\min(1+\lfloor \log_2 k\rfloor,|\Sigma|)</math> where Σ is the size of the alphabet (of the pattern).<ref>{{Cite conference
extern int T[];
| last = Simon
int i = 0;
| int jfirst = -1;Imre
| title = String matching algorithms and automata
char c;
| book-title = Results and Trends in Theoretical Computer Science: Colloquium in Honor of Arto Salomaa
| ceditor = '\0';
| publisher = Springer
T[0] = j;
| ___location =
while (P[i] != '\0') {
| pages = 386-395
if (P[i] == c) {
| date =
T[i + 1] = j + 1;
| year = 1994
++j;
| url =
++i;
| doi =
} else if (j > 0) {
}}</ref><ref>{{Cite journal
j = T[j];
| last = Hancart
} else {
| first = Christophe
T[i + 1] = 0;
| title = On Simon's String Searching Algorithm
++i;
| journal = Information Processing Letters
j = 0;
| volume = 47
}
| issue = 2
c = P[j];
| pages = 65-99
}
| date =
| year = 1993
return;
| doi = 10.1016/0020-0190(93)90231-W
}
| pmid =
| url = https://doi.org/10.1016/0020-0190(93)90231-W
| url-access = subscription
}}</ref>
 
==Variants==
===The efficiency of the table-building algorithm===
A [[Real-time computing|real-time]] version of KMP can be implemented using a separate failure function table for each character in the alphabet. If a mismatch occurs on character <math>x</math> in the text, the failure function table for character <math>x</math> is consulted for the index <math>i</math> in the pattern at which the mismatch took place. This will return the length of the longest substring ending at <math>i</math> matching a prefix of the pattern, with the added condition that the character after the prefix is <math>x</math>. With this restriction, character <math>x</math> in the text need not be checked again in the next phase, and so only a constant number of operations are executed between the processing of each index of the text{{Citation needed|date=July 2017}}. This satisfies the real-time computing restriction.
 
[[Lexicographically_minimal_string_rotation#Booth's_Algorithm|Booth's algorithm]] uses a modified version of the KMP preprocessing function to find the [[lexicographically minimal string rotation]]. The failure function is progressively calculated as the string is rotated.
The complexity of the table algorithm is <math>O(n)</math>, where <math>n</math> is the length of <math>P</math>. As except for some initialization all the work is done in step 3, it is sufficient to show that step 3 executes in <math>O(n)</math> time, which will be done by simultaneously examining the quantities <math>i</math> and <math>i - j</math>. In the first branch, <math>i - j</math> is preserved, as both <math>i</math> and <math>j</math> are incremented simultaneously, but naturally, <math>i</math> is increased. In the second branch, <math>j</math> is replaced by <math>T[j - 1]</math>, which we saw above is always strictly less than <math>j</math>, thus increasing <math>i - j</math>. In the third branch, <math>i</math> is incremented and <math>j</math> is not, so both <math>i</math> and <math>i - j</math> increase. Since <math>i \geq i - j</math>, this means that at each stage either <math>i</math> or a lower bound for <math>i</math> increases; therefore since the algorithm terminates once <math>i = n</math>, it must terminate after at most <math>n</math> iterations of step 3, since <math>i - j</math> begins at <math>1</math>. Therefore the complexity of the table algorithm is <math>O(n)</math>.
 
==Notes==
==The efficiency of the KMP algorithm==
{{reflist|group=note}}
 
==References==
Since the two portions of the algorithm have, respectively, complexities of <math>O(l)</math> and <math>O(n)</math>, the complexity of the overall algorithm is [[Big-O notation|O]]<math>(n + l)</math>.
{{Reflist}}
* {{cite book | first1=Thomas | last1=Cormen | author1-link=Thomas H. Cormen | first2=Charles E. | last2=Leiserson | author2-link=Charles E. Leiserson | first3=Ronald L. | last3=Rivest | author3-link=Ronald L. Rivest | first4=Clifford | last4=Stein | author4-link=Clifford Stein | title = Introduction to Algorithms | url=https://archive.org/details/introductiontoal00corm_691 | url-access=limited | edition = Second | publisher = MIT Press and McGraw-Hill | year = 2001 | isbn = 0-262-03293-7 | chapter = Section 32.4: The Knuth-Morris-Pratt algorithm | pages = [https://archive.org/details/introductiontoal00corm_691/page/n945 923]–931 | zbl=1047.68161 }}
* {{cite book | last1=Crochemore | first1=Maxime | last2=Rytter | first2=Wojciech | author2-link = Wojciech Rytter | title=Jewels of stringology. Text algorithms | ___location=River Edge, NJ | publisher=World Scientific | year=2003 | isbn=981-02-4897-0 | zbl=1078.68151 | pages=20–25|title-link= Jewels of Stringology }}
* {{cite book | last=Szpankowski | first=Wojciech | author-link = Wojciech Szpankowski | title=Average case analysis of algorithms on sequences | others=With a foreword by Philippe Flajolet | series=Wiley-Interscience Series in Discrete Mathematics and Optimization | ___location=Chichester | publisher=Wiley | year=2001 | isbn=0-471-24063-X | zbl=0968.68205 | pages=15–17,136–141 }}
 
==External links==
{{wikibooks|Algorithm implementation|String searching/Knuth-Morris-Pratt pattern matcher}}
*[http://www.ics.uci.edu/~eppstein/161/960227.html An explanation of the algorithm]
* [https://people.cs.pitt.edu/~kirk/cs1501/animations/String.html String Searching Applet animation]
* [https://ics.uci.edu/~eppstein/161/960227.html An explanation of the algorithm] and [https://ics.uci.edu/~eppstein/161/kmp/ sample C++ code] by [[David Eppstein]]
* [http://www-igm.univ-mlv.fr/~lecroq/string/node8.html Knuth-Morris-Pratt algorithm] description and C code by Christian Charras and Thierry Lecroq
* [https://hwlang.de/algorithmen/pattern/kmpen.htm Explanation of the algorithm from scratch] by H.W. Lang
* [https://web.archive.org/web/20101227102334/http://oak.cs.ucla.edu/cs144/examples/KMPSearch.html Breaking down steps of running KMP] by Chu-Cheng Hsieh.
* [https://www.youtube.com/watch?v=Zj_er99KMb8 NPTELHRD YouTube lecture video]
* [https://www.youtube.com/watch?v=4jY57Ehc14Y LogicFirst YouTube lecture video]
* [http://toccata.lri.fr/gallery/kmp.en.html Proof of correctness]
* [http://www.avhohlov.narod.ru/p2250en.htm Transformation between different forms of algorithm] {{Webarchive|url=https://web.archive.org/web/20230707162253/http://www.avhohlov.narod.ru/p2250en.htm|date=July 7, 2023}}
* [https://github.com/rvhuang/kmp-algorithm Knuth-Morris-Pratt algorithm written in C#]
* [https://www.w3spot.com/2020/07/kmp-algorithm-explained-in-plain-english.html KMP algorithm search time complexity explained in plain English]
 
{{Strings}}
[[Category:Algorithms on strings]]
{{Donald Knuth navbox}}
 
[[es{{DEFAULTSORT:Algoritmo Knuth-Morris-Pratt]] Algorithm}}
[[Category:String matching algorithms]]
[[Category:Donald Knuth]]
[[Category:Articles with example pseudocode]]
[[Category:1970 in computing]]