Knuth–Morris–Pratt algorithm
{{Short description|Algorithm for finding sub-text ___location(s) inside a given string in O(n) time}}<!--If you are thinking of adding an implementation of this algorithm in a particular language, think again. See the talk page.-->{{Infobox algorithm|name=Knuth–Morris–Pratt algorithm|data=[[String (computer science)|String]]|class=[[String-searching algorithm|String search]]|time=<math>\Theta(m)</math> preprocessing + <math>\Theta(n)</math> matching<ref group="note"><math>m</math> is the length of the pattern, which is the string we are searching for in the text which is of length <math>n</math></ref>|space=<math>\Theta(m)</math>}}
In [[computer science]], the '''Knuth–Morris–Pratt [[string-searching algorithm]]''' (or '''KMP algorithm''') searches for occurrences of a "word" <code>W</code> within a main "text string" <code>S</code> by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

The [[algorithm]] was conceived by [[James H. Morris]] and independently discovered by [[Donald Knuth]] "a few weeks later" from [[automata theory]].<ref name=knuth1977>
{{cite journal|last1=Knuth|first1=Donald|last2=Morris|first2=James H.|last3=Pratt|first3=Vaughan|title=Fast pattern matching in strings|journal=SIAM Journal on Computing|date=1977|volume=6|issue=2|pages=323–350|doi=10.1137/0206024|citeseerx=10.1.1.93.8147}}</ref><ref>
{{cite journal | last1=Knuth | first1=Donald E. | title=The Dangers of Computer-Science Theory | journal=Studies in Logic and the Foundations of Mathematics | volume=74 | year=1973 | pages=189–195 | doi=10.1016/S0049-237X(09)70357-X | isbn=978-0-444-10491-5 }}</ref>
Morris and [[Vaughan Pratt]] published a technical report in 1970.<ref>
{{cite tech report | last1=Morris | first1=J.H., Jr | last2=Pratt | first2=V. | title=A linear pattern-matching algorithm | number=TR-40 | year=1970 | institution=University of California, Berkeley, Computation Center}}</ref>
The three also published the algorithm jointly in 1977.<ref name=knuth1977></ref> Independently, in 1969, [[Yuri Matiyasevich|Matiyasevich]]<ref>{{cite journal
| language = ru
| last1 = Матиясевич
| first1 = Юрий
| title = О распознавании в реальное время отношения вхождения
| trans-title = Real-time recognition of the inclusion relation
| journal = Записки научных семинаров Ленинградского отделения Математического института им. В.А.Стеклова
| volume = 20
| year = 1971
| pages = 104–114
}}</ref><ref>{{cite journal
| language = en
| last1 = Matiyasevich
| first1 = Yuri
| title = Real-time recognition of the inclusion relation
| journal = Journal of Soviet Mathematics
| volume = 1
| year = 1973
| pages = 64–70
| doi = 10.1007/BF01117471
| s2cid = 121919479
| url = http://logic.pdmi.ras.ru/~yumat/Journal/inclusion/inclusion.pdf.gz
| access-date = 2017-07-04
| archive-date = 2021-04-30
| archive-url = https://web.archive.org/web/20210430124227/https://logic.pdmi.ras.ru/~yumat/Journal/inclusion/inclusion.pdf.gz
| url-status = live
}}</ref><ref>Knuth mentions this fact in the errata of his book ''Selected Papers on Design of Algorithms'': {{quotation|I learned in 2012 that Yuri Matiyasevich had anticipated the linear-time pattern matching and pattern preprocessing algorithms of this paper, in the special case of a binary alphabet, already in 1969. He presented them as constructions for a Turing machine with a two-dimensional working memory.}}</ref> discovered a similar algorithm, coded by a two-dimensional Turing machine, while studying a string-pattern-matching recognition problem over a binary alphabet. This was the first linear-time algorithm for string matching.<ref>{{cite journal
| last1 = Amir | first1 = Amihood
| title = Dynamic text and static pattern matching
| volume = 3
| year = 2007
| s2cid = 8409826
}}</ref>
 
==Background==
A string-matching algorithm wants to find the starting index <code>m</code> in string <code>S[]</code> that matches the search word <code>W[]</code>.
 
The most straightforward algorithm, known as the "[[Brute-force search|brute-force]]" or "naive" algorithm, is to look for a word match at each index <code>m</code>, i.e. the position in the string being searched that corresponds to the character <code>S[m]</code>. At each position <code>m</code> the algorithm first checks for equality of the first character in the word being searched, i.e. <code>S[m] =? W[0]</code>. If a match is found, the algorithm tests the other characters in the word being searched by checking successive values of the word position index, <code>i</code>. The algorithm retrieves the character <code>W[i]</code> in the word being searched and checks for equality of the expression <code>S[m+i] =? W[i]</code>. If all successive characters match in <code>W</code> at position <code>m</code>, then a match is found at that position in the search string. If the index <code>m</code> reaches the end of the string then there is no match, in which case the search is said to "fail".
 
Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 26<sup>2</sup> (1 in 676). So if the characters are random, then the expected complexity of searching string <code>S[]</code> of length ''n'' is on the order of ''n'' comparisons or ''Θ''(''n''). The expected performance is very good. If <code>S[]</code> is 1 million characters and <code>W[]</code> is 1000 characters, then the string search should complete after about 1.04 million character comparisons.
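The 1.04 figure follows from summing the expected number of comparisons per trial position as a geometric series (a quick sanity check, not part of the original analysis): the first comparison always occurs, the second with probability 1/26, the third with probability 1/26<sup>2</sup>, and so on, giving

:<math>n\sum_{i=0}^{\infty} 26^{-i} = n \cdot \tfrac{26}{25} = 1.04n</math>

comparisons in expectation.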
 
That expected performance is not guaranteed. If the strings are not random, then checking a trial <code>m</code> may take many character comparisons. The worst case is if the two strings match in all but the last letter. Imagine that the string <code>S[]</code> consists of 1 million characters that are all ''A'', and that the word <code>W[]</code> is 999 ''A'' characters terminating in a final ''B'' character. The simple string-matching algorithm will now examine 1000 characters at each trial position before rejecting the match and advancing the trial position. The simple string search example would now take about 1000 character comparisons times 1 million positions for 1 billion character comparisons. If the length of <code>W[]</code> is ''k'', then the worst-case performance is ''O''(''k''&sdot;''n'').
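As an illustration, here is a minimal brute-force search in Python (a sketch for this discussion, not part of the article's pseudocode), together with the adversarial input just described:

<syntaxhighlight lang="python">
def naive_search(S, W):
    """Try a match at every index m; worst case O(k*n) comparisons."""
    n, k = len(S), len(W)
    for m in range(n - k + 1):
        i = 0
        while i < k and S[m + i] == W[i]:  # compare until a mismatch
            i += 1
        if i == k:
            return m       # match found at index m
    return -1              # search failed

# Worst case: ~1000 comparisons at each of ~1 million trial positions.
# naive_search("A" * 1_000_000, "A" * 999 + "B")
</syntaxhighlight>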
===Description of pseudocode for the search algorithm===
<!--Note to future editors: please do not replace or even supplement the pseudocode with language-specific code. Following the WikiProject Computer Science manual of style, pseudocode is preferred over real code unless the real code illustrates a feature of the language or an implementation detail. This algorithm is so simple that neither of these can ever be the case-->
The above example contains all the elements of the algorithm. For the moment, we assume the existence of a "partial match" table <code>T</code>, described [[#"Partial match" table (also known as "failure function")|below]], which indicates where we need to look for the start of a new match when a mismatch is found. The entries of <code>T</code> are constructed so that if we have a match starting at <code>S[m]</code> that fails when comparing <code>S[m + i]</code> to <code>W[i]</code>, then the next possible match will start at index <code>m + i - T[i]</code> in <code>S</code> (that is, <code>T[i]</code> is the amount of "backtracking" we need to do after a mismatch). This has two implications: first, <code>T[0] = -1</code>, which indicates that if <code>W[0]</code> is a mismatch, we cannot backtrack and must simply check the next character; and second, although the next possible match will ''begin'' at index <code>m + i - T[i]</code>, as in the example above, we need not actually check any of the <code>T[i]</code> characters after that, so that we continue searching from <code>W[T[i]]</code>. The following is a sample [[pseudocode]] implementation of the KMP search algorithm.
 
'''algorithm''' ''kmp_search'':
'''input''':
'''let''' j ← j + 1
'''let''' k ← k + 1
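For readers who prefer running code, the same logic can be transliterated into Python. The sketch below mirrors the pseudocode; it assumes the table <code>T</code> built later in this article, including the extra final entry <code>T[length(W)]</code>:

<syntaxhighlight lang="python">
def kmp_search(S, W, T):
    """Return all start positions of W in S, using the partial-match
    table T (T[0] == -1 and len(T) == len(W) + 1, as built below)."""
    matches = []
    j = 0  # position of the current character in S
    k = 0  # position of the current character in W
    while j < len(S):
        if W[k] == S[j]:
            j += 1
            k += 1
            if k == len(W):
                matches.append(j - k)  # occurrence found
                k = T[k]               # fall back via the table, keep going
        else:
            k = T[k]
            if k < 0:                  # mismatch at W[0]: advance in S
                j += 1
                k += 1
    return matches
</syntaxhighlight>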
 
 
===Efficiency of the search algorithm===
 
===Working example of the table-building algorithm===
We consider the example of <code>W = "ABCDABD"</code> first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set <code>T[0] = -1</code>. To find <code>T[1]</code>, we must discover a [[Substring#Suffix|proper suffix]] of <code>"A"</code> which is also a prefix of pattern <code>W</code>. But there are no proper suffixes of <code>"A"</code>, so we set <code>T[1] = 0</code>. To find <code>T[2]</code>, we see that the substring <code>W[0]</code>–<code>W[1]</code> (<code>"AB"</code>) has a proper suffix <code>"B"</code>. However, <code>"B"</code> is not a prefix of the pattern <code>W</code>. Therefore, we set <code>T[2] = 0</code>.
 
Continuing to <code>T[3]</code>, we first check the proper suffix of length 1, and as in the previous case it fails. Should we also check longer suffixes? No; we now note that there is a shortcut to checking ''all'' suffixes: let us say that we discovered a [[Substring#Suffix|proper suffix]] which is a [[Substring#Prefix|proper prefix]] (a proper prefix of a string is not equal to the string itself), ending at <code>W[2]</code> with length 2 (the maximum possible); then its first character is also a proper prefix of <code>W</code>, hence a proper prefix itself, and it ends at <code>W[1]</code>, which we already determined did not occur, as <code>T[2] = 0</code> and not <code>T[2] = 1</code>. Hence at each stage, the shortcut rule is that one needs to consider checking suffixes of a given size m+1 only if a valid suffix of size m was found at the previous stage (i.e. <code>T[x] = m</code>), and should not bother to check m+2, m+3, etc.
 
Therefore, we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so <code>T[3] = 0</code>.
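The reasoning above can be checked mechanically. The following Python sketch (illustrative only, not the article's algorithm) computes by brute force the widest proper suffix of each prefix of <code>W</code> that is also a prefix of <code>W</code>, which is the quantity the worked example reasons about; the table-building pseudocode below computes the same information in linear time, with an additional KMP refinement that can replace some entries:

<syntaxhighlight lang="python">
def widest_border(prefix):
    """Length of the longest proper suffix of `prefix` that is also a
    proper prefix of it, found by brute force."""
    for width in range(len(prefix) - 1, 0, -1):
        if prefix.endswith(prefix[:width]):
            return width
    return 0

W = "ABCDABD"
for pos in range(1, len(W) + 1):
    print(pos, repr(W[:pos]), widest_border(W[:pos]))
# 1 'A' 0, 2 'AB' 0, 3 'ABC' 0, 4 'ABCD' 0,
# 5 'ABCDA' 1, 6 'ABCDAB' 2, 7 'ABCDABD' 0
</syntaxhighlight>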
 
===Description of pseudocode for the table-building algorithm===
<!--Note to future editors: please do not replace or even supplement the pseudocode with language-specific code. Following the WikiProject Computer Science manual of style, pseudocode is preferred over real code unless the real code illustrates a feature of the language or an implementation detail. This algorithm is so simple that neither of these can ever be the case-->
<!--This code appears to implement the table for the Morris-Pratt algorithm, not for the Knuth-Morris-Pratt Algorithm. The KMP table has bigger shifts. See http://www-igm.univ-mlv.fr/~lecroq/string/node8.html-->
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. The only minor complication is that the logic which is correct late in the string erroneously gives non-proper substrings at the beginning. This necessitates some initialization code.
'''else'''
'''let''' T[pos] ← cnd
'''while''' cnd ≥ 0 '''and''' W[pos] ≠ W[cnd] '''do'''
'''let''' cnd ← T[cnd]
'''let''' pos ← pos + 1, cnd ← cnd + 1
'''let''' T[pos] ← cnd (only needed when all word occurrences are searched)
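A direct Python transliteration of this pseudocode (a convenience sketch; variable names follow the pseudocode above):

<syntaxhighlight lang="python">
def kmp_table(W):
    """Build the partial-match table T of length len(W) + 1,
    with T[0] = -1, following the pseudocode above."""
    T = [0] * (len(W) + 1)
    T[0] = -1
    pos, cnd = 1, 0
    while pos < len(W):
        if W[pos] == W[cnd]:
            T[pos] = T[cnd]      # KMP refinement: inherit the fallback entry
        else:
            T[pos] = cnd
            while cnd >= 0 and W[pos] != W[cnd]:
                cnd = T[cnd]
        pos += 1
        cnd += 1
    T[pos] = cnd                 # only needed when all occurrences are searched
    return T

# kmp_table("ABCDABD") -> [-1, 0, 0, 0, -1, 0, 2, 0]
</syntaxhighlight>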
 
===Efficiency of the table-building algorithm===
 
The time (and space) complexity of the table algorithm is <math>O(k)</math>, where <math>k</math> is the length of <code>W</code>.

* The outer loop: <code>pos</code> is initialized to 1, the loop condition is <code>pos < k</code>, and <code>pos</code> is increased by 1 in every iteration. Thus the loop takes <math>k - 1</math> iterations.
 
* The inner loop: <code>cnd</code> is initialized to <code>0</code> and gets increased by at most 1 in each outer loop iteration. <code>T[cnd]</code> is always less than <code>cnd</code>, so <code>cnd</code> gets decreased by at least 1 in each inner loop iteration; the inner loop condition is <code>cnd ≥ 0</code>. This means that the inner loop can execute at most as many times in total as the outer loop has executed – each decrease of <code>cnd</code> by 1 in the inner loop needs to have a corresponding increase by 1 in the outer loop. Since the outer loop takes <math>k - 1</math> iterations, the inner loop can take no more than <math>k - 1</math> iterations in total.
 
Combined, the outer and inner loops take at most <math>2k-2</math> iterations. This corresponds to <math>O(k)</math> time complexity using the [[Big O notation]].
 
==Efficiency of the KMP algorithm==
Since the two portions of the algorithm have, respectively, complexities of <math>O(k)</math> and <math>O(n)</math>, the complexity of the overall algorithm is <math>O(n + k)</math>.
 
These complexities are independent of how many repetitive patterns are in <code>W</code> or <code>S</code>.
 
It is known that the delay, that is, the number of times a symbol of the text is compared to symbols of the pattern, is less than <math>\lfloor \log_\Phi(k+1)\rfloor</math>, where <math>\Phi</math> is the golden ratio <math>(1+\sqrt 5)/2</math>. In 1993, an algorithm was given that has a delay bounded by <math>\min(1+\lfloor \log_2 k\rfloor,|\Sigma|)</math>, where <math>\Sigma</math> is the alphabet of the pattern.<ref>{{Cite conference
| last = Simon
| first = Imre
| title = String matching algorithms and automata
| book-title = Results and Trends in Theoretical Computer Science: Colloquium in Honor of Arto Salomaa
| publisher = Springer
| pages = 386–395
| year = 1994
}}</ref><ref>{{Cite journal
| last = Hancart
| first = Christophe
| title = On Simon's String Searching Algorithm
| journal = Information Processing Letters
| volume = 47
| issue = 2
| pages = 65–99
| year = 1993
| doi = 10.1016/0020-0190(93)90231-W
| url = https://doi.org/10.1016/0020-0190(93)90231-W
| url-access = subscription
}}</ref>
 
==Variants==
A [[Real-time computing|real-time]] version of KMP can be implemented using a separate failure function table for each character in the alphabet. If a mismatch occurs on character <math>x</math> in the text, the failure function table for character <math>x</math> is consulted for the index <math>i</math> in the pattern at which the mismatch took place. This will return the length of the longest substring ending at <math>i</math> matching a prefix of the pattern, with the added condition that the character after the prefix is <math>x</math>. With this restriction, character <math>x</math> in the text need not be checked again in the next phase, and so only a constant number of operations are executed between the processing of each index of the text{{Citation needed|date=July 2017}}. This satisfies the real-time computing restriction.
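One way to realize this idea is to expand the failure function into a full deterministic automaton, with one next-state column per alphabet character, so that each text symbol costs exactly one table lookup. The following Python sketch is illustrative (the names and table layout are assumptions, and this is the standard automaton-based formulation of KMP rather than code from a cited source):

<syntaxhighlight lang="python">
def build_dfa(W, alphabet):
    """dfa[c][i]: next state after reading c in state i, where state i
    means "the last i text characters match W[0..i-1]"."""
    m = len(W)
    dfa = {c: [0] * m for c in alphabet}
    dfa[W[0]][0] = 1
    x = 0  # restart state: state of the automaton after W[1..i-1]
    for i in range(1, m):
        for c in alphabet:
            dfa[c][i] = dfa[c][x]  # on mismatch, behave like the restart state
        dfa[W[i]][i] = i + 1       # on match, advance
        x = dfa[W[i]][x]
    return dfa

def realtime_search(S, W, dfa):
    """Constant work per text character; returns first occurrence or -1."""
    state = 0
    for j, c in enumerate(S):
        state = dfa[c][state] if c in dfa else 0
        if state == len(W):
            return j - len(W) + 1
    return -1
</syntaxhighlight>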
 
[[Lexicographically_minimal_string_rotation#Booth's_Algorithm|Booth's algorithm]] uses a modified version of the KMP preprocessing function to find the [[lexicographically minimal string rotation]]. The failure function is progressively calculated as the string is rotated.
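For illustration, here is a compact Python sketch of Booth's algorithm (variable names are illustrative; <code>f</code> plays the role of the KMP failure function, computed on the fly over the doubled string):

<syntaxhighlight lang="python">
def least_rotation(s):
    """Index of the lexicographically minimal rotation of s (Booth)."""
    s += s                      # concatenate to itself to cover all rotations
    f = [-1] * len(s)           # failure function, filled in progressively
    k = 0                       # candidate start index of the least rotation
    for j in range(1, len(s)):
        sj = s[j]
        i = f[j - k - 1]
        while i != -1 and sj != s[k + i + 1]:
            if sj < s[k + i + 1]:
                k = j - i - 1   # found a smaller rotation candidate
            i = f[i]
        if sj != s[k + i + 1]:  # note: i == -1 here
            if sj < s[k]:
                k = j
            f[j - k] = -1
        else:
            f[j - k] = i + 1
    return k

# least_rotation("cbca") == 3  ("acbc" is the least rotation)
</syntaxhighlight>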
 
==Notes==
{{reflist|group=note}}
 
==References==
{{Reflist}}
* {{cite book | first1=Thomas | last1=Cormen | author1-link=Thomas H. Cormen | first2=Charles E. | last2=Leiserson | author2-link=Charles E. Leiserson | first3=Ronald L. | last3=Rivest | author3-link=Ronald L. Rivest | first4=Clifford | last4=Stein | author4-link=Clifford Stein | title = Introduction to Algorithms | url=https://archive.org/details/introductiontoal00corm_691 | url-access=limited | edition = Second | publisher = MIT Press and McGraw-Hill | year = 2001 | isbn = 0-262-03293-7 | chapter = Section 32.4: The Knuth-Morris-Pratt algorithm | pages = [https://archive.org/details/introductiontoal00corm_691/page/n945 923]–931 | zbl=1047.68161 }}
* {{cite book | last1=Crochemore | first1=Maxime | last2=Rytter | first2=Wojciech | author2-link = Wojciech Rytter | title=Jewels of stringology. Text algorithms | ___location=River Edge, NJ | publisher=World Scientific | year=2003 | isbn=981-02-4897-0 | zbl=1078.68151 | pages=20–25|title-link= Jewels of Stringology }}
* {{cite book | last=Szpankowski | first=Wojciech | author-link = Wojciech Szpankowski | title=Average case analysis of algorithms on sequences | others=With a foreword by Philippe Flajolet | series=Wiley-Interscience Series in Discrete Mathematics and Optimization | ___location=Chichester | publisher=Wiley | year=2001 | isbn=0-471-24063-X | zbl=0968.68205 | pages=15–17, 136–141 }}
 
==External links==
{{wikibooks|Algorithm implementation|String searching/Knuth-Morris-Pratt pattern matcher}}
* [https://people.cs.pitt.edu/~kirk/cs1501/animations/String.html String Searching Applet animation]
* [https://www.ics.uci.edu/~eppstein/161/960227.html An explanation of the algorithm] and [https://www.ics.uci.edu/~eppstein/161/kmp/ sample C++ code] by [[David Eppstein]]
* [http://www-igm.univ-mlv.fr/~lecroq/string/node8.html Knuth-Morris-Pratt algorithm] description and C code by Christian Charras and Thierry Lecroq
* [https://hwlang.de/algorithmen/pattern/kmpen.htm Explanation of the algorithm from scratch] by H.W. Lang
* [https://web.archive.org/web/20101227102334/http://oak.cs.ucla.edu/cs144/examples/KMPSearch.html Breaking down steps of running KMP] by Chu-Cheng Hsieh.
* [https://www.youtube.com/watch?v=Zj_er99KMb8 NPTELHRD YouTube lecture video]
* [https://www.youtube.com/watch?v=4jY57Ehc14Y LogicFirst YouTube lecture video]
* [http://toccata.lri.fr/gallery/kmp.en.html Proof of correctness]
* [http://www.avhohlov.narod.ru/p2250en.htm Transformation between different forms of algorithm] {{Webarchive|url=https://web.archive.org/web/20230707162253/http://www.avhohlov.narod.ru/p2250en.htm|date=July 7, 2023}}
* [https://github.com/rvhuang/kmp-algorithm Knuth-Morris-Pratt algorithm written in C#]
* [https://www.w3spot.com/2020/07/kmp-algorithm-explained-in-plain-english.html KMP algorithm search time complexity explained in plain English]
 
{{Strings}}
[[Category:Donald Knuth]]
[[Category:Articles with example pseudocode]]
[[Category:1970 in computing]]