Knuth–Morris–Pratt algorithm: Difference between revisions

rm someone's personal site per WP:EL
OAbot (talk | contribs)
m Open access bot: url-access=subscription updated in citation with #oabot.
 
(4 intermediate revisions by 4 users not shown)
Line 11:
| language = ru
| last1 = Матиясевич
| first1 = Юрий
| title = О распознавании в реальное время отношения вхождения
| journal = Записки научных семинаров Ленинградского отделения Математического института им. В.А.Стеклова
| volume = 20
| year = 1971
| pages = 104–114
Line 21:
| language = en
| last1 = Matiyasevich
| first1 = Yuri
| title = Real-time recognition of the inclusion relation
| journal = Journal of Soviet Mathematics
| volume = 1
| year = 1973
| pages = 64–70
Line 30:
| s2cid = 121919479
| url = http://logic.pdmi.ras.ru/~yumat/Journal/inclusion/inclusion.pdf.gz
| access-date = 2017-07-04
| archive-date = 2021-04-30
| archive-url = https://web.archive.org/web/20210430124227/https://logic.pdmi.ras.ru/~yumat/Journal/inclusion/inclusion.pdf.gz
| url-status = live
}}</ref><ref>Knuth mentions this fact in the errata of his book ''Selected Papers on Design of Algorithms '' : {{quotation|I learned in 2012 that Yuri Matiyasevich had anticipated the linear-time pattern matching and pattern preprocessing algorithms of this paper, in the special case of a binary alphabet, already in 1969. He presented them as constructions for a Turing machine with a two-dimensional working memory.}}</ref> discovered a similar algorithm, coded by a two-dimensional Turing machine, while studying a string-pattern-matching recognition problem over a binary alphabet. This was the first linear-time algorithm for string matching.<ref>{{cite journal
| last1 = Amir | first1 = Amihood
Line 49 ⟶ 53:
The most straightforward algorithm, known as the "[[Brute-force search|brute-force]]" or "naive" algorithm, is to look for a word match at each index <code>m</code>, i.e. the position in the string being searched that corresponds to the character <code>S[m]</code>. At each position <code>m</code> the algorithm first checks for equality of the first character in the word being searched, i.e. <code>S[m] =? W[0]</code>. If a match is found, the algorithm tests the other characters in the word being searched by checking successive values of the word position index, <code>i</code>. The algorithm retrieves the character <code>W[i]</code> in the word being searched and checks for equality of the expression <code>S[m+i] =? W[i]</code>. If all successive characters match in <code>W</code> at position <code>m</code>, then a match is found at that position in the search string. If the index <code>m</code> reaches the end of the string then there is no match, in which case the search is said to "fail".
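The procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name <code>naive_search</code> is an assumption, not part of the article, and the loop stops at the last index where <code>W</code> can still fit rather than running to the very end of <code>S</code>.

```python
def naive_search(s, w):
    """Return the first index m at which w occurs in s, or -1 if the search fails."""
    n, k = len(s), len(w)
    # Try every position m at which w could still fit inside s.
    for m in range(n - k + 1):
        i = 0
        # Check S[m+i] =? W[i] for successive values of i.
        while i < k and s[m + i] == w[i]:
            i += 1
        if i == k:  # every character of w matched at position m
            return m
    return -1  # the search "fails"
```

For instance, <code>naive_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD")</code> returns the index of the first full occurrence of the word.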
 
Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 26<sup>2</sup> (1 in 676). So if the characters are random, then the expected complexity of searching string <code>S[]</code> of length ''n'' is on the order of ''n'' comparisons or ''Θ''(''n''). The expected performance is very good. If <code>S[]</code> is 1 million characters and <code>W[]</code> is 1000 characters, then the string search should complete after about 1.04 million character comparisons.
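
The figure of about 1.04 million can be checked with a short calculation, assuming that each character comparison succeeds independently with probability 1/26. A trial at one position always performs a first comparison, performs a second one only if the first succeeded, a third only if the first two succeeded, and so on. Treating the 1000-character word as effectively unbounded (the omitted terms are negligible), the expected number of comparisons per trial position is approximately

<math display="block">\sum_{j=0}^{\infty} \left(\frac{1}{26}\right)^{j} = \frac{1}{1 - 1/26} = \frac{26}{25} = 1.04,</math>

which, over roughly 1 million trial positions, gives about 1.04 million comparisons.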
 
That expected performance is not guaranteed. If the strings are not random, then checking a trial <code>m</code> may take many character comparisons. The worst case is if the two strings match in all but the last letter. Imagine that the string <code>S[]</code> consists of 1 million characters that are all ''A'', and that the word <code>W[]</code> is 999 ''A'' characters terminating in a final ''B'' character. The simple string-matching algorithm will now examine 1000 characters at each trial position before rejecting the match and advancing the trial position. The simple string search example would now take about 1000 character comparisons times 1 million positions for 1 billion character comparisons. If the length of <code>W[]</code> is ''k'', then the worst-case performance is ''O''(''k''&sdot;''n'').
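
The worst case can be reproduced on a smaller scale by counting comparisons directly. The sketch below is an illustration only: the helper <code>count_comparisons</code> is a hypothetical name, and the sizes (10,000 characters and a 100-character word) are scaled down from the figures in the text so that the example runs quickly.

```python
def count_comparisons(s, w):
    """Naive search that also counts character comparisons.

    Returns (index of first match or -1, number of comparisons made).
    """
    n, k = len(s), len(w)
    comparisons = 0
    for m in range(n - k + 1):
        for i in range(k):
            comparisons += 1
            if s[m + i] != w[i]:
                break  # mismatch: reject this trial position
        else:
            return m, comparisons  # all k characters matched
    return -1, comparisons


# Scaled-down worst case: the text is all 'A', the word is 99 'A's then a 'B'.
s = "A" * 10_000
w = "A" * 99 + "B"
index, comparisons = count_comparisons(s, w)
```

Every trial position here costs the full ''k'' = 100 comparisons before the final <code>B</code> mismatches, so the count grows as the product of ''k'' and the number of trial positions.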
Line 189 ⟶ 193:
 
===Working example of the table-building algorithm===
We consider the example of <code>W = "ABCDABD"</code> first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set <code>T[0] = -1</code>. To find <code>T[1]</code>, we must discover a [[Substring#Suffix|proper suffix]] of <code>"A"</code> which is also a prefix of pattern <code>W</code>. But there are no proper suffixes of <code>"A"</code>, so we set <code>T[1] = 0</code>. To find <code>T[2]</code>, we see that the substring <code>W[0]</code> - <code>W[1]</code> (<code>"AB"</code>) has a proper suffix <code>"B"</code>. However "B" is not a prefix of the pattern <code>W</code>. Therefore, we set <code>T[2] = 0</code>.
 
Continuing to <code>T[3]</code>, we first check the proper suffix of length 1, and as in the previous case it fails. Should we also check longer suffixes? No, we now note that there is a shortcut to checking ''all'' suffixes: suppose that we discovered a [[Substring#Suffix|proper suffix]] which is also a [[Substring#Prefix|proper prefix]] (a proper prefix of a string is not equal to the string itself), ending at <code>W[2]</code> and having length 2 (the maximum possible). Then its first character alone would be a proper prefix of <code>W</code> that ends at <code>W[1]</code>, that is, a match of length 1 at the previous stage, which we already determined did not occur, since <code>T[2] = 0</code> and not <code>T[2] = 1</code>. Hence at each stage, the shortcut rule is that one needs to consider checking suffixes of a given size ''m''+1 only if a valid suffix of size ''m'' was found at the previous stage (i.e. <code>T[x] = m</code>), and one should not bother to check ''m''+2, ''m''+3, etc.
 
Therefore, we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so <code>T[3] = 0</code>.
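
The shortcut rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the name <code>border_table</code> is hypothetical, and the function stores the plain length of the longest proper suffix that is also a prefix, with <code>T[0] = -1</code> as above. Any further refinement of individual entries is not shown here.

```python
def border_table(w):
    """Return T with T[0] = -1 and, for i >= 1, T[i] equal to the length of
    the longest proper suffix of w[:i] that is also a prefix of w."""
    t = [-1] * (len(w) + 1)
    cnd = -1  # length of the candidate found at the previous stage
    for i, ch in enumerate(w):
        # A candidate of length cnd + 1 is only possible if one of length
        # cnd existed before; otherwise fall back to shorter candidates.
        while cnd >= 0 and w[cnd] != ch:
            cnd = t[cnd]
        cnd += 1
        t[i + 1] = cnd
    return t


table = border_table("ABCDABD")
```

For <code>W = "ABCDABD"</code> this gives <code>T[1] = T[2] = T[3] = 0</code>, in agreement with the values derived by hand above.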
Line 438 ⟶ 442:
Since the two portions of the algorithm have, respectively, complexities of <code>O(k)</code> and <code>O(n)</code>, the complexity of the overall algorithm is <code>O(n + k)</code>.
 
These complexities are independent of how many repetitive patterns are in <code>W</code> or <code>S</code>.
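
The linear bound can be observed empirically by counting comparisons during the search. The sketch below is illustrative only: the names <code>border_table</code> and <code>kmp_search</code> are assumptions, the table stores plain suffix/prefix lengths with <code>T[0] = -1</code>, and the word is assumed to be non-empty.

```python
def border_table(w):
    """T[0] = -1; T[i] = length of the longest proper suffix of w[:i]
    that is also a prefix of w."""
    t = [-1] * (len(w) + 1)
    k = -1
    for i, ch in enumerate(w):
        while k >= 0 and w[k] != ch:
            k = t[k]
        k += 1
        t[i + 1] = k
    return t


def kmp_search(s, w):
    """Return (list of match positions, number of text/word comparisons).

    Assumes w is non-empty.
    """
    t = border_table(w)
    matches, comparisons = [], 0
    k = 0  # number of characters of w currently matched
    for j, ch in enumerate(s):
        while True:
            comparisons += 1
            if w[k] == ch:
                k += 1
                break
            k = t[k]   # mismatch: fall back to a shorter partial match
            if k < 0:  # nothing left to fall back to: move on in the text
                k = 0
                break
        if k == len(w):
            matches.append(j - k + 1)
            k = t[k]   # keep going, allowing overlapping matches
    return matches, comparisons


# Highly repetitive input that is a worst case for the naive algorithm.
s = "A" * 1000
w = "A" * 9 + "B"
matches, comparisons = kmp_search(s, w)
```

On this repetitive input the number of comparisons during the search stays below 2''n'', whereas the naive algorithm would need close to ''k''&sdot;''n''.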
 
It is known that the delay, that is, the number of times a symbol of the text is compared to symbols of the pattern, is less than
<math>\lfloor \log_\Phi(k+1)\rfloor</math>, where Φ is the golden ratio <math>(1+\sqrt 5)/2</math>. In 1993, an algorithm was given
that has a delay bounded by <math>\min(1+\lfloor \log_2 k\rfloor,|\Sigma|)</math>, where Σ is the alphabet (of the pattern) and |Σ| is its size.<ref>{{Cite conference
| last = Simon
| first = Imre
| title = String matching algorithms and automata
| book-title = Results and Trends in Theoretical Computer Science: Colloquium in Honor of Arto Salomaa
| publisher = Springer
| pages = 386–395
| year = 1994
}}</ref><ref>{{Cite journal
| last = Hancart
| first = Christophe
| title = On Simon's String Searching Algorithm
| journal = Information Processing Letters
| volume = 47
| issue = 2
| pages = 65–99
| year = 1993
| doi = 10.1016/0020-0190(93)90231-W
| url = https://doi.org/10.1016/0020-0190(93)90231-W
| url-access = subscription
}}</ref>
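
To get a feel for the size of these two expressions, they can be evaluated numerically. The snippet below is only a sketch: the function names are hypothetical, it evaluates the formulas quoted above and nothing more, and the floating-point logarithm could in principle round incorrectly for very large ''k''.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # the golden ratio


def kmp_delay_expression(k):
    """Evaluate floor(log_PHI(k + 1)) for a pattern of length k."""
    return math.floor(math.log(k + 1) / math.log(PHI))


def improved_delay_expression(k, alphabet_size):
    """Evaluate min(1 + floor(log2 k), |Sigma|) for a pattern of length k >= 1."""
    floor_log2_k = k.bit_length() - 1  # exact floor(log2 k) for integers
    return min(1 + floor_log2_k, alphabet_size)


kmp_value = kmp_delay_expression(1000)
improved_value = improved_delay_expression(1000, 26)
binary_value = improved_delay_expression(1000, 2)
```

For a pattern of length ''k'' = 1000 the first expression evaluates to 14, while the second evaluates to 10 for a 26-letter alphabet and to 2 for a binary alphabet.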
 
==Variants==