Knuth–Morris–Pratt algorithm: Difference between revisions

Content deleted Content added
m Incorrect chance of a random match. Didn’t take into account that there are 26 possible matches.
Tags: Visual edit Mobile edit Mobile web edit
Likes2wiki (talk | contribs)
m Background: fix capitalization
Line 47:
A string-matching algorithm wants to find the starting index <code>m</code> in string <code>S[]</code> that matches the search word <code>W[]</code>.
 
The most straightforward algorithm, known as the "[[Brute-force search|Brutebrute-force]]" or "Naivenaive" algorithm, is to look for a word match at each index <code>m</code>, i.e. the position in the string being searched that corresponds to the character <code>S[m]</code>. At each position <code>m</code> the algorithm first checks for equality of the first character in the word being searched, i.e. <code>S[m] =? W[0]</code>. If a match is found, the algorithm tests the other characters in the word being searched by checking successive values of the word position index, <code>i</code>. The algorithm retrieves the character <code>W[i]</code> in the word being searched and checks for equality of the expression <code>S[m+i] =? W[i]</code>. If all successive characters match in <code>W</code> at position <code>m</code>, then a match is found at that position in the search string. If the index <code>m</code> reaches the end of the string then there is no match, in which case the search is said to "fail".
 
Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 26 (1 in 26^2 chances of a match over 26 possible letters). So if the characters are random, then the expected complexity of searching string <code>S[]</code> of length ''n'' is on the order of ''n'' comparisons or ''O''(''n''). The expected performance is very good. If <code>S[]</code> is 1 million characters and <code>W[]</code> is 1000 characters, then the string search should complete after about 1.04 million character comparisons.