Knuth–Morris–Pratt algorithm: Difference between revisions

Content deleted Content added
Ryan Reich (talk | contribs)
A less catastrophic post-catastrophe update to clarify the clarified explanation by replacing many English words by a few mathematical symbols, and also to fix the algorithm.
Ryan Reich (talk | contribs)
No edit summary
Line 12:
In this article, we will be using [[Array#Indices into Arrays|zero-based arrays]] for our strings; thus the 'C' in <math>P</math> will be denoted <math>P[2]</math>. Let <math>m</math> be the start of the currently matched substring within <math>S</math>, <math>i</math> the position within <math>P</math>, and <math>j</math> the absolute position within <math>S</math>.
 
We begin by matching the first three characters one after another. Thus in the fourth step, <math>m = 0</math>, <math>i = 3</math>, <math>j = 3</math>. But <math>S[3]</math> is a space and <math>P[3] = 'D'</math>, so we have a mismatch. Rather than beginning again <math>m = 1</math>, we note that no 'A' occurs between positions 0 and 3 in <math>P</math> except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, setting <math>m = 4</math>, <math>i = 0</math>, <math>j = 4</math>.
 
We quickly obtain a nearly complete match 'ABCDAB' when, at <math>i = 6<math>, </math>j = 10</math> we again have a discrepancy. However, just prior to the end of the current partial match we passed an 'AB', which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset <math>m = 8</math>, <math>i = 2</math> and continue matching the current character.
 
This search fails immediately, however, as the pattern still does not contain a ' ', so as in the first trial, we increment <math>m = 11</math> and reset <math>i = 0</math>, <math>j = 11</math>, and begin searching anew. Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set <math>m = 17</math>, to start at the two-character string 'AB' leading up to the current position, set <math>i = 2</math>, and continue matching from the current position. This time we are able to complete the match, and return the position 17 as its origin.
 
==The searching algorithm==
Line 25:
# Compare <math>P[i]</math> versus <math>S[j]</math>. At this point we have already matched <math>i</math> previous characters.
#* If they are equal, set <math>i = i + 1</math>, <math>j = j + 1</math>. If <math>i = n</math> then we have found a full match; terminate the algorithm and return <math>m</math> as the start of the match.
#* If they are unequal, let <math>e = T[i - 1]</math> be an entry in the "partial match" table described below, at position <math>i - 1</math>. Set <math>m = j - e</math>, and <math>i = e</math> if <math>e \geqslant 0</math>, or <math>m = j + 1</math>, <math>j = j + 1</math>, and <math>i = 0</math> if <math>e = -1</math>.
# Return to step 2.