Revision as of 07:02, 14 November 2004 edit Dcoetzee (talk \| contribs) 37,529 edits Lots more details and paragraphing and explanation and pseudocode ← Previous edit		Revision as of 07:10, 14 November 2004 edit undo Dcoetzee (talk \| contribs) 37,529 edits More details Next edit →
Line 4: == Shifting substrings search and competing algorithms == The basic problem the algorithm addresses is finding a ~~pattern (~~fixed substring) of length ''m'' within a text of length ''n,''; for example, finding the string "sun" in the sentence "Hello sunshine in this vale of tears." ByThe ~~way~~simplest ~~of comparison of the algorithms, a naive brute force string searching~~conceivable algorithm ~~involves~~for ~~aligning~~this ~~the~~task ~~first~~just ~~character~~looks offor the ~~pattern with the first character of the text and checking if the pattern matches the initial~~ substring ofat equivalent length. If it does not match, the pattern is shifted right by 1 and the procedure is repeated, for a worst case running time of m*n. [[Knuth-Morris-Pratt algorithm]] is an elegant elaboration on this naive algorithm that uses precomputed data to skip forward not by 1 character, but by as many asall possible ~~for the search to succeed.~~positions: '''1''' '''function''' NaiveSearch(''string'' s[1..n], ''string'' sub[1..m]) '''2''' '''for''' i '''from''' 1 '''to''' n '''3''' '''for''' j '''from''' 1 '''to''' m '''4''' '''if''' s[i+j-1] ≠ sub[j] '''5''' jump to next iteration of outer loop '''6''' '''return''' i '''7''' '''return''' not found This algorithm works well in many practical cases, but fails spectacularly on certain examples, such as searching for a string of 10,000 "a"s in a string of 10 million "a"s, in which case it exhibits its worst-case [[Big-O notation\|Ω]](''mn'') time. [[Knuth-Morris-Pratt algorithm]] is an elegant elaboration on this naive algorithm that uses precomputed data to skip forward not by 1 character, but by as many as possible for the search to succeed. However, Rabin-Karp focuses instead on speeding up lines 3-6, as we'll discuss in the next section. == Use of hashing for shifting substring search == ~~Rabin-Karp algorithm takes a very different approach.~~ Rather than pursuing more sophisticated skipping, itthe Rabin-Karp algorithm seeks to speed up the testing of equality of the pattern to the substrings in the text by using a [[hash function]]. A hash function is a function which converts every string into a numeric value, called its ''hash value''; for example, we might have hash("hello")=5. Rabin-Karp exploits the fact that if two strings are equal, their hash values are also equal. Thus, it would seem all we have to do is compute the hash value of the substring we're searching for, and then look for a substring with the same hash value. However, there are two problems with this. First, because there are so many different strings, to keep the hash values small we have to assign some strings the same number. This means that if the hash values match, the strings might not match; we have to verify that they do, which can take a long time for long substrings. Luckily, a good hash function promises us that on most reasonable inputs, this won't happen too often, which keeps the average search time good. Line 14 ⟶ 24: {{wikicode}} '''1''' '''function''' RabinKarp(''string'' s[1..n], ''string'' sub[1..m]) { '''2''' hsub := hash(sub[1..m]) '''3''' hs := hash(s[1..m]) Line 22 ⟶ 32: '''7''' '''return''' i '''8''' hs := hash(s[i+1..i+m]) '''9''' } '''return''' not found Lines 2, 3, and 6 each require [[Big-O notation\|Ω]](m) time. However, lines 2 and 3 are only executed once, and line 6 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 5 is executed ''n'' times, but only requires constant time. We now come to the second problem: line 8. Line 58 ⟶ 68: '''return''' i hs := hash(s[i+1..i+m]) '''return''' not found }

Rabin–Karp algorithm: Difference between revisions