Rabin–Karp algorithm: Difference between revisions

Content deleted Content added
Dcoetzee (talk | contribs)
Lots more details and paragraphing and explanation and pseudocode
Dcoetzee (talk | contribs)
More details
Line 4:
 
== Shifting substrings search and competing algorithms ==
The basic problem the algorithm addresses is finding a pattern (fixed substring) of length ''m'' within a text of length ''n,''; for example, finding the string "sun" in the sentence "Hello sunshine in this vale of tears." ByThe waysimplest of comparison of the algorithms, a naive brute force string searchingconceivable algorithm involvesfor aligningthis thetask firstjust characterlooks offor the pattern with the first character of the text and checking if the pattern matches the initial substring ofat equivalent length. If it does not match, the pattern is shifted right by 1 and the procedure is repeated, for a worst case running time of m*n. [[Knuth-Morris-Pratt algorithm]] is an elegant elaboration on this naive algorithm that uses precomputed data to skip forward not by 1 character, but by as many asall possible for the search to succeed.positions:
'''1''' '''function''' NaiveSearch(''string'' s[1..n], ''string'' sub[1..m])
'''2''' '''for''' i '''from''' 1 '''to''' n
'''3''' '''for''' j '''from''' 1 '''to''' m
'''4''' '''if''' s[i+j-1] ≠ sub[j]
'''5''' jump to next iteration of outer loop
'''6''' '''return''' i
'''7''' '''return''' not found
This algorithm works well in many practical cases, but fails spectacularly on certain examples, such as searching for a string of 10,000 "a"s in a string of 10 million "a"s, in which case it exhibits its worst-case [[Big-O notation|Ω]](''mn'') time.
 
[[Knuth-Morris-Pratt algorithm]] is an elegant elaboration on this naive algorithm that uses precomputed data to skip forward not by 1 character, but by as many as possible for the search to succeed. However, Rabin-Karp focuses instead on speeding up lines 3-6, as we'll discuss in the next section.
 
== Use of hashing for shifting substring search ==
Rabin-Karp algorithm takes a very different approach. Rather than pursuing more sophisticated skipping, itthe Rabin-Karp algorithm seeks to speed up the testing of equality of the pattern to the substrings in the text by using a [[hash function]]. A hash function is a function which converts every string into a numeric value, called its ''hash value''; for example, we might have hash("hello")=5. Rabin-Karp exploits the fact that if two strings are equal, their hash values are also equal. Thus, it would seem all we have to do is compute the hash value of the substring we're searching for, and then look for a substring with the same hash value.
 
However, there are two problems with this. First, because there are so many different strings, to keep the hash values small we have to assign some strings the same number. This means that if the hash values match, the strings might not match; we have to verify that they do, which can take a long time for long substrings. Luckily, a good hash function promises us that on most reasonable inputs, this won't happen too often, which keeps the average search time good.
Line 14 ⟶ 24:
 
{{wikicode}}
'''1''' '''function''' RabinKarp(''string'' s[1..n], ''string'' sub[1..m]) {
'''2''' hsub := hash(sub[1..m])
'''3''' hs := hash(s[1..m])
Line 22 ⟶ 32:
'''7''' '''return''' i
'''8''' hs := hash(s[i+1..i+m])
'''9''' } '''return''' not found
 
Lines 2, 3, and 6 each require [[Big-O notation|Ω]](m) time. However, lines 2 and 3 are only executed once, and line 6 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 5 is executed ''n'' times, but only requires constant time. We now come to the second problem: line 8.
Line 58 ⟶ 68:
'''return''' i
hs := hash(s[i+1..i+m])
'''return''' not found
}