Rabin–Karp algorithm: Difference between revisions

Content deleted Content added
Dcoetzee (talk | contribs)
Link original paper
Topbanana (talk | contribs)
m Spelling; is -> it
Line 8:
 
== Hash function used ==
The key to Rabin-Karp performance is the efficient computation of [[hash value]]s of the successive substrings of the text. Rabin-Karp achieves this by treating every substring as a number in some base, the base being usually a big [[prime]]. For example, if the substring is "hi" and the base is 101, the hash value would be 104 * 101^1 + 105 * 101^0 = 10609 ([[ASCII]] of 'h' is 104 and of 'i' is 105). Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See [[hash function]] for much more detailed discussion. The essential benefit achieved by such representation is that isit is possible to compute the hash value of the next substring from the previous one by doing only a finite number of operations, independent of the substrings' length. For example, if we have text "abracadabra" and we are searching for a pattern of length 3, we can compute the hash of "bra" from the hash for "abr" (the previous substring) by subtracting the number added for the first 'a' of "abr", i.e. 97 * 101^2 (97 is ASCII for 'a' and 101 is the base we are using) and adding that for the last a of "bra", i.e. 97 * 101^0 = 97. If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes. Theoretically, there exist other algorithms that could provide convenient recompution, e.g. multiplying together ASCII values of all characters so that shifting substring would only entail dividing by the first character and multiplying by the last. The limitation, however, is the limited of the size of integer [[data type]] and the necessity of using [[modular arithmetic]] to scale down the hash results, for which see [[hash function]] article; meanwhile, those naive hash functions that would not produce large numbers quickly, like just adding ASCII values, are likely to cause many [[hash collision]]s and hence slow down the algorithm. Hence the number in arbitrary base hash function is the preferred one in Rabin-Karp.
 
== Rabin-Karp and multiple pattern search ==