Rabin–Karp algorithm: Difference between revisions

Content deleted Content added
Dcoetzee (talk | contribs)
m Hash function used: ^ -> sup tags
Dcoetzee (talk | contribs)
Line 47:
 
== Hash function used ==
The key to Rabin-Karp performance is the efficient computation of [[hash value]]s of the successive substrings of the text. Rabin-Karp achieves this by treating every substring as a number in some base, the base being usually a big [[prime]]. For example, if the substring is "hi" and the base is 101, the hash value would be 104 *&times; 101<sup>1</sup> + 105 *&times; 101<sup>0</sup> = 10609 ([[ASCII]] of 'h' is 104 and of 'i' is 105).
 
Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See [[hash function]] for a much more detailed discussion. The essential benefit achieved by such representation is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.
 
For example, if we have text "abracadabra" and we are searching for a pattern of length 3, we can compute the hash of "bra" from the hash for "abr" (the previous substring) by subtracting the number added for the first 'a' of "abr", i.e. 97 *&times; 101<sup>2</sup> (97 is ASCII for 'a' and 101 is the base we are using), multiplying by the base and adding for the last a of "bra", i.e. 97 *&times; 101<sup>0</sup> = 97. If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.
 
Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together ASCII values of all characters so that shifting substring would only entail dividing by the first character and multiplying by the last. The limitation, however, is the limited of the size of integer [[data type]] and the necessity of using [[modular arithmetic]] to scale down the hash results, for which see [[hash function]] article; meanwhile, those naive hash functions that would not produce large numbers quickly, like just adding ASCII values, are likely to cause many [[hash collision]]s and hence slow down the algorithm. Hence the described hash function is typically the preferred one in Rabin-Karp.