Content deleted Content added
m →Multiple pattern search: italics correction, rm redunt links |
m →Hash function used: {{pre}} |
||
(3 intermediate revisions by the same user not shown) | |||
Line 62:
For example, if the substring is "hi", the base is 256, and prime modulus is 101, then the hash value would be
[(104 × 256 ) %{{efn|name=mod}} 101 + 105] % 101 = 65
([[ASCII]] of 'h' is 104 and of 'i' is 105)
<sub> '%' is 'mod' or modulo, or remainder after integer division, operator </sub>▼
Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See [[hash function]] for a much more detailed discussion. The essential benefit achieved by using a [[rolling hash]] such as the Rabin fingerprint is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.
Line 72 ⟶ 71:
hash("abr") = [ ( [ ( [ (97 × 256) % 101 + 98 ] % 101 ) × 256 ] % 101 ) + 114 ] % 101 = 4
We can then compute the hash of the next substring, "bra", from the hash of "abr" by subtracting the number added for the first 'a' of "abr", i.e. 97 × 256<sup>2</sup>, multiplying by the base and adding for the last a of "bra", i.e. 97 × 256<sup>0</sup>. Like so:
{{pre|style=font-size:95%|1=
}}
<sub> * (-ve avoider) = "underflow avoider". Necessary if using unsigned integers for calculations. Because we know all hashes <math>h \leq p</math> for prime modulus $p$, we can ensure no underflow by adding p to the old hash before subtracting the value corresponding to the old 'a' (mod p).</sub>▼
<sub> the last '* 256' is the shift of the subtracted hash to the left </sub>▼
<sub> although ((256%101)*256)%101 is the same as 256<sup>2</sup> % 101, to avoid overflowing integer maximums when the pattern string is longer (e.g. 'Rabin-Karp' is 10 characters, 256<sup>9</sup> is the offset without modulation ), the pattern length base offset is pre-calculated in a loop, modulating the result each iteration </sub>▼
If we are matching the search string "bra", using similar calculation of hash("abr"),
Line 90 ⟶ 82:
If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.
Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together ASCII values of all characters so that shifting substring would only entail dividing the previous hash by the first character value, then multiplying by the new last character's value. The limitation, however, is the limited size of the integer [[data type]] and the necessity of using [[modular arithmetic]] to scale down the hash results
== Multiple pattern search ==
Line 113 ⟶ 105:
A naïve way to search for ''k'' patterns is to repeat a single-pattern search taking O(''n''+''m'') time, totaling in O((''n''+''m'')''k'') time. In contrast, the above algorithm can find all ''k'' patterns in O(''n''+''km'') expected time, assuming that a hash table check works in O(1) expected time.
==Notes==
{{notelist|refs=
▲
▲
▲
▲
}}
==References==
|