Rabin–Karp algorithm: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 23:49, 31 March 2025 edit Cedar101 (talk \| contribs) Extended confirmed users 18,555 edits m →Multiple pattern search: italics correction, rm redunt links ← Previous edit		Latest revision as of 00:08, 1 April 2025 edit undo Cedar101 (talk \| contribs) Extended confirmed users 18,555 edits m →Hash function used: {{pre}}
(3 intermediate revisions by the same user not shown)
Line 62: For example, if the substring is "hi", the base is 256, and prime modulus is 101, then the hash value would be [(104 × 256 ) %{{efn\|name=mod}} 101 + 105] % 101 = 65 ([[ASCII]] of 'h' is 104 and of 'i' is 105) <sub> '%' is 'mod' or modulo, or remainder after integer division, operator </sub>▼ Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See [[hash function]] for a much more detailed discussion. The essential benefit achieved by using a [[rolling hash]] such as the Rabin fingerprint is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths. Line 72 ⟶ 71: hash("abr") = [ ( [ ( [ (97 × 256) % 101 + 98 ] % 101 ) × 256 ] % 101 ) + 114 ] % 101 = 4 We can then compute the hash of the next substring, "bra", from the hash of "abr" by subtracting the number added for the first 'a' of "abr", i.e. 97 × 256<sup>2</sup>, multiplying by the base and adding for the last a of "bra", i.e. 97 × 256<sup>0</sup>. Like so: {{pre\|style=font-size:95%\|1= // ''old hash (-ve avoider){{efn\|name=ua}} old 'a' left base offset base shift new 'a''' prime modulus hash("bra") = [ ( 4 + 101 - 97 [(256%101)256] % 101{{efn\|name=mod101}} ) 256 {{efn\|name=times256}} + 97 ] % 101 = 30 }} <sub> * (-ve avoider) = "underflow avoider". Necessary if using unsigned integers for calculations. Because we know all hashes <math>h \leq p</math> for prime modulus $p$, we can ensure no underflow by adding p to the old hash before subtracting the value corresponding to the old 'a' (mod p).</sub>▼ <sub> the last '* 256' is the shift of the subtracted hash to the left </sub>▼ <sub> although ((256%101)256)%101 is the same as 256<sup>2</sup> % 101, to avoid overflowing integer maximums when the pattern string is longer (e.g. 'Rabin-Karp' is 10 characters, 256<sup>9</sup> is the offset without modulation ), the pattern length base offset is pre-calculated in a loop, modulating the result each iteration </sub>▼ If we are matching the search string "bra", using similar calculation of hash("abr"), Line 90 ⟶ 82: If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes. Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together ASCII values of all characters so that shifting substring would only entail dividing the previous hash by the first character value, then multiplying by the new last character's value. The limitation, however, is the limited size of the integer [[data type]] and the necessity of using [[modular arithmetic]] to scale down the hash results~~, (see~~.{{efn\|See [[hash function]] article).}} Meanwhile, naive hash functions do not produce large numbers quickly, but, just like adding ASCII values, are likely to cause many [[hash collision]]s and hence slow down the algorithm. Hence the described hash function is typically the preferred one in the Rabin–Karp algorithm. == Multiple pattern search == Line 113 ⟶ 105: A naïve way to search for ''k'' patterns is to repeat a single-pattern search taking O(''n''+''m'') time, totaling in O((''n''+''m'')''k'') time. In contrast, the above algorithm can find all ''k'' patterns in O(''n''+''km'') expected time, assuming that a hash table check works in O(1) expected time. ==Notes== {{notelist\|refs= ▲~~<sub> '~~{{efn\|name=mod\|{{char\|%'}} is 'mod' or [[modulo]], or remainder after integer division, operator ~~</sub>~~.}} ▲<sub> {{efn\|name=ua\|1=(-ve avoider) = "underflow avoider". Necessary if using unsigned integers for calculations. Because we know all hashes <math>h \leq p</math> for prime modulus ${{mvar\|p$}}, we can ensure no underflow by adding {{mvar\|p}} to the old hash before subtracting the value corresponding to the old 'a' (mod {{mvar\|p}}).~~</sub>~~}} ▲~~<sub>~~ {{efn\|name=times256\|the last '{{code\|* 256'}} is the shift of the subtracted hash to the left ~~</sub>~~.}} ▲~~<sub>~~ {{efn\|name=mod101\|although {{code\|((256%101)*256)%101}} is the same as 256<sup>2</sup> %mod 101, to avoid overflowing integer maximums when the pattern string is longer (e.g. 'Rabin-Karp' is 10 characters, 256<sup>9</sup> is the offset without modulation ), the pattern length base offset is pre-calculated in a loop, modulating the result each iteration ~~</sub>~~.}} }} ==References==