Rabin–Karp algorithm: Difference between revisions

Content deleted Content added
m dash, replaced: Boyer–Moore string search algorithm → Boyer–Moore string-search algorithm
 
(11 intermediate revisions by 7 users not shown)
Line 47:
Lines 2, 4, and 6 each require [[Big-O notation|O]](''m'') time. However, line 2 is only executed once, and line 6 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 5 is executed O(''n'') times, but each comparison only requires constant time, so its impact is O(''n''). The issue is line 4.
 
Naively computing the hash value for the substring <code>s[i+1..i+m]</code> requires [[Big-O notation|O]](''m'') time because each character is examined. Since the hash computation is done on each loop, the algorithm with a naive hash computation requires [[Big-O notation|O]](''mn'') time, the same complexity as a straightforward string matching algorithm. For speed, the hash must be computed in constant time. The trick is the variable <code>hs</code> already contains the previous hash value of <code>s[i..i+m-1]</code>. If that value can be used to compute the next hash value in constant time, then computing successive hash values will be fast.
 
The trick can be exploited using a [[rolling hash]]. A rolling hash is a hash function specially designed to enable this operation. A trivial (but not very good) rolling hash function just adds the values of each character in the substring. This rolling hash formula can compute the next hash value from the previous value in constant time:
Line 55:
This simple function works, but will result in statement 5 being executed more often than other more sophisticated rolling hash functions such as those discussed in the next section.
 
Good performance requires a good hashing function for the encountered data. If the hashing is poor (such as producing the same hash value for every input), then line 6 would be executed [[Big-O notation|O]](''n'') times (i.e. on every iteration of the loop). Because character-by-character comparison of strings with length ''m'' takes [[Big-O notation|O]](''m'') time, the whole algorithm then takes a worst-case [[Big-O notation|O]](''mn'') time.
 
== Hash function used ==
Line 62:
 
For example, if the substring is "hi", the base is 256, and prime modulus is 101, then the hash value would be
[(104 &times; 256 ) %{{efn|name=mod}} 101 + 105] % 101 = 65
([[ASCII]] of 'h' is 104 and of 'i' is 105)
<sub> '%' is 'mod' or modulo, or remainder after integer division, operator </sub>
 
Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See [[hash function]] for a much more detailed discussion. The essential benefit achieved by using a [[rolling hash]] such as the Rabin fingerprint is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.
Line 72 ⟶ 71:
hash("abr") = [ ( [ ( [ (97 &times; 256) % 101 + 98 ] % 101 ) &times; 256 ] % 101 ) + 114 ] % 101 = 4
 
We can then compute the hash of the next substring, "bra", from the hash of "abr" by subtracting the number added for the first 'a' of "abr", i.e. 97 &times; 256<sup>2</sup>, multiplying by the base and adding for the last a of "bra", i.e. 97 &times; 256<sup>0</sup>. Like so:
{{pre|style=font-size:95%|1=
 
// ''old hash (-ve avoider)*{{efn|name=ua}} old 'a' left base offset base shift new 'a''' prime modulus
hash("bra") = [ ( 4 + 101 - 97 * [(256%101)*256] % 101{{efn|name=mod101}} ) * 256 {{efn|name=times256}} + 97 ] % 101 = 30
}}
 
<sub> * (-ve avoider) = "underflow avoider". Necessary if using unsigned integers for calculations. Because we know all hashes <math>h \leq p</math> for prime modulus $p$, we can ensure no underflow by adding p to the old hash before subtracting the value corresponding to the old 'a' (mod p).</sub>
 
<sub> the last '* 256' is the shift of the subtracted hash to the left </sub>
 
<sub> although ((256%101)*256)%101 is the same as 256<sup>2</sup> % 101, to avoid overflowing integer maximums when the pattern string is longer (e.g. 'Rabin-Karp' is 10 characters, 256<sup>9</sup> is the offset without modulation ), the pattern length base offset is pre-calculated in a loop, modulating the result each iteration </sub>
 
If we are matching the search string "bra", using similar calculation of hash("abr"),
 
Line 90 ⟶ 82:
If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.
 
Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together ASCII values of all characters so that shifting substring would only entail dividing the previous hash by the first character value, then multiplying by the new last character's value. The limitation, however, is the limited size of the integer [[data type]] and the necessity of using [[modular arithmetic]] to scale down the hash results, (see.{{efn|See [[hash function]] article).}} Meanwhile, naive hash functions do not produce large numbers quickly, but, just like adding ASCII values, are likely to cause many [[hash collision]]s and hence slow down the algorithm. Hence the described hash function is typically the preferred one in the Rabin–Karp algorithm.
 
== Multiple pattern search ==
Line 112 ⟶ 104:
We assume all the substrings have a fixed length ''m''.
 
A naïve way to search for ''k'' patterns is to repeat a single-pattern search taking [[Big-O notation|O]](''n''+''m'') time, totaling in [[Big-O notation|O]]((''(n''+''m'')''k'') time. In contrast, the above algorithm can find all ''k'' patterns in [[Big-O notation|O]](''n''+''km'') expected time, assuming that a hash table check works in [[Big-O notation|O]](1) expected time.
 
==Notes==
{{notelist|refs=
<sub> '{{efn|name=mod|{{char|%'}} is 'mod' or [[modulo]], or remainder after integer division, operator </sub>.}}
<sub> * {{efn|name=ua|1=(-ve avoider) = "underflow avoider". Necessary if using unsigned integers for calculations. Because we know all hashes <math>h \leq p</math> for prime modulus ${{mvar|p$}}, we can ensure no underflow by adding {{mvar|p}} to the old hash before subtracting the value corresponding to the old 'a' (mod {{mvar|p}}).</sub>}}
<sub> {{efn|name=times256|the last '{{code|* 256'}} is the shift of the subtracted hash to the left </sub>.}}
<sub> {{efn|name=mod101|although {{code|((256%101)*256)%101}} is the same as 256<sup>2</sup> %mod 101, to avoid overflowing integer maximums when the pattern string is longer (e.g. 'Rabin-Karp' is 10 characters, 256<sup>9</sup> is the offset without modulation ), the pattern length base offset is pre-calculated in a loop, modulating the result each iteration </sub>.}}
}}
 
==References==
{{Reflist}}
 
===Sources===
{{Reflist}}
* {{cite book| first1 = K. Selçuk | last1 = Candan | first2 = Maria Luisa | last2 = Sapino|title=Data Management for Multimedia Retrieval|url=https://books.google.com/books?id=Uk9tyXgQME8C&pg=PA205| date = 2010|publisher=Cambridge University Press|isbn=978-0-521-88739-7|pages=205–206}} (for the Bloom filter extension)
* {{Cite book | last1 = Cormen | first1 = Thomas H. | author-link1 = Thomas H. Cormen | author-link2 = Charles E. Leiserson | last2 = Leiserson | first2 = Charles E. | author-link3 = Ronald L. Rivest | last3 = Rivest | first3 = Ronald L. | author-link4 = Clifford Stein | last4 = Stein | first4 = Clifford |title=[[Introduction to Algorithms]] |orig-year=1990 |edition=2nd |date=2001-09-01 |publisher=MIT Press |___location=[[Cambridge, Massachusetts]] |isbn=978-0-262-03293-3 |pages=911–916 |chapter=The Rabin–Karp algorithm}}
* {{Cite journal |last1=Karp|first1= Richard M. | author-link=Richard Karp | last2=Rabin|first2=Michael O.|author2-link=Michael O. Rabin | title=Efficient randomized pattern-matching algorithms |date=March 1987 |journal=IBM Journal of Research and Development |volume=31 |issue=2|pages=249–260|doi=10.1147/rd.312.0249 |citeseerx = 10.1.1.86.9502}}
* {{Cite book | last1 = Cormen | first1 = Thomas H. | author-link1 = Thomas H. Cormen | author-link2 = Charles E. Leiserson | last2 = Leiserson | first2 = Charles E. | author-link3 = Ronald L. Rivest | last3 = Rivest | first3 = Ronald L. | author-link4 = Clifford Stein | last4 = Stein | first4 = Clifford |title=[[Introduction to Algorithms]] |orig-year=1990 |edition=2nd |date=2001-09-01 |publisher=MIT Press |___location=[[Cambridge, Massachusetts]] |isbn=978-0-262-03293-3 |pages=911–916 |chapter=The Rabin–Karp algorithm}}
* {{cite book| first1 = K. Selçuk | last1 = Candan | first2 = Maria Luisa | last2 = Sapino|title=Data Management for Multimedia Retrieval|url=https://books.google.com/books?id=Uk9tyXgQME8C&pg=PA205| date = 2010|publisher=Cambridge University Press|isbn=978-0-521-88739-7|pages=205–206}} (for the Bloom filter extension)
* [https://yurichev.com/news/20210205_rolling_hash/ Yet another explanation]
 
==External links==