{{Short description|String searching algorithm}}
{{no footnotes|date=September 2018}}
{{Infobox algorithm
|name = Rabin–Karp algorithm
|class = [[string-searching algorithm|String searching]]
|image = <!-- filename only, no "File:" or "Image:" prefix, and no enclosing [[brackets]] -->
|caption =
|data =
|time = <math>O(mn)</math> plus <math>O(m)</math> preprocessing time
|best-time =
|average-time = <math>O(n)</math>
|space = <math>O(1)</math>
}}
In [[computer science]], the '''Rabin–Karp algorithm''' or '''Karp–Rabin algorithm''' is a [[string-searching algorithm]] created by {{harvs|first1=Richard M.|last1=Karp|author1-link=Richard M. Karp|first2=Michael O.|last2=Rabin|author2-link=Michael O. Rabin|year=1987|txt}} that uses [[Hash function|hashing]] to find an exact match of a pattern string in a text. It uses a [[rolling hash]] to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions. Generalizations of the same idea can be used to find more than one match of a single pattern, or to find matches for more than one pattern.
 
To find a single match of a single pattern, the [[expected time]] of the algorithm is [[linear time|linear]] in the combined length of the pattern and text,
although its [[Worst-case complexity|worst-case time complexity]] is the product of the two lengths. To find multiple matches, the expected time is linear in the input lengths, plus the combined length of all the matches, which could be greater than linear. In contrast, the [[Aho–Corasick algorithm]] can find all matches of multiple patterns in worst-case time and space linear in the input length and the number of matches (instead of the total length of the matches).
 
A practical application of the algorithm is [[plagiarism detection|detecting plagiarism]]. Given source material, the algorithm can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.
 
==Overview==
A naive string matching algorithm compares the given pattern against all positions in the given text. Each comparison takes time proportional to the length of the pattern,
and the number of positions is proportional to the length of the text. Therefore, the worst-case time for such a method is proportional to the product of the two lengths.
In many practical cases, this time can be significantly reduced by cutting short the comparison at each position as soon as a mismatch is found, but this idea cannot guarantee any speedup.
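
For illustration (this snippet is not part of the algorithm's standard presentation; the function name and variables are arbitrary), the naive comparison can be written in Python as:
<syntaxhighlight lang="python">
def naive_search(text, pattern):
    """Return the first index where pattern occurs in text, or -1 if it does not."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # every candidate position in the text
        for j in range(m):              # compare character by character
            if text[i + j] != pattern[j]:
                break                   # mismatch: move on to the next position
        else:
            return i                    # all m characters matched
    return -1
</syntaxhighlight>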
 
Several string-matching algorithms, including the [[Knuth–Morris–Pratt algorithm]] and the [[Boyer–Moore string-search algorithm]], reduce the worst-case time for string matching by extracting more information from each mismatch, allowing them to skip over positions of the text that are guaranteed not to match the pattern. The Rabin–Karp algorithm instead achieves its speedup by using a [[hash function]] to quickly perform an approximate check for each position, and then only performing an exact comparison at the positions that pass this approximate check.
 
A hash function is a function which converts every string into a numeric value, called its ''hash value''; for example, we might have <code>hash("hello")=5</code>. If two strings are equal, their hash values are also equal. For a well-designed hash function, the inverse is true, in an approximate sense: strings that are unequal are very unlikely to have equal hash values. The Rabin–Karp algorithm proceeds by computing, at each position of the text, the hash value of a string starting at that position with the same length as the pattern. If this hash value equals the hash value of the pattern, it performs a full comparison at that position.
 
In order for this to work well, the hash function should be selected randomly from a family of hash functions that are unlikely to produce many [[false positive]]s, that is, positions of the text which have the same hash value as the pattern but do not actually match the pattern. These positions contribute to the running time of the algorithm unnecessarily, without producing a match. Additionally, the hash function used should be a [[rolling hash]], a hash function whose value can be quickly updated from each position of the text to the next. Recomputing the hash function from scratch at each position would be too slow.
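
One common way to obtain such a family, shown here as an illustrative Python sketch (the fixed prime and the randomly chosen base are assumptions for the example, not values prescribed by the algorithm), is a polynomial hash whose base is drawn at random:
<syntaxhighlight lang="python">
import random

PRIME = 1_000_000_007                  # a fixed large prime modulus (illustrative choice)

def make_hash():
    """Pick a hash function at random from a family of polynomial hashes."""
    base = random.randrange(2, PRIME)  # a random base makes engineered collisions unlikely
    def h(s):
        value = 0
        for ch in s:
            value = (value * base + ord(ch)) % PRIME
        return value
    return h
</syntaxhighlight>
Choosing the base at random makes it unlikely that any particular input produces an unusually large number of false positives.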
 
==The algorithm==
 
The algorithm is as shown:
 
<syntaxhighlight lang="php" line highlight="7">
function RabinKarp(string s[1..n], string pattern[1..m])
    hpattern := hash(pattern[1..m]);
    for i from 1 to n-m+1
        hs := hash(s[i..i+m-1])
        if hs = hpattern
            if s[i..i+m-1] = pattern[1..m]
                return i
    return not found
</syntaxhighlight>
 
Lines 2, 4, and 6 each require [[Big-O notation|O]](''m'') time. However, line 2 is only executed once, and line 6 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 5 is executed O(''n'') times, but each comparison only requires constant time, so its impact is O(''n''). The issue is line 4.
 
Naively computing the hash value for the substring <code>s[i+1..i+m]</code> requires [[Big-O notation|O]](''m'') time because each character is examined. Since the hash computation is done on each loop, the algorithm with a naive hash computation requires O(''mn'') time, the same complexity as a straightforward string matching algorithm. For speed, the hash must be computed in constant time. The trick is that the variable <code>hs</code> already contains the previous hash value of <code>s[i..i+m-1]</code>. If that value can be used to compute the next hash value in constant time, then computing successive hash values will be fast.
 
The trick can be exploited using a [[rolling hash]]. A rolling hash is a hash function specially designed to enable this operation. A trivial (but not very good) rolling hash function just adds the values of each character in the substring. This rolling hash formula can compute the next hash value from the previous value in constant time:
<pre>
s[i+1..i+m] = s[i..i+m-1] - s[i] + s[i+m]
</pre>
This simple function works, but will result in line 6 being executed more often than it would be with the more sophisticated rolling hash functions discussed in the next section.
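
The constant-time update can be illustrated with a short Python sketch (the function name is illustrative; this additive hash is deliberately simple and not recommended in practice):
<syntaxhighlight lang="python">
def additive_hashes(s, m):
    """Yield the additive hash of every length-m window of s in O(len(s)) total time."""
    h = sum(ord(c) for c in s[:m])        # hash of the first window, O(m)
    yield h
    for i in range(len(s) - m):
        # Constant-time update: drop the leftmost character, add the new rightmost one.
        h = h - ord(s[i]) + ord(s[i + m])
        yield h
</syntaxhighlight>
Each window after the first is hashed with only one subtraction and one addition, regardless of ''m''.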
 
Good performance requires a good hashing function for the encountered data. If the hashing is poor (such as producing the same hash value for every input), then line 6 would be executed O(''n'') times, i.e. on every iteration of the loop. Because character-by-character comparison of strings with length ''m'' takes O(''m'') time, the whole algorithm then takes a worst-case O(''mn'') time.
 
== Hash function used ==
{{main | Rabin fingerprint}}
The key to the Rabin–Karp algorithm's performance is the efficient computation of [[hash value]]s of the successive substrings of the text. The [[Rabin fingerprint]] is a popular and effective rolling hash function. The hash function described here is not a Rabin fingerprint, but it works equally well. It treats every substring as a number in some base, the base being usually the size of the character set.
 
For example, if the substring is "hi", the base is 256, and prime modulus is 101, then the hash value would be
[(104 &times; 256 ) %{{efn|name=mod}} 101 + 105] % 101 = 65
([[ASCII]] of 'h' is 104 and of 'i' is 105)
 
Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See [[hash function]] for a much more detailed discussion. The essential benefit achieved by using a [[rolling hash]] such as the Rabin fingerprint is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.
 
For example, if we have text "abracadabra" and we are searching for a pattern of length 3, the hash of the first substring, "abr", using 256 as the base, and 101 as the prime modulus is:
// ASCII a = 97, b = 98, r = 114.
hash("abr") = [ ( [ ( [ (97 &times; 256) % 101 + 98 ] % 101 ) &times; 256 ] % 101 ) + 114 ] % 101 = 4
 
We can then compute the hash of the next substring, "bra", from the hash of "abr" by subtracting the number added for the first 'a' of "abr", i.e. 97 &times; 256<sup>2</sup>, multiplying by the base and adding for the last a of "bra", i.e. 97 &times; 256<sup>0</sup>. Like so:
{{pre|style=font-size:95%|1=
// ''old hash (-ve avoider){{efn|name=ua}} old 'a' left base offset base shift new 'a''' prime modulus
hash("bra") = [ ( 4 + 101 - 97 * [(256%101)*256] % 101{{efn|name=mod101}} ) * 256{{efn|name=times256}} + 97 ] % 101 = 30
}}
If we are matching against the search pattern "bra", a calculation similar to that of hash("abr") gives
 
hash'("bra") = [ ( [ ( [ ( 98 &times; 256) %101 + 114] % 101 ) &times; 256 ] % 101) + 97 ] % 101 = 30
 
If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.
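
The worked example above can be checked with a short Python sketch (an illustrative verification using the article's base 256 and prime modulus 101; function and variable names are arbitrary):
<syntaxhighlight lang="python">
BASE, MOD = 256, 101

def poly_hash(s):
    """Polynomial hash of s, as used in the worked example."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

h_abr = poly_hash("abr")                                             # 4
# Roll the window one position: remove the leading 'a', append the trailing 'a'.
h_bra = ((h_abr + MOD - ord('a') * pow(BASE, 2, MOD) % MOD) * BASE + ord('a')) % MOD
assert h_abr == 4 and h_bra == poly_hash("bra") == 30
</syntaxhighlight>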
 
Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together the ASCII values of all characters so that shifting the substring would only entail dividing the previous hash by the first character value, then multiplying by the new last character's value. The limitation, however, is the limited size of the integer [[data type]] and the necessity of using [[modular arithmetic]] to scale down the hash results.{{efn|See [[hash function]] article.}} Meanwhile, naive hash functions that do not produce large numbers quickly, such as just adding ASCII values, are likely to cause many [[hash collision]]s and hence slow down the algorithm. Hence the described hash function is typically the preferred one in the Rabin–Karp algorithm.
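
Combining the rolling polynomial hash with the earlier pseudocode gives the following illustrative Python sketch of single-pattern search (the default base and modulus follow the worked example; a much larger prime modulus would normally be chosen to keep false positives rare):
<syntaxhighlight lang="python">
def rabin_karp(text, pattern, base=256, mod=101):
    """Return the first index of pattern in text, or -1 if absent (illustrative sketch)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0 if m == 0 else -1
    high = pow(base, m - 1, mod)              # base^(m-1) mod p, used to drop the left character
    hpattern = hs = 0
    for j in range(m):                        # hash the pattern and the first window
        hpattern = (hpattern * base + ord(pattern[j])) % mod
        hs = (hs * base + ord(text[j])) % mod
    for i in range(n - m + 1):
        # Only compare characters when the cheap hash check passes.
        if hs == hpattern and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Rolling update: remove text[i], shift left by one position, append text[i+m].
            hs = ((hs - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1
</syntaxhighlight>
For instance, <code>rabin_karp("abracadabra", "bra")</code> returns 1.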
 
== Multiple pattern search ==
The Rabin–Karp algorithm is inferior for single-pattern searching to the [[Knuth–Morris–Pratt algorithm]], the [[Boyer–Moore string-search algorithm]], and other faster single-pattern [[string searching algorithm]]s because of its slow worst-case behavior. However, it is a useful algorithm for [[String searching algorithm#Algorithms using a finite set of patterns|multiple pattern search]].
 
To find any of a large number, say ''k'', fixed length patterns in a text, a simple variant of the Rabin–Karp algorithm uses a [[Bloom filter]] or a [[set data structure]] to check whether the hash of a given string belongs to a set of hash values of patterns we are looking for:
 
<syntaxhighlight lang="php" line>
function RabinKarpSet(string s[1..n], set of string subs, m):
    set hsubs := emptySet
    foreach sub in subs
        insert hash(sub[1..m]) into hsubs
    hs := hash(s[1..m])
    for i from 1 to n-m+1
        if hs ∈ hsubs and s[i..i+m-1] ∈ subs
            return i
        hs := hash(s[i+1..i+m])
    return not found
</syntaxhighlight>
 
We assume all the substrings have a fixed length ''m''.
 
A naïve way to search for ''k'' patterns is to repeat a single-pattern search taking O(''n''+''m'') time, totaling in O((''n''+''m'')''k'') time. In contrast, the above algorithm can find all ''k'' patterns in O(''n''+''km'') expected time, assuming that a hash table check works in O(1) expected time.
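
An illustrative Python sketch of this multiple-pattern variant, using a built-in set for the hash lookup (the function name, parameter names, and modulus are assumptions for the example):
<syntaxhighlight lang="python">
def rabin_karp_set(text, patterns, m, base=256, mod=1_000_003):
    """Return the first index where any length-m pattern from `patterns` occurs in text."""
    def window_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    n = len(text)
    if m == 0 or m > n:
        return -1
    hsubs = {window_hash(p) for p in patterns}     # hashes of all sought patterns
    subs = set(patterns)
    high = pow(base, m - 1, mod)
    hs = window_hash(text[:m])
    for i in range(n - m + 1):
        if hs in hsubs and text[i:i + m] in subs:  # cheap filter, then exact check
            return i
        if i < n - m:
            hs = ((hs - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1
</syntaxhighlight>
For example, <code>rabin_karp_set("hello sunshine", {"sun", "usn"}, 3)</code> returns 6.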
 
==Notes==
{{notelist|refs=
{{efn|name=mod|{{char|%}} is 'mod' or [[modulo]], or remainder after integer division, operator.}}
{{efn|name=ua|1=(-ve avoider) = "underflow avoider". Necessary if using unsigned integers for calculations. Because we know all hashes <math>h \leq p</math> for prime modulus {{mvar|p}}, we can ensure no underflow by adding {{mvar|p}} to the old hash before subtracting the value corresponding to the old 'a' (mod {{mvar|p}}).}}
{{efn|name=times256|the last {{code|* 256}} is the shift of the subtracted hash to the left.}}
{{efn|name=mod101|although {{code|((256%101)*256)%101}} is the same as 256<sup>2</sup> mod 101, to avoid overflowing integer maximums when the pattern string is longer (e.g. 'Rabin-Karp' is 10 characters, 256<sup>9</sup> is the offset without modulation ), the pattern length base offset is pre-calculated in a loop, modulating the result each iteration.}}
}}
 
==References==
{{Reflist}}
 
===Sources===
* {{cite book| first1 = K. Selçuk | last1 = Candan | first2 = Maria Luisa | last2 = Sapino|title=Data Management for Multimedia Retrieval|url=https://books.google.com/books?id=Uk9tyXgQME8C&pg=PA205| date = 2010|publisher=Cambridge University Press|isbn=978-0-521-88739-7|pages=205–206}} (for the Bloom filter extension)
* {{Cite book | last1 = Cormen | first1 = Thomas H. | author-link1 = Thomas H. Cormen | author-link2 = Charles E. Leiserson | last2 = Leiserson | first2 = Charles E. | author-link3 = Ronald L. Rivest | last3 = Rivest | first3 = Ronald L. | author-link4 = Clifford Stein | last4 = Stein | first4 = Clifford |title=[[Introduction to Algorithms]] |orig-year=1990 |edition=2nd |date=2001-09-01 |publisher=MIT Press |___location=[[Cambridge, Massachusetts]] |isbn=978-0-262-03293-3 |pages=911–916 |chapter=The Rabin–Karp algorithm}}
* {{Cite journal |last1=Karp|first1= Richard M. | author-link=Richard Karp | last2=Rabin|first2=Michael O.|author2-link=Michael O. Rabin | title=Efficient randomized pattern-matching algorithms |date=March 1987 |journal=IBM Journal of Research and Development |volume=31 |issue=2|pages=249–260|doi=10.1147/rd.312.0249 |citeseerx = 10.1.1.86.9502}}
 
==External links==
*{{cite web | url = http://courses.csail.mit.edu/6.006/spring11/rec/rec06.pdf | work = MIT 6.006: Introduction to Algorithms 2011- Lecture Notes | title = Rabin–Karp Algorithm/Rolling Hash | publisher = MIT }}
 
{{Strings}}
 
{{DEFAULTSORT:Rabin-Karp String Search Algorithm}}
[[Category:Algorithms on strings]]
[[Category:String matching algorithms]]
[[Category:Hashing]]