Rabin–Karp algorithm
 
== Use of hashing for shifting substring search ==
The Rabin-Karp algorithm takes a very different approach. Rather than pursuing more sophisticated skipping, it seeks to speed up the testing of equality of the pattern to the substrings in the text by using a [[hash function]]. A hash function is a function which converts every string into a numeric value, called its ''hash value''; for example, we might have hash("hello")=5. Rabin-Karp exploits the fact that if two strings are equal, their hash values are also equal. Thus, it would seem all we have to do is compute the hash value of the substring we're searching for, and then look for a substring with the same hash value.
 
However, there are two problems with this. First, because there are so many different strings, to keep the hash values small we have to assign some strings the same number. This means that if the hash values match, the strings might not match; we have to verify that they do, which can take a long time for long substrings. Luckily, a good hash function promises us that on most reasonable inputs, this won't happen too often, which keeps the average search time good.
 
The algorithm is as shown:
 
{{wikicode}}
'''1''' '''function''' RabinKarp(''string'' s[1..n], ''string'' sub[1..m]) {
'''2'''     hsub := hash(sub[1..m])
'''3'''     hs := hash(s[1..m])
'''4'''     '''for''' i '''from''' 1 '''to''' n-m+1
'''5'''         '''if''' hs = hsub
'''6'''             '''if''' s[i..i+m-1] = sub
'''7'''                 '''return''' i
'''8'''         hs := hash(s[i+1..i+m])
'''9''' }
 
Lines 2, 3, and 6 each require [[Big-O notation|&Omega;]](''m'') time. However, lines 2 and 3 are executed only once, and line 6 is executed only when the hash values match, which is unlikely to happen more than a few times. Line 5 is executed O(''n'') times, but only requires constant time. We now come to the second problem: line 8.
 
If we naively recompute the hash value for the substring <code>s[i+1..i+m]</code>, this would require [[Big-O notation|&Omega;]](''m'') time, and since this is done on each loop, the algorithm would require &Omega;(mn) time, the same as the most naive algorithms. The trick to solving this is to note that the variable <code>hs</code> already contains the hash value of <code>s[i..i+m-1]</code>. If we can use this to compute the next hash value in constant time, then our problem will be solved.
 
We do this using what is called a [[rolling hash]]. A rolling hash is a hash function specially designed to enable this operation. One simple example is adding up the values of each character in the substring. Then, we can use this formula to compute the next hash value in constant time:
hash(s[i+1..i+m]) = hash(s[i..i+m-1]) - s[i] + s[i+m]
This simple function works, but will result in statement 6 being executed more often than with more sophisticated rolling hash functions such as those discussed in the next section.
 
Notice that if we're very unlucky, or have a very bad hash function such as a constant function, line 6 might very well be executed ''n'' times, on every iteration of the loop. Because it requires &Omega;(m) time, the whole algorithm then takes a worst-case &Omega;(mn) time.
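As a concrete illustration, the pseudocode above can be sketched in Python using the simple additive rolling hash just described (this is a minimal sketch for exposition, not the preferred hash function; the function name and 0-based indexing are choices made here):

```python
def rabin_karp(s, sub):
    """Return the 0-based index of the first occurrence of sub in s, or -1.

    Uses the additive rolling hash: the hash of a window is the sum of its
    character codes, so sliding the window one position means subtracting
    the outgoing character and adding the incoming one.
    """
    n, m = len(s), len(sub)
    if m == 0 or m > n:
        return -1
    hsub = sum(ord(c) for c in sub)     # hash of the pattern (line 2)
    hs = sum(ord(c) for c in s[:m])     # hash of the first window (line 3)
    for i in range(n - m + 1):
        # Do a real string comparison only when the hashes match (line 6).
        if hs == hsub and s[i:i + m] == sub:
            return i
        if i + m < n:
            # Rolling update: drop s[i], add s[i + m] (line 8).
            hs = hs - ord(s[i]) + ord(s[i + m])
    return -1
```

For example, `rabin_karp("abracadabra", "cad")` returns 4, the 0-based position at which "cad" begins.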
 
== Hash function used ==
The key to Rabin-Karp performance is the efficient computation of [[hash value]]s of the successive substrings of the text. Rabin-Karp achieves this by treating every substring as a number in some base, the base being usually a big [[prime]]. For example, if the substring is "hi" and the base is 101, the hash value would be 104 * 101^1 + 105 * 101^0 = 10609 ([[ASCII]] of 'h' is 104 and of 'i' is 105).

Technically, this is only similar to a true representation of a number in a non-decimal system, since, for example, the "base" can be smaller than some of the "digits". See [[hash function]] for a much more detailed discussion. The essential benefit of such a representation is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.

For example, if we have text "abracadabra" and we are searching for a pattern of length 3, we can compute the hash of "bra" from the hash for "abr" (the previous substring) by subtracting the number added for the first 'a' of "abr", i.e. 97 * 101^2 (97 is ASCII for 'a' and 101 is the base we are using), multiplying by the base and adding for the last a of "bra", i.e. 97 * 101^0 = 97. If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.
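The arithmetic in this example can be checked directly. Here is a small sketch (the helper name `poly_hash` is a choice made here; the ASCII values and the base 101 are those used in the text):

```python
BASE = 101

def poly_hash(s, base=BASE):
    """Hash a string by treating it as a number in the given base,
    most significant character first."""
    h = 0
    for c in s:
        h = h * base + ord(c)
    return h

# Hash of the previous window "abr": 97*101^2 + 98*101^1 + 114*101^0.
h_abr = poly_hash("abr")

# Roll to "bra": subtract the leading 'a' term (97 * 101^2),
# multiply by the base, then add the trailing 'a' (97 * 101^0).
h_bra = (h_abr - ord('a') * BASE**2) * BASE + ord('a')

# The constant-time update agrees with recomputing the hash from scratch.
assert h_bra == poly_hash("bra")
```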

Theoretically, other hash functions could provide convenient recomputation, e.g. multiplying together the ASCII values of all characters, so that shifting the substring would only entail dividing by the first character and multiplying by the last. The limitation, however, is the limited size of integer [[data type]]s and the necessity of using [[modular arithmetic]] to scale down the hash results (see the [[hash function]] article); meanwhile, naive hash functions that do not produce large numbers quickly, like just adding ASCII values, are likely to cause many [[hash collision]]s and hence slow down the algorithm. Hence the described hash function is typically the preferred one in Rabin-Karp.
 
== Rabin-Karp and multiple pattern search ==
Rabin-Karp is inferior to the [[Knuth-Morris-Pratt algorithm]], the [[Boyer-Moore string searching algorithm]] and other faster single-pattern [[string searching algorithm]]s because of its slow worst-case behavior. However, Rabin-Karp is an algorithm of choice for multiple pattern search.
 
That is, if we want to find any one of a large number, say ''k'', of fixed-length patterns in a text, we can create a simple variant of Rabin-Karp that uses a [[hash table]] or any other [[set data structure]] to check whether the hash of a given string belongs to the set of hash values of the patterns we are looking for:
 
'''function''' RabinKarpSet(''string'' s[1..n], ''set'' of ''string'' subs, m) {
    ''set'' hsubs := emptySet
    '''for each''' sub '''in''' subs
        insert hash(sub[1..m]) into hsubs
    hs := hash(s[1..m])
    '''for''' i '''from''' 1 '''to''' n-m+1
        '''if''' hs &isin; hsubs
            '''if''' s[i..i+m-1] = a substring with hash hs
                '''return''' i
        hs := hash(s[i+1..i+m])
}
 
Here we assume all the substrings have a fixed length ''m'', but this assumption can be eliminated. We simply compare the current hash value against the hash values of all the substrings simultaneously using a quick lookup in our set data structure, and then verify any match we find against all substrings with that hash value.
 
Other algorithms can search for a single pattern in O(''n'') time, hence they will search for ''k'' patterns in O(''n''&middot;''k'') time. The variant of Rabin-Karp above will still work in O(''n'') time in the best and average case, because a hash table allows checking in O(1) time whether a substring hash equals any of the pattern hashes. We can also ensure O(''mn'' log ''k'') time in the worst case, where ''m'' is the length of the longest of the ''k'' strings, by storing the hashes in a [[self-balancing binary search tree]] instead of a hash table.
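The set variant above can be sketched in Python as follows (a minimal illustration assuming all patterns share length ''m''; the additive rolling hash is used here only for brevity, and a dict maps each hash to the patterns carrying it so collisions can be verified):

```python
def rabin_karp_set(s, subs, m):
    """Return the 0-based position of the first occurrence in s of any
    pattern in subs (all of length m), or -1 if none occurs."""
    n = len(s)
    if m == 0 or m > n:
        return -1
    # Map each pattern hash to the set of patterns with that hash,
    # so one lookup checks the window against all k patterns at once.
    hsubs = {}
    for sub in subs:
        hsubs.setdefault(sum(ord(c) for c in sub), set()).add(sub)
    hs = sum(ord(c) for c in s[:m])     # hash of the first window
    for i in range(n - m + 1):
        # Verify against the actual patterns only on a hash hit.
        if hs in hsubs and s[i:i + m] in hsubs[hs]:
            return i
        if i + m < n:
            # Rolling update: drop s[i], add s[i + m].
            hs = hs - ord(s[i]) + ord(s[i + m])
    return -1
```

For example, `rabin_karp_set("abracadabra", {"cad", "bra"}, 3)` returns 1, since "bra" first occurs at 0-based position 1, earlier than "cad" at position 4.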
 
==External links==