Talk:Rabin–Karp algorithm

This is an old revision of this page, as edited by Dkf11 (talk | contribs) at 22:12, 3 January 2010 (Restore of simple algorithm: new section). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Latest comment: 15 years ago by Dkf11 in topic Restore of simple algorithm

Extending the algorithm

I think it would be nice if the article discussed extending the algorithm for 2 dimensional pattern matching, as well as giving some optimizations in the case of having varying string lengths for multi-pattern matching case.

Concerns

In the pseudocode of the RK string search algorithm, if hs is given as hash(s[1..n]), the algorithm can not detect a target pattern which occurs on the first m bytes of a given text. This is why I reverted it to be hash(s[1...m]).

One arduous but impetuous, so-called, vandalism detector claims that it should be hash(s[1...n]), because, when m is larger than n, the code crashes. That is true, but I reckon such a range check matter is not a major issue in describing the fundamentals of the algorithm.

Considering the following sentence which is excerpted from the article and is supposed to be written by one of the original contributors,

"Lines 2,3, and 6 each require omega(m) time."

if one still thinks hs := hash(s[1...n]) is correct, why doesn't he/she also modify the above sentence too?

footnote) The line 3 of the pseudocode has been in the form of hash(s[1..m]) for the last six months or so since its original edition by the original writer.

--129.254.65.18 08:35, 23 May 2005 (UTC)Reply

I agree with you. We are assuming m ≤ n, and if [1..n] is used, the algorithm won't work, because it'd be trying to compare hashes of strings of length n to hashes of strings of length m. Deco 21:41, 23 May 2005 (UTC)Reply

Line 8 of the multi-pattern-search pseudo-code seems wrong to me

Quote: if s[i..i+m-1] = a substring with hash hs This is exactly the same condition as in the line above (element of hash). As pointed out corretly in the single patters search - to rule out a hash collision we need to compare with the actual search string(s). (Plural only if by fat chance or denial of service attack, two search strings have the same hash.)

Or am I missing something obvious here?--iSee 10:16, 28 July 2005 (UTC)Reply

I need to correct myself. The code is ok, although I don't quite get how the rolling hash would compare varying key-lengths. I improved the code indentation so the code is more readable iSee

No worst case

I removed the following text in the "multiple pattern search" section: "We can also ensure O(m n log k) time in the worst case, where m is the length of the longest of the k strings, by storing the hashes in a self-balancing binary search tree instead of a hash table." Rabin-Karp simply doesn't make sense in the worst case, because the whole idea of hashing requires randomization and can only work in expectation (or with high probability).

That's true. My mistake. Deco 02:51, 31 July 2005 (UTC)Reply

Suggestions

Firstly, it seems like

 1 function RabinKarp(string s[1..n], string sub[1..m])
 2     hsub := hash(sub[1..m])
 3     hs := hash(s[1..m])
 4     for i from 1 to n
 5         if hs = hsub
 6             if s[i..i+m-1] = sub
 7                 return i
 8         hs := hash(s[i+1..i+m])
 9     return not found

could be shortened to

 1 function RabinKarp(string s[1..n], string sub[1..m])
 2     hsub := hash(sub[1..m])
 3     for i from 1 to n
 4         hs := hash(s[i..i+m-1])
 5         if hs = hsub
 6             if s[i..i+m-1] = sub
 7                 return i
 8     return not found

Aside from being a line shorter, I find this more clearly displays what hs is for any given iteration. The same comment applies to the other example of multi-pattern matching. (Edit: Ignore this suggestion--of course you want to keep the structure of the code the way it is to make it easy to introduce rolling hash functions.)

On the topic of the multi-pattern matching code, I think it is worth noting in both the code and the surrounding discussion the possibility for hash table collisions. As it stands, one might wrongly infer from the code that the only possibility is that a hash table either has no entry or a single entry for any given hash code.

Thanks for looking over the article. The collisions are internal to the implementation of the hash table. Here we are treating it as an abstract set data structure, and don't even really care whether it's a hash table, binary search tree, or association list. If you think you can make this clearer please feel free. Deco 05:14, 15 November 2005 (UTC)Reply

StTwister 15:39, 25 January 2007 (UTC): In order to be possible to change compute the hash of the next subsequence of S in constant time (to keep a low complexity), the hash must first be calculated outisde the loop entirely (with a 1 to M iteration) and then only updated inside the loop (remove the ith element from the hash and add the (i+m)-th element to the hash). So I think it's clearer if left this way.Reply

May 23 2007: CRCs are another possible rolling hash function. If you google "Rabin Hash", you'll find a SourceForge project implementing a CRC in Java (though they call it a Rabin fingerprint instead of a CRC).

Some pseudocode fixes (both single and multiple pattern search)

Firstly, changed this in single pattern search:

 hs := hash(s[1..n])

to

 hs := hash(s[1..m])

Secondly, changed the loop from (1 to n) to (1 to n-m+1), because from that points onwards, the hash function would be impossible to calculate (it would get out of sub's range), in both single and multiple pattern search.

Changes by StTwister 15:39, 25 January 2007 (UTC)Reply


Invalid character in last code segmet for searching for server strings

Perhaps it is a browser problem, but on line 7 of the pseudo-code, it appears to be

        if hs ∈ hsubs

which looks like "if hs SQUARE-SYMBOL hsubs". If that is the intended symbol, there should be some explanation of its meaning. —Preceding unsigned comment added by 134.129.106.244 (talk) 00:17, 30 March 2008 (UTC)Reply

That's a browser issue, it's the element-of mathematical symbol. What browser are you on? I generally recommend not using characters that IE6 does not render. Dcoetzee 21:56, 21 April 2008 (UTC)Reply

Probably there is something I have done wrong. But I took a look at the original paper (where the wiki points to) and also the section in the "Introduction to Algorithms" book, but none of them talk about the section in the wiki that explains mulitple pattern search using Bloom filter. Could the reference for that part be posted? is that another paper newer than the original? I haven't been able to find the source.Michael.Do.Wiki (talk) 20:25, 11 March 2009 (UTC)Reply

Referring to

Here we assume all the substrings have a fixed length m, but this assumption can be eliminated. We simply compare the current hash value against the hash values of all the substrings simultaneously using a quick lookup in our set data structure, and then verify any match we find against all substrings with that hash value.

It is not enough to compare the current hash value against the hash values of all the substrings because the current hash value is the hash of the 'm' chars from current ___location and we would like to check all the lengths from the current ___location.

For example we have three substrings: "apple" "chair" "cube" and our search text is "the cubes are blue". when we reach current ___location=4 in the search text ('c') (assuming index starts from 0). So our substring lengths ('m' values) are 5, 5 and 4. So we want to compare the hashes from the search text of:

1. "cubes" (m=5) with hashes of substrings with length of 5 ("apple" & "chair")

and

2. "cube" (m=4) with the hashes of substrings with length of 4 ("cube") —Preceding unsigned comment added by Thedrs (talkcontribs) 15:51, 16 June 2009 (UTC)Reply

Aho and Corasick's algorithm from 1975 finds k patterns in a text of length n in O(n) time after preprocessing. This fact contradicts the statement currently made in the article, namely that this property is unique to Rabin-Karp. 12:20, 19 August 2009 (UTC) —Preceding unsigned comment added by AmirOnWiki (talkcontribs)

To address above concern I have added a link where string search algorithms using finite set of patterns are listed. --Mahmutuludag (talk) 17:18, 19 November 2009 (UTC)Reply

Restore of simple algorithm

I restored the simple algorithm presentation to this page on the grounds that other parts of that section refer to it, making the overall page hard to understand if it isn't there. I hate calling other WP users incompetent, but this felt a lot like it; the removalist didn't even try to read the page afterwards to see if it made basic sense! (Whether or not the example should be here is not my concern; the page should at least read as a consistent entity...) --Donal Fellows (talk) 22:12, 3 January 2010 (UTC)Reply