Rabin–Karp algorithm: Difference between revisions

Content deleted Content added
Dcoetzee (talk | contribs)
A simple application after the intro
Dcoetzee (talk | contribs)
More details
Line 1:
'''Rabin-Karp algorithm''' is a [[string searching algorithm]] that seeks a pattern, i.e. a substring, within a text by using [[hashing]]. It is not widely used for single pattern matching, but is of considerable theoretical importance and is very effective for multiple pattern matching. Historically, it predates the very popular [[Knuth-Morris-Pratt algorithm]]. For text of length ''n'' and pattern of length ''m'', its average and best case running time is [[Big-O(n) notation|O]](meaning proportional to ''n''), but the (highly unlikely) worst case scenarioperformance is O(n*m''nm''), which is one of the reasons why it is not widely used. However, it has the unique advantage of being able to find any one of ''k'' strings or less in O(''n'') time on average, regardless of the size of ''k''.
 
One of the simplest practical applications of Rabin-Karp is in detection of plagiarism. Say, for example, that a student is writing an English paper on ''[[Moby Dick]]''. A cunning professor might locate a variety of source material on Moby Dick and automatically extract a list of all sentences in those materials. Then, Rabin-Karp can rapidly search through a particular paper for any instance of any of the sentences from the source materials. To avoid easily thwarting the system with minor changes, it can be made to ignore details such as case and punctuation by removing these first.
Line 13:
 
== Rabin-Karp and multiple pattern search ==
Rabin-Karp is inferior to [[Knuth-Morris-Pratt algorithm]], [[Boyer-Moore string searching algorithm]] and other faster single pattern [[string searching algorithm]]s because of its slow worst case behavior. However, Rabin-Karp is an algorithm of choice for multiple pattern search. That is, if we want to find any of a large number, say k, fixed length patterns in a text, a variant Rabin-Karp that uses a [[hash table]] to check whether the hash of a given string belongs to a set of hash values of patterns we are looking for. Other algorithms can search for a single pattern in time order O(n), hence they will search for k patterns in time order O(n*k). The variant Rabin-Karp will still work in time order O(n) in the best and average case because a hash table allows to check whether or not substring hash equals any of the pattern hashes in time order of O(1). We can also insure O(''mn''log ''k'') time in the worst case, where ''m'' is the length of the longest of the ''k'' strings, by storing the hashes in a [[self-balancing binary search tree]] instead of a hash table.
 
==External links==