Talk:Rabin–Karp algorithm: Difference between revisions

{{WikiProject banner shell|class=C|
{{WikiProject Computer science |importance=mid |class=C}}
{{WikiProject Computing |class=C |importance=Low |science=y |science-importance=Mid |software=y |software-importance=Low}}
}}
 
*<s>'''Support''' With both forms in use, I'm fine defaulting to alphabetical order. --[[User:BDD|BDD]] ([[User talk:BDD|talk]]) 17:50, 29 April 2014 (UTC)</s>
*'''Neutral'''. Either is good for me. Cormen uses RK, and RK sounds better than KR (which may be why author order was switched -- leading with a double stop is awkward). [[User:Glrx|Glrx]] ([[User talk:Glrx|talk]]) 23:37, 29 April 2014 (UTC)
*'''Strong, speedy support''' - dang, somewhere someone messed up big time to botch this order. Follow the order they wrote it in. [[User:Red Slash|<span style="color:#FF4131;">Red </span>]][[User talk:Red Slash|<b><span style="color:#460121;">Slash</span></b>]] 00:39, 30 April 2014 (UTC)
* '''Oppose''' – in both scholar search and book search, Rabin–Karp is somewhat more common than Karp–Rabin (at least, when I look). I don't know why, but I think it's OK to stick with what it's more commonly called. Unless someone can show that these are due to wiki mirroring. [[User:Dicklyon|Dicklyon]] ([[User talk:Dicklyon|talk]]) 04:27, 2 May 2014 (UTC)
**FWIW, Google scholar gives me about 820 hits for "Karp-Rabin" (as a quoted string) and about 733 for "Rabin-Karp". I don't put much stock in general web searches for this sort of topic. —[[User:David Eppstein|David Eppstein]] ([[User talk:David Eppstein|talk]]) 05:05, 2 May 2014 (UTC)
***My observation was based on a quick look at Google books. [https://www.google.com.au/search?q=%22Karp-Rabin%22+algorithm&tbm=bks] [https://www.google.com.au/search?q=%22Rabin-Karp%22+algorithm&tbm=bks] It's not an overwhelming result, but the trend looked clear to me. Certainly enough to require more evidence before moving. [[User:Andrewa|Andrewa]] ([[User talk:Andrewa|talk]]) 08:02, 7 May 2014 (UTC)
<hr />
:''The above discussion is preserved as an archive of a [[WP:RM|requested move]]. <span style="color:red">'''Please do not modify it.'''</span> Subsequent comments should be made in a new section on this talk page or in a [[WP:move review|move review]]. No further edits should be made to this section.''</div><!-- Template:RM bottom -->
 
== confusing example data in section 'Hash function used' ==
== Magazine article ==
 
The whole article reads like a popular magazine article.
 
The text says: ''the base being usually the size of the character set.'' The character set used is ASCII, since the pseudocode examples prominently display it, but the base used is 256. ASCII is a set of 7-bit codes. Yet the prime modulus chosen was 101, suspiciously close to the number of printable ASCII characters, 95. It appears to me there was gross confusion about what the base and modulus are supposed to represent, and how they should be chosen to yield a good hash function.
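For concreteness, here is a minimal sketch (not the article's exact pseudocode) of the kind of polynomial hash being discussed: each byte is treated as a digit in base 256, reduced modulo the small example prime 101. The base covers all 8-bit byte values even though printable ASCII has only 95 characters, which is the mismatch complained about above.

```python
def poly_hash(s, base=256, mod=101):
    """Horner-rule polynomial hash of string s: sum of ord(c) * base^k mod `mod`."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h
```

With these parameters, `poly_hash("abr")` evaluates to 4, since (97·256² + 98·256 + 114) mod 101 = 4.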
 
The text says: ''The essential benefit achieved by using a rolling hash such as the Rabin fingerprint...'' A Rabin fingerprint is not a rolling hash - it is a fingerprinting function, similar to a checksum. It may or may not be used in a rolling hash, and most of the time it is used elsewhere.
 
The text says: ''this algorithm is only similar to the true number in a non-decimal system representation...'' How can an algorithm be similar to a number? True number of what??
 
The text says: ''The hash function described here is not a Rabin fingerprint, but it works equally well.'' It does NOT work equally well; in fact it uses a rather poorly chosen prime number. A Rabin fingerprint is carefully designed to reduce collisions to a minimum, possibly to zero. The chosen hash function certainly does not.
 
Watch the technical language: the text says, ''The Rabin fingerprint is a popular and effective rolling hash function.'' No, it isn't a rolling hash function, it's just a hash function, a fingerprinting hash function, to be precise. The term 'rolling hash function' is imprecise diction in any case: rolling hashing is an algorithm that uses a hash function to produce successive hash codes over fixed-length portions of a string. Such a hash code might be called a 'rolling hash'. Furthermore, a Rabin fingerprint isn't confined to rolling hashes - it can be used anywhere a hash function is used.
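The distinction drawn above can be made concrete: "rolling" refers to the update scheme, not the hash itself. A sketch (assuming the same illustrative base-256, mod-101 parameters as elsewhere on this page) of computing successive window hashes where each slide costs O(1) instead of rehashing the whole window:

```python
def rolling_hashes(text, m, base=256, mod=101):
    """Hashes of every length-m window of text, each derived from the
    previous one in O(1) by removing the outgoing character and
    appending the incoming one."""
    if len(text) < m:
        return []
    h = 0
    for ch in text[:m]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    for i in range(m, len(text)):
        h = ((h - ord(text[i - m]) * high) * base + ord(text[i])) % mod
        hashes.append(h)
    return hashes
```

Any hash of the Horner/polynomial form admits this update; that is what makes it usable as a rolling hash, whether or not it is a Rabin fingerprint.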
 
For example, ''A brute-force substring search algorithm checks all possible positions: [pseudocode omitted] This algorithm works well in many practical cases.'' We do this in Computer Science 101, when we're learning about nested loops. About the best we can say about this algorithm is that it will find the substring(s). String search is a critical software function, and doing it well requires a scholarly approach. Brute force is hardly ever a good solution in the software world.
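For reference, the brute-force search being criticized amounts to this (a sketch, not the article's pseudocode): check every alignment of the pattern against the text, worst case O(nm) character comparisons.

```python
def brute_force_search(text, pattern):
    """Return the starting indices of every occurrence of pattern in text,
    by checking each of the n - m + 1 possible alignments."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]
```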
Nobody should do that, nobody would do that. We're not tutoring grade schoolers; this is supposed to be a scholarly article. A rolling hash isn't a trick, it's a well-worn technique.
 
All that pseudocode is how-to (see [[WP:NOTHOWTO]]), and looks to me like original art (see [[WP:NOR]]), which can't be in the encyclopedia. If it's copied from a professional publication, where's the citation (see [[WP:RS]])? And if it is, I'd have to say it's copyrighted (GULP! See [[WP:COPYVIO]]). So we have a triple whammy here.
 
The section '''Shifting substrings search and competing algorithms''' isn't about Rabin-Karp, and shouldn't be in the article. All that may be needed is one sentence stating that the naive approach of scanning the text string a character at a time and then doing a string match (via a hash or any method) at that position, is O(nm). It's self-evident, no need to ramble on.
 
I could go on, but that's enough for now. The article needs to be redrafted entirely for [[WP:TONE]], and it ''must, must, must'' be cited inline, every paragraph, factoid, piece of code or pseudocode, formula, or anything else that might constitute researched and referenceable content. [[User:Sbalfour|Sbalfour]] ([[User talk:Sbalfour|talk]]) 15:46, 26 September 2019 (UTC)
 
==Infeasibility==
The algorithm as given is infeasible: it does ''m'' mods (i.e. divides) including doing the exponentiation, to subtract the high character from the running hash, times ''n'' character positions in the text string = ''nm'' mods. Divide is microprogrammed on modern architectures, and is maybe 3-5 times slower than multiply, and maybe 50-100 times slower than ADD/XOR/SHIFT. You'll give back all or most of the advantage over linear hashing. [[User:Sbalfour|Sbalfour]] ([[User talk:Sbalfour|talk]]) 23:13, 26 September 2019 (UTC)
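The cost complained about here can be hoisted in the standard formulation (a sketch under the page's illustrative base-256, mod-101 parameters, not the article's exact pseudocode): compute base<sup>m-1</sup> mod q once before the scan, so each slide of the window costs a constant number of modular operations rather than a fresh exponentiation.

```python
def rabin_karp(text, pattern, base=256, mod=101):
    """Rabin-Karp scan returning all match positions. The high-order
    weight base^(m-1) mod `mod` is precomputed once, so sliding the
    window is O(1) modular arithmetic per position."""
    n, m = len(text), len(pattern)
    if m == 0 or n < m:
        return []
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + ord(pattern[i])) % mod
        ht = (ht * base + ord(text[i])) % mod
    high = pow(base, m - 1, mod)  # computed once, not per position
    matches = []
    for i in range(n - m + 1):
        if ht == hp and text[i:i + m] == pattern:  # verify on hash hit
            matches.append(i)
        if i < n - m:
            ht = ((ht - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches
```

(When the modulus is fixed at compile time, the remaining mods can also be replaced by multiply/shift reductions, so the per-position cost need not involve a hardware divide at all.)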
 
== Rewrite ==
 
I've been a computer scientist and mathematician for 27 years. I can't work with this kind of topical and at the same time, minutely detailed presentation. The theoretical and technical presentation is absent. This article necessarily involves the GF(2) field and must discuss polynomial remaindering in the field. What's here is probably distilled from a high school lecture, and is the fabrication of the editors. It can't stand up to professional scrutiny. It doesn't give credit to those whose work is relied upon.
Proper mathematical notation and technical diction is entirely absent. We need to start over.[[User:Sbalfour|Sbalfour]] ([[User talk:Sbalfour|talk]]) 02:24, 27 September 2019 (UTC)
 
I'm starting over, section by section, and giving a more formal presentation. If you don't like reading math, you probably don't need a Rabin-Karp hash function. Probably, nobody much needs one.[[User:Sbalfour|Sbalfour]] ([[User talk:Sbalfour|talk]]) 20:31, 27 September 2019 (UTC)
:I don't dispute that the article can use help, but please keep in mind [[WP:TECHNICAL]] and write for the widest audience that can understand this. Since this is commonly taught in undergraduate computer science classes, I think this should at least be readable by freshman computer scientists. Technicality and formality for the sake of technicality and formality (or, to put it more bluntly, obscurantism) are the opposite of helpful here. —[[User:David Eppstein|David Eppstein]] ([[User talk:David Eppstein|talk]]) 23:01, 27 September 2019 (UTC)
::Actually, I think you're right about the general target level for hash functions. However, Rabin-Karp with Rabin fingerprinting necessarily implies knowledge of GF, and that belongs to a college upperclassman in mathematics, or a graduate student in computer science. An undergraduate will not likely come upon Rabin-Karp; it's useful only in exotic situations. I'm dithering over whether it's implicitly graduate level subject matter that I'm going to have to dumb down, and dumbing it down will make it readable, but useful to whom? I don't think I've ever used a rolling hash, and I've written editors and plagiarism detectors and other natural language processing software.[[User:Sbalfour|Sbalfour]] ([[User talk:Sbalfour|talk]]) 23:15, 27 September 2019 (UTC)
:::Modular arithmetic is high school mathematics. You don't need to know Galois theory to understand it. —[[User:David Eppstein|David Eppstein]] ([[User talk:David Eppstein|talk]]) 01:16, 28 September 2019 (UTC)
::::We shouldn't care too much about arithmetic, and more about the properties of the function. They used a modulus of 101 - if we're searching for a string of 3 characters out of a character set of 96 printable characters (but their base was actually 8-bit bytes), there are 96<sup>3</sup>/101 = ~10,000 strings that can map to each hash value. If we use their base, it's 2<sup>24</sup>/101 binary values = ~160,000 strings. The number they used as a modulus, close to the cardinality of the character set, is what should have been the base; the modulus should be an irreducible polynomial over the field relatively prime to the pattern's polynomial. That polynomial is big and choosing it relies on some reduction properties of the field. They used numbers without understanding how operating in the field yields a unique representation of a string, i.e. its fingerprint. The article is junk. [[User:Sbalfour|Sbalfour]] ([[User talk:Sbalfour|talk]]) 17:05, 28 September 2019 (UTC)
:::::Maybe, but edits like [[Special:Diff/918281340|this one]] of yours are no better. Putting polynomials and finite fields into the first sentence of the article is completely inappropriate, especially when the main point of Rabin–Karp has nothing to do with the specific rolling hash used (replace it with any other rolling hash and it would still be the Rabin–Karp algorithm). Also your replacement of asymptotic comparisons with vague and unsourced editorializations ("a marginal advantage") is again completely inappropriate. And your version makes roughly the first half of the article be general considerations about string search that have nothing to do with this specific algorithm; that's far out of balance. I have reverted to an older version before your disimprovements. Stop worrying about putting the details of the hash function into the lead and focus on the algorithm itself. It's not as complicated as you want to make it. Just use a rolling hash function to reduce the number of times you do a more careful exact match. That's what should be in the lead. The details are for later. —[[User:David Eppstein|David Eppstein]] ([[User talk:David Eppstein|talk]]) 17:31, 28 September 2019 (UTC)
::::::You're a distinguished Professor of Computer Science and an ACM fellow! (GULP) You're more qualified to write this article than me, and I wasn't very familiar with it when I started. Throw in something, and I'll do, too. Starting with the definition in the first sentence. Who says that it can find multiple patterns? The seminal paper listed below just says it finds the first occurrence of a pattern in text... i.e. not even multiple occurrences of one pattern.[[User:Sbalfour|Sbalfour]] ([[User talk:Sbalfour|talk]]) 18:24, 28 September 2019 (UTC)
:::::::Just keep running the algorithm and it will find multiple instances of the same pattern. Use a separate hash table to store the hashes of the patterns and it will be able to find multiple different patterns. The same idea, easily generalized. We need published sources for that of course but they don't have to be in the original paper. The subtlety is that the expected time becomes O(n + sum of lengths of matches), which could be larger than n for some inputs. (Also technically I'm not a distinguished professor; that's a specific job title that I don't currently have.) —[[User:David Eppstein|David Eppstein]] ([[User talk:David Eppstein|talk]]) 18:32, 28 September 2019 (UTC)
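The multi-pattern generalization described in the reply above can be sketched as follows (illustrative only, assuming equal-length patterns and the small example parameters used elsewhere on this page): hash every pattern into a table keyed by hash value, then make one rolling pass over the text, verifying each hash hit exactly.

```python
def multi_search(text, patterns, base=256, mod=101):
    """Find all occurrences of any pattern in `patterns` (assumed to share
    one length m) with a single rolling scan and a hash table of the
    pattern hashes, as described above."""
    m = len(next(iter(patterns)))  # assumption: all patterns have length m
    table = {}
    for p in patterns:
        h = 0
        for ch in p:
            h = (h * base + ord(ch)) % mod
        table.setdefault(h, []).append(p)
    high = pow(base, m - 1, mod)
    ht = 0
    for ch in text[:m]:
        ht = (ht * base + ord(ch)) % mod
    hits = []
    for i in range(len(text) - m + 1):
        for p in table.get(ht, ()):      # hash hit: verify exactly
            if text[i:i + m] == p:
                hits.append((i, p))
        if i < len(text) - m:            # slide the window
            ht = ((ht - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits
```

Unequal-length patterns need one scan per distinct length (or a common-prefix trick), and as noted above the expected time gains a term for the total length of reported matches.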
 
== Why does this work? (resolved) ==
 
The hash function suggested in [[Rabin–Karp algorithm#Hash function used]] goes over each character. So I don't see how using the hash is faster than just comparing each character.&nbsp;[[User:AltoStev|<span style='color: orange;'>'''AltoStev'''</span>]] ([[User Talk:AltoStev|<span style='color: #448cff'>talk</span>]]) 23:00, 28 September 2024 (UTC)
 
:Comparing each character to what? A naive search would compare each pair of (position in first string, position in second string). That could be a lot more pairs than characters. —[[User:David Eppstein|David Eppstein]] ([[User talk:David Eppstein|talk]]) 23:40, 28 September 2024 (UTC)
::<s>That would be the same amount of pairs as characters, as the number of pairs is equal to the number of positions, which is equal to the number of characters in the string. &nbsp;[[User:AltoStev|<span style='color: orange;'>'''AltoStev'''</span>]] ([[User Talk:AltoStev|<span style='color: #448cff'>talk</span>]]) 12:09, 30 September 2024 (UTC)</s>
::Oh I see what you mean. I meant, for each index i, the pair (s[i], s_2[i]). &nbsp;[[User:AltoStev|<span style='color: orange;'>'''AltoStev'''</span>]] ([[User Talk:AltoStev|<span style='color: #448cff'>talk</span>]]) 12:20, 30 September 2024 (UTC)
:::But that would only test two strings for equality. This algorithm tests whether one is a substring of the other, at any position. —[[User:David Eppstein|David Eppstein]] ([[User talk:David Eppstein|talk]]) 20:17, 30 September 2024 (UTC)
 
== Probabilistic analysis ==
 
There is a vague statement that the expected time of the algorithm is linear. It would be nice if someone could flesh this out. [[User:AmirOnWiki|AmirOnWiki]] ([[User talk:AmirOnWiki|talk]]) 14:09, 23 June 2025 (UTC)