Talk:Aho–Corasick algorithm: Difference between revisions

Content deleted Content added
No edit summary
Tags: Mobile edit Mobile web edit Advanced mobile edit
 
(15 intermediate revisions by 7 users not shown)
Line 1:
{{WikiProject Computerbanner scienceshell|class=startStart|importance=mid}}
{{WikiProject Computer science|importance=mid}}
}}
 
== Untitled ==
Something is wrong with the first sentence in the "Informally, the algorithm constructs a trie ..." paragraph -- some kind of copy-paste replacement error; haven't figured out exactly how to fix it yet. Looks like there isn't a clean version to go back to; the mismatched parenthetical expression hasn't changed since it was added (17:36, 27 July 2006). Anyone? [[User:Dvgrn|dvgrn]] 18:46, 19 September 2006 (UTC)
* Got rid of the obvious error, but may have introduced some subtle inaccuracy -- will have to study the algorithm some more to be sure. [[User:Dvgrn|dvgrn]] 20:40, 19 September 2006 (UTC)
Line 43 ⟶ 46:
 
::I didn't look at either one, just wondered why one and not the other. Funny story, it seems that the algorithm was rediscovered by people working at the NCBI <ref name="NCBI">{{cite web|title=National Center for Biotechnology Information|url=http://www.ncbi.nlm.nih.gov|website=www.ncbi.nlm.nih.gov|publisher=NCBI|accessdate=21 September 2016}}</ref> in developing BLAST <ref name="BLAST">{{cite web|title=BLAST|url=https://blast.ncbi.nlm.nih.gov/Blast.cgi|website=blast.ncbi.nlm.nih.gov|publisher=NCBI|accessdate=21 September 2016}}</ref> software for DNA and protein sequence searching. I used their implementation, and cited it, for a project I was working on, and then someone told me that it was Aho-Corasick. I told the BLAST people, and they agreed. I suppose I could add BLAST to the external links section. It is definitely production level software, produced with government funding, not someone at home. I believe it is all C. [[User:Gah4|Gah4]] ([[User talk:Gah4|talk]]) 18:45, 21 September 2016 (UTC)
 
{{reflist-talk}}
 
== What are the input what is the output ==
Line 54 ⟶ 59:
: Apart from both having applications in text mining, they are nothing alike. Aho-Corasick is a string searching algorithm, while TF-IDF is a term weighting statistic. Given a dictionary, Aho-Corasick builds a tree-like data structure for efficiently matching all the dictionary strings against a text document (i.e., just another string). On the other hand, TF-IDF measures the relevance of a given term in a given document from a particular collection of documents. Let's say that a user is searching the web and inputs a keyword query in a search engine. Then, the TF-IDF for each term in the query can be computed for each document in the collection and used to compute a final document score. Then, documents are ranked from highest to lowest score, in order to present the most relevant documents to the user. Aho-Corasick, on the other hand, might be useful on a situation where you want to identify, let's say, people's names in a collection of documents. You might use the people's names as your dictionary and then process each document to find matches. [[User:José Devezas|José Devezas]] ([[User talk:José Devezas|talk]]) 10:22, 6 June 2018 (UTC)
 
== bab not in tree ==
 
In the presented example the word bab is mentioned among the words, but it is not represented in the automaton... Oh there is another automaton shown later ... the correct one ... the word this would be better replaced by "bellow" [[Special:Contributions/217.153.187.164|217.153.187.164]] ([[User talk:217.153.187.164|talk]]) 18:21, 27 October 2022 (UTC) sorry, I have to remember my account info
 
== Quadratic? ==
 
The phrase in the first paragraph (underlining is mine):
: Note that because all matches are found, there can be a <u>quadratic</u> number of matches if every substring matches
puzzled me. Quadratic with respect to what? Surely there cannot be more matches than strings, so it seems linear WRT the strings. [[User:Викидим|Викидим]] ([[User talk:Викидим|talk]]) 03:06, 22 July 2024 (UTC)
 
:The statement is meant to be "quadratic in length of input", where length of input is total number of characters. But that claim is wrong, it cannot be quadratic - it is a nice exercise to show that there's an instance with <math>\Theta(n \sqrt(n))</math> matches, and that it cannot be asymptotically more. [[User:Al-Quaknaa|Martin Koutecký]] ([[User talk:Al-Quaknaa|talk]]) 19:27, 9 October 2024 (UTC)
::Removed "quadratic". I think that the original statement implied that as the length extends, so does the potential number of entries in the dictionary. [[User:Викидим|Викидим]] ([[User talk:Викидим|talk]]) 23:39, 9 October 2024 (UTC)
:::I've made one more clarification. I think it's important to point out that, unlike with Knuth-Morris-Pratt, you cannot expect complexity linear in the length of the input, so linear in the length of input + output is the best possible. [[User:Al-Quaknaa|Martin Koutecký]] ([[User talk:Al-Quaknaa|talk]]) 11:11, 10 October 2024 (UTC)
::::Undid a part of your change. I do not understand the "longer" part: the input and output of the algorithm are two different data structures, comparing their lengths is definitely nontrivial. [[User:Викидим|Викидим]] ([[User talk:Викидим|talk]]) 17:58, 10 October 2024 (UTC)
:::::I don't understand what's the issue.
:::::The input is a collection of strings and we measure its length by counting the characters in it.
:::::The output is a set of pairs (i,j) such that string i appears at index j. The number of such pairs may be (much) larger than the length of the input.
:::::Since you were not happy with my phrasing, could you please come up with a better phrasing communicating the same fact? In my opinion, the current state is a downgrade from the incorrect statement that there may be quadratically many matches, but at least that conveyed that the "number of matches" really has to be a part of the complexity. This has now gotten lost. [[User:Al-Quaknaa|Martin Koutecký]] ([[User talk:Al-Quaknaa|talk]]) 13:10, 23 October 2024 (UTC)
::::::I have no read-made language as I in turn do not understand your concern. My points are simple:
::::::# While it is easy for me to understand the meaning of words ''string "aaaaa" requires more space than "a"'', I cannot easily imagine a straightforward comparison of sizes of the same string "aaaaa" and of a [[tuple]] (1,1). Either the assumptions shall be listed or the comparison adjectives skipped.
::::::# Similarly, the generic meaning of ''quadratic'' describes a relationship between x and y of the form y = O(x*x). When declaring something to be quadratic, the text should identify not only y, but x also.
::::::# I am not an expert, but to the best of my knowledge AC algorithm is used primarily to find the threat patters in the network traffic. It does not appear to me that the "a", "aa", "aaa", ... is a typical dictionary for this application. I would like therefore to add something along the lines of "pathological" or "atypical" to the description of this example (this requires a [[WP:RS]], of course).
::::::[[User:Викидим|Викидим]] ([[User talk:Викидим|talk]]) 21:44, 24 October 2024 (UTC)