{{Short description|String-searching algorithm}}
{{Infobox algorithm
|name = Aho–Corasick Algorithm
|class = String Searching, String Matching
|image = A diagram of the Aho-Corasick string search algorithm.svg
|caption = A diagram of an Aho–Corasick automaton
|data = [[Finite-state machine]] of strings
|time = <!-- Worst time big-O notation -->
|best-time =
|average-time =
|space = <!-- Worst-case space complexity; auxiliary space (excluding input) if not specified -->
}}
 
In [[computer science]], the '''Aho–Corasick algorithm''' is a [[string-searching algorithm]] invented by [[Alfred V. Aho]] and Margaret J. Corasick in 1975.<ref>{{cite journal |first1=Alfred V. |last1=Aho |authorlink=Alfred Aho |first2=Margaret J. |last2=Corasick |date=June 1975 |title=Efficient string matching: An aid to bibliographic search |journal=Communications of the ACM |volume=18 |issue=6 |pages=333&ndash;340 |doi=10.1145/360825.360855 |mr=0371172|s2cid=207735784 |doi-access=free }}</ref> It is a kind of dictionary-matching [[algorithm]] that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all strings simultaneously. The [[time complexity|complexity]] of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches. Because all matches are found, multiple matches will be returned for one string ___location if multiple strings from the dictionary match at that ___location (e.g. dictionary = {{mono|a}}, {{mono|aa}}, {{mono|aaa}}, {{mono|aaaa}} and input string is {{mono|aaaa}}).
 
Informally, the algorithm constructs a [[finite-state machine]] that resembles a [[trie]] with additional links between the various internal nodes. These extra internal links allow fast transitions between failed string matches (e.g. a search for {{mono|cart}} in a trie that does not contain {{mono|cart}}, but contains {{mono|art}}, and thus would fail at the node prefixed by {{mono|car}}), to other branches of the trie that share a common suffix (e.g., in the previous case, a branch for {{mono|attribute}} might be the best lateral transition). This allows the automaton to transition between string matches without the need for backtracking.
 
When the string dictionary is known in advance (e.g. a [[computer virus]] database), the construction of the automaton can be performed once off-line and the compiled automaton stored for later use. In this case, its run time is linear in the length of the input plus the number of matched entries.
 
The Aho–Corasick string-matching algorithm formed the basis of the original [[List of Unix commands|Unix command]] [[grep#Implementations|fgrep]].
 
==History==
 
Like many inventions at Bell Labs at the time, the Aho–Corasick algorithm arose serendipitously, out of a conversation between the two after a seminar given by Aho. Corasick was an information scientist who had received her PhD a year earlier at Lehigh University. There, she wrote her dissertation on securing proprietary data within open systems, examining the commercial, legal, and governmental structures as well as the technical tools that were emerging at the time.<ref>{{cite thesis |last=Corasick |first=Margaret J. |title=A Study of the Means of Protection and the Legal Framework in Which Such Protection Is Undertaken |year=1974 | institution =Lehigh University |url=https://preserve.lehigh.edu/_flysystem/fedora/2024-12/7510364.pdf}}</ref> In a similar vein, at Bell Labs, she was building a tool for researchers to learn about current work being done under government contractors by searching government-provided tapes of publications.
 
For this, she wrote a primitive program that searched the tapes one keyword at a time. Such an approach scaled poorly when many keywords were searched, and one of the bibliographers using her program hit the $600 usage limit on the Bell Labs machines before their lengthy search even finished.
 
She ended up attending a seminar on algorithm design by Aho, and afterwards they got to speaking about her work and this problem. Aho suggested improving the efficiency of the program using the approach of the now Aho–Corasick algorithm, and Corasick designed a new program based on those insights. This lowered the running cost of that bibliographer's search from over $600 to just $25, and Aho–Corasick was born.<ref>{{cite video |last=Aho |first=Alfred |title=Alfred V. Aho Oral History |url=https://www.youtube.com/watch?v=ir9Mwl9zhIM&t=3598s |publisher=YouTube |date=2023-08-12 |access-date=2025-04-18}}</ref>
 
==Example==
In this example, we will consider a dictionary consisting of the following words: {a, ab, bab, bc, bca, c, caa}.
 
The graph below is the Aho–Corasick data structure constructed from the specified dictionary. Each row in the table represents a node in the trie; the ''Path'' column indicates the (unique) sequence of characters from the root to the node.
 
The data structure has one node for every prefix of every string in the dictionary. So if (bca) is in the dictionary, then there will be nodes for (bca), (bc), (b), and (). If a node is in the dictionary then it is a blue node. Otherwise it is a grey node.
 
There is a black directed "child" arc from each node to a node whose name is found by appending one character. So there is a black arc from (bc) to (bca).
 
There is a blue directed "suffix" arc from each node to the node that is the longest possible strict suffix of it in the graph. For example, for node (caa), its strict suffixes are (aa) and (a) and (). The longest of these that exists in the graph is (a). So there is a blue arc from (caa) to (a). The blue arcs can be computed in linear time by performing a [[breadth-first search]] starting from the root (a potential suffix node is always at a shallower level, so it has already been processed when it is needed). The target for the blue arc of a visited node can be found by following its parent's blue arc to its longest suffix node and searching for a child of the suffix node whose character matches that of the visited node. If the character does not exist as a child, we can find the next longest suffix (following the blue arc again) and then search for the character. We can do this until we either find the character (as a child of a node) or we reach the root (which will always be a suffix of every string).
 
There is a green "dictionary suffix" arc from each node to the next node in the dictionary that can be reached by following blue arcs. For example, there is a green arc from (bca) to (a) because (a) is the first node in the dictionary (i.e. a blue node) that is reached when following the blue arcs to (ca) and then on to (a). The green arcs can be computed in linear time by repeatedly traversing blue arcs until a blue node is found, and [[memoization|memoizing]] this information.
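The construction just described can be sketched in Python. This is an illustrative sketch written for this example, not code from Aho and Corasick's paper; the node layout and the helper <code>node_at</code> are this sketch's own conventions.

```python
from collections import deque

def build_automaton(dictionary):
    """Build the Aho-Corasick trie with suffix and dictionary-suffix links."""
    # One node per prefix of every dictionary string; node 0 is the root ().
    nodes = [{'children': {}, 'suffix': 0, 'dict_suffix': None, 'in_dict': False}]
    for word in dictionary:                     # black "child" arcs
        cur = 0
        for ch in word:
            if ch not in nodes[cur]['children']:
                nodes.append({'children': {}, 'suffix': 0,
                              'dict_suffix': None, 'in_dict': False})
                nodes[cur]['children'][ch] = len(nodes) - 1
            cur = nodes[cur]['children'][ch]
        nodes[cur]['in_dict'] = True
    # Blue "suffix" arcs by breadth-first search: a suffix target is always at
    # a shallower level, so it is already final when its dependants need it.
    queue = deque(nodes[0]['children'].values())
    while queue:
        u = queue.popleft()
        # Green "dictionary suffix" arc: nearest dictionary node via blue arcs.
        s = nodes[u]['suffix']
        nodes[u]['dict_suffix'] = s if nodes[s]['in_dict'] else nodes[s]['dict_suffix']
        for ch, v in nodes[u]['children'].items():
            # Walk blue arcs from the parent's suffix until a child labelled
            # ch exists, or the root is reached.
            s = nodes[u]['suffix']
            while s and ch not in nodes[s]['children']:
                s = nodes[s]['suffix']
            nodes[v]['suffix'] = nodes[s]['children'].get(ch, 0)
            queue.append(v)
    return nodes

def node_at(nodes, path):
    """Index of the node reached from the root by following `path`."""
    cur = 0
    for ch in path:
        cur = nodes[cur]['children'][ch]
    return cur
```

For the example dictionary {a, ab, bab, bc, bca, c, caa}, this builds eleven nodes (one per distinct prefix), and <code>nodes[node_at(nodes, "caa")]['suffix']</code> equals <code>node_at(nodes, "a")</code>, matching the blue arc from (caa) to (a) described above.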
 
[[File:Ahocorasick.svg|thumb|left|A visualization of the trie for the dictionary on the right. Suffix links are in blue; dictionary suffix links in green. Nodes corresponding to dictionary entries are highlighted in blue.]]
 
{| class="wikitable"
|+ Dictionary {a, ab, bab, bc, bca, c, caa}
|-
! Path
! In dictionary
! Suffix link
! Dict suffix link
|-
| () || &ndash; || ||
|-
| (a) || + || () ||
|-
| (ab) || + || (b) ||
|-
| (b) || &ndash; || () ||
|-
| (ba) || &ndash; || (a) || (a)
|-
| (bab) || + || (ab) || (ab)
|-
| (bc) || + || (c) || (c)
|-
| (bca) || + || (ca) || (a)
|-
| (c) || + || () ||
|-
| (ca) || &ndash; || (a) || (a)
|-
| (caa) || + || (a) || (a)
|}
{{clear}}
 
At each step, the current node is extended by finding its child; if that child does not exist, by finding its suffix's child; if that does not exist either, by finding its suffix's suffix's child; and so on, finally ending in the root node if nothing has been seen before.
 
When the algorithm reaches a node, it outputs all the dictionary entries that end at the current character position in the input text. This is done by printing every node reached by following the dictionary suffix links, starting from that node, and continuing until it reaches a node with no dictionary suffix link. In addition, the node itself is printed, if it is a dictionary entry.
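The stepping and output rules above can be sketched together in Python. This is an illustrative, self-contained sketch written for this example (the first half rebuilds the trie with suffix and dictionary-suffix links described earlier; it is not code from the original sources):

```python
from collections import deque

def aho_corasick_search(text, dictionary):
    """Return (end_position, word) for every dictionary match in text."""
    # --- Build the automaton: trie, suffix links, dictionary-suffix links. ---
    children, suffix, dict_suffix, word_at = [{}], [0], [None], [None]
    for word in dictionary:
        cur = 0
        for ch in word:
            if ch not in children[cur]:
                children.append({}); suffix.append(0)
                dict_suffix.append(None); word_at.append(None)
                children[cur][ch] = len(children) - 1
            cur = children[cur][ch]
        word_at[cur] = word
    queue = deque(children[0].values())
    while queue:
        u = queue.popleft()
        s = suffix[u]
        dict_suffix[u] = s if word_at[s] is not None else dict_suffix[s]
        for ch, v in children[u].items():
            s = suffix[u]
            while s and ch not in children[s]:
                s = suffix[s]
            suffix[v] = children[s].get(ch, 0)
            queue.append(v)
    # --- Scan the text. ---
    matches, state = [], 0
    for pos, ch in enumerate(text, start=1):
        # Extend the current node: child, else suffix's child, and so on,
        # ending in the root if the character has not been seen before.
        while state and ch not in children[state]:
            state = suffix[state]
        state = children[state].get(ch, 0)
        # Output the node itself (if a dictionary entry), then every node
        # reached by following dictionary suffix links.
        node = state if word_at[state] is not None else dict_suffix[state]
        while node is not None:
            matches.append((pos, word_at[node]))
            node = dict_suffix[node]
    return matches
```

On the example input {{mono|abccab}} with the dictionary {a, ab, bab, bc, bca, c, caa}, this reports a match of {{mono|a}} ending at position 1, {{mono|ab}} at position 2, both {{mono|bc}} and {{mono|c}} at position 3, {{mono|c}} at position 4, {{mono|a}} at position 5, and {{mono|ab}} at position 6.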
 
Execution on input string {{mono|abccab}} yields the following steps:
 
{| class="wikitable"
|+ Analysis of input string {{mono|abccab}}
|-
! Node
! Remaining string
! Output: end position
! Transition
! Output
|-
| () || abccab || || start at root ||
|-
| (a) || bccab || a:1 || () to child (a) || current node
|-
| (ab) || ccab || ab:2 || (a) to child (ab) || current node
|-
| (bc) || cab || bc:3, c:3 || (ab) to suffix (b) to child (bc) || current node, dict suffix (c)
|-
| (c) || ab || c:4 || (bc) to suffix (c) to suffix () to child (c) || current node
|-
| (ca) || b || a:5 || (c) to child (ca) || dict suffix (a)
|-
| (ab) || || ab:6 || (ca) to suffix (a) to child (ab) || current node
|}
 
==Dynamic search list==
 
The original Aho–Corasick algorithm assumes that the set of search strings is fixed. It does not directly apply to applications in which new search strings are added during application of the algorithm. An example is an interactive indexing program, in which the user goes through the text and highlights new words or phrases to index as they see them. [[Bertrand Meyer]] introduced an incremental version of the algorithm in which the search string set can be incrementally extended during the search, retaining the algorithmic complexity of the original.<ref>{{cite journal |first1=Bertrand |last1=Meyer |authorlink=Bertrand Meyer |date=1985 |title=Incremental string matching |journal=Information Processing Letters |volume=21 |issue=5 |pages=219&ndash;227 | doi=10.1016/0020-0190(85)90088-2 | url=http://se.ethz.ch/~meyer/publications/string/string_matching.pdf}}</ref>
==See also==
* [[Commentz-Walter algorithm]]
* [[fgrep]]

==References==
{{reflist}}
 
==External links==
{{Commonscat}}
* [http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf Set Matching and Aho–Corasick Algorithm], lecture slides by Pekka Kilpeläinen
* [http://www.codeproject.com/KB/recipes/ahocorasick.aspx Aho–Corasick string matching in C#], a CodeProject tutorial by Tomáš Petříček
* [http://aresio.blogspot.com/2010/06/aho-corasick-multiple-pattern-match-in.html Aho-Corasick string matching in PHP], optimized for terms-filtering
* [https://sourceforge.net/projects/multifast/ Aho-Corasick C implementation]
* [http://pypi.python.org/pypi/ahocorasick/0.9 Aho-Corasick Python wrapper]
* [http://papercruncher.com/2012/02/26/string-searching-using-aho-corasick/ Aho-Corasick explained using Python]
* [https://hkn.eecs.berkeley.edu/~dyoo/java/index.html Aho-Corasick Java implementation]
* [http://blog.ivank.net/aho-corasick-algorithm-in-as3.html Animated Aho-Corasick], an animated implementation of Aho-Corasick for learning purposes
* [https://xlinux.nist.gov/dads/HTML/ahoCorasick.html Aho–Corasick] in NIST's [[Dictionary of Algorithms and Data Structures]] (2019-07-15)
* [https://kenneth-ye.github.io/AhoCorasick-Algorithm-Visualizer/ Aho-Corasick Algorithm Visualizer]

{{strings}}

{{DEFAULTSORT:Aho-Corasick String Matching Algorithm}}
[[Category:String matching algorithms]]