Suffix automaton: Difference between revisions

Content deleted Content added
Aaand it's time to publish the translation
Line 1:
{{Infobox data structure
In [[computer science]], a '''suffix automaton''' is the smallest partial [[deterministic finite automaton]] that recognizes the set of [[Substring|suffixes]] of a given [[String (computer science)|string]]. The state graph of the suffix automaton is called the '''directed acyclic word graph''' (DAWG). The term DAWG is also sometimes used for any [[deterministic acyclic finite state automaton]]. The term suffix automaton is also sometimes used for any automaton recognizing suffixes of a text.
| name = Suffix automaton
| image = File:Suffix_automaton_for_abbcbc.svg
| type = [[Substring index]]
| invented_by = Anselm Blumer; Janet Blumer; [[Andrzej Ehrenfeucht]]; [[David Haussler]]; Ross McConnell
| invented_year = [[1983]]
| space_avg = <math>O(n)</math>
| space_worst = <math>O(n)</math>
}}
In [[computer science]], a '''suffix automaton''' is the smallest partial [[deterministic finite automaton]] that recognizes the set of [[Substring|suffixes]] of a given [[String (computer science)|string]] <math>S=s_1 s_2 \dots s_n</math>. The state graph of the suffix automaton is called the '''directed acyclic word graph''' (DAWG). The term DAWG is also sometimes used for any [[deterministic acyclic finite state automaton]]. Less formally, suffix automaton is the [[directed acyclic graph]] with a dedicated initial vertex and the set of "final" vertices, such that its [[Arc (graph theory)|arcs]] are tagged with [[Character (computing)|letters]], none of its [[Node (graph theory)|nodes]] have two outgoing arcs marked with the same letter and for every suffix of <math>S</math> there exists a path from initial vertex to some final vertex such that [[concatenation]] of letters on the path forms this suffix. Among all graphs satisfying aforementioned constraints suffix automaton is the one having least amount of vertices. For example, the suffix automaton of the string "suffix" accepts the strings "suffix", "uffix", "ffix", "fix", "ix", "x" and the empty string. The automaton can be thought of as a compressed form of the [[trie]] of all suffixes of a text.
 
For example, the suffixSuffix automaton ofwas the string "suffix"discovered acceptsby the strings "suffix", "uffix", "ffix", "fix", "ix", "x" and the empty string. The automaton can be thoughtgroup of asscientists afrom compressed form[[University of the [[trieDenver]] ofand all suffixes[[University of aColorado text. It can be constructedBoulder]] in linear1983. timeThey inhave theshown lengththat offor thea string. <math>S<ref name="blumer"/math> For a string {{mvar|S}} of length at least 2, the suffix automaton has at most <math display="inline">2|S| - 1</math> states and at most <math display="inline">3|S| - 4 </math> transitions and suggested [[linear time]] [[online algorithm]] for its construction.<ref name="blumer"/>Further works have shown close connection between suffix automata and [[Suffix tree|suffix trees]] and discovered several generalizations of suffix automaton, such as compacted suffix automaton obtained by compression of nodes with a single outgoing arc.
 
Suffix automata provide efficient solutions of such problems as [[substring search]], computation of the [[Longest common substring problem|largest common substring]] of two and more strings.
Suffix automata have applications in [[approximate string matching]].<ref name="guided">{{citation | last1 = Navarro | first1 = Gonzalo| doi = 10.1145/375360.375365 | title = A guided tour to approximate string matching | journal = [[ACM Computing Surveys]] | volume = 33| issue = 1| pages = 31–88| year = 2001 | pmid = | pmc = | url = http://repositorio.uchile.cl/bitstream/handle/2250/126168/Navarro_Gonzalo_Guided_tour.pdf| citeseerx = 10.1.1.452.6317}}</ref>
 
==Properties History ==
[[File:Anselm Blumer with DAWG.jpg|thumb|288x288px|Anselm Blumer with a drawing of generalized CDAWG for strings ''ababc'' and ''abcab'']]The concept of suffix automaton was introduced by the group of scientists from [[University of Denver]] and [[University of Colorado Boulder]]: Anselm Blumer, Janet Blumer, [[Andrzej Ehrenfeucht]], [[David Haussler]] and Ross McConnell in 1983, although similar concepts were previously studied alongside suffix trees in the works of Peter Weiner,<ref name=":5" /> Vaughan Pratt<ref>{{harvp|Pratt|1973}}</ref> and [[Anatol Slissenko]].<ref>{{harvp|Slisenko|1983}}</ref> In their initial work Blumer et al. showed that suffix automaton build for the string <math>S</math> of length greater than <math>1</math> has at most <math>2|S|-1</math> states and at most <math>3|S|-4</math> transitions and suggested a linear algorithm for automaton construction.<ref name=":2">{{harvp|Blumer et al.|1984|loc=|p=109}}</ref>
 
In 1983 Mu-Tian Chen and Joel Seiferas independently showed that Weiner's 1973 suffix tree construction algorithm<ref name=":5">{{harvp|Weiner|1973}}</ref> while building suffix tree of the string <math>S</math> constructs suffix automaton of the reversed string <math display="inline">S^R</math> as an auxiliary structure.<ref name=":6">{{harvp|Chen, Seiferas|1985|loc=|p=97}}</ref> In 1987 Blumer et al. applied compressing technique used in suffix trees to suffix automaton and invented so-called compacted suffix automaton (also called compacted directed acyclic word graph or CDAWG).<ref name=":7">{{harvp|Blumer et al.|1987|loc=|p=578}}</ref> In 1997 [[Maxime Crochemore]] and Renaud Vérin developed linear algorithm for direct CDAWG construction.<ref>{{harvp|Crochemore, Vérin|1997|loc=|p=192}}</ref> In 2001 Shunsuke Inenaga et al. developed an algorithm for construction of CDAWG for a set of words given by a [[trie]].<ref name=":8">{{harvp|Inenaga et al.|2001|loc=|p=1}}</ref>
[[File:Dawg.svg|thumb|The suffix automaton of abbab. The primary edges are in black, secondary edges are in gray and suffix pointers are in red. Accepting states are marked by a double circle.]]
 
== Definitions ==
The states of the suffix automaton correspond to classes of end-position equivalent strings, defined as follows:
Usually when speaking about suffix automata and related facts some notion from [[formal language theory]] and [[automata theory]] is used, in particular:<ref name=":1" />
 
* ''[[Alphabet (formal languages)|Alphabet]]'' is a finite [[Set (mathematics)|set]] <math>\Sigma</math> used to construct words. Its elements are called ''characters'';
Suppose we have a string {{mvar|S}}. We say that substrings {{mvar|x}} and {{mvar|y}} of {{mvar|S}} are end-position equivalent if the sets of all end positions of occurrences of {{mvar|x}} and {{mvar|y}} in {{mvar|S}} are the same. This relation partitions the substrings of {{mvar|S}} into [[equivalence class|equivalence classes]] which are in a one-to-one correspondence with the states of the suffix automaton. The strings in the class of a node {{mvar|v}} are spelled out by paths from the source to {{mvar|v}}.
* [[Word (formal languages)|''Word'']] is a finite sequence of characters <math>\omega = \omega_1 \omega_2 \dots \omega_n</math>. ''Length'' of the word <math>\omega</math> is denoted as <math>|\omega|=n</math>;
* [[Formal language|''Formal language'']] is a set of words over given alphabet;
* ''Language of all words'' is denoted as <math>\Sigma^*</math>(where the "*" character stands for [[Kleene star]]), ''empty word'' (the word of zero length) is denoted by the character <math>\varepsilon</math>;
* ''[[Concatenation]] of words'' <math>\alpha = \alpha_1 \alpha_2 \dots \alpha_n</math> and <math>\beta = \beta_1 \beta_2 \dots \beta_m</math> is denoted as <math>\alpha \cdot \beta</math> or <math>\alpha\beta</math> and corresponds to the word obtained by writing <math>\beta</math> to the right of <math>\alpha</math>, that is, <math>\alpha \beta = \alpha_1 \alpha_2 \dots \alpha_n \beta_1 \beta_2 \dots \beta_m</math>;
* ''Concatenation of languages'' <math>A</math> and <math>B</math> is denoted as <math>A \cdot B</math> or <math>AB</math> and corresponds to the set of pairwise concatenations <math>AB = \{\alpha \beta : \alpha \in A, \beta \in B\}</math>;
* If the word <math>\omega\in\Sigma^*</math> may be represented as <math>\omega = \alpha\gamma\beta</math>, where <math>\alpha,\beta,\gamma \in \Sigma^*</math>, then words <math>\alpha</math>, <math>\beta</math> and <math>\gamma</math> are called ''prefix'', ''suffix'' and ''[[subword]]'' (substring) of the word <math>\omega</math> correspondingly;
* If <math>T_l T_{l+1} \dots T_r = S</math> then <math>S</math> is said to ''occur'' in <math>T</math> as a subword. Here <math>l</math> and <math>r</math> are called left and right positions of occurrence of <math>S</math> in <math>T</math> correspondingly.
 
== Automaton structure ==
We call the longest string in an equivalence class of a state {{mvar|v}} the representative of that class, and it is spelled out by the longest path from the source to {{mvar|v}}. The automaton has one sink state, which represents the whole string {{mvar|S}}.
Formally, [[deterministic finite automaton]] is determined by [[5-tuple]] <math>\mathcal A = (\Sigma, Q, q_0, F, \delta)</math>, where:
 
* <math>\Sigma</math> is an ''alphabet'' used to construct words,
There is an edge labelled with {{mvar|a}} from the state represented by {{mvar|x}} to the state represented by {{mvar|y}} if and only if {{math|''xa''}} is end-position equivalent with {{mvar|y}}. Each edge of the suffix automaton is either primary or secondary. An edge labelled with character {{mvar|a}} from the state represented by {{mvar|x}} is primary if and only if {{math|''xa''}} is the representative of an equivalence class. The distinction between primary and secondary edges is important in the construction of the automaton and the analysis of the size of the automaton.<ref name="blumer">{{citation | last1 = Blumer | first1 = A. | first2 = J. | last2 = Blumer | last3 = Haussler | first3 = D. | title = The smallest automation recognizing the subwords of a text. | journal = [[Theoretical Computer Science]] | volume = 40| pages = 31–55| year = 1985 | pmid = | pmc = | doi = 10.1016/0304-3975(85)90157-4 }}</ref> For practical applications the automaton is also usually augmented with suffix pointers, which point from a state represented by {{mvar|x}} to the state represented by the longest proper suffix of {{mvar|x}} that represents a class.
* <math>Q</math> is a set of automaton ''[[State (computer science)|states]]'',
* <math>q_0 \in Q</math> is an ''initial'' state of automaton,
* <math>F \subset Q</math> is a set of ''final'' states of automaton,
* <math>\delta : Q \times \Sigma \mapsto Q</math> is a [[Partial function|partial]] ''transition'' function of automaton, such that <math>\delta(q, \sigma)</math> for <math>q \in Q</math> and <math>\sigma \in \Sigma</math> is either undefined or defines a transition from <math>q</math> over character <math>\sigma</math>.
 
Most commonly deterministic finite automaton is represented as a [[directed graph]] (''diagram'') such that:<ref>{{harvp|Серебряков и др.|2006|loc=|pp=50—54}}</ref>
== Applications ==
 
* Set of graph [[Vertex (graph theory)|vertices]] corresponds to the state of states <math>Q</math>,
The suffix automaton of {{mvar|S}} can be used to efficiently decide whether a query string {{mvar|Q}} is a substring of {{mvar|S}}. This is true if and only if there exists a path starting from the source of the automaton that spells {{mvar|Q}}. This works because every substring of {{mvar|S}} is a prefix of a suffix of {{mvar|S}}.
* Graph has a specific marked vertex corresponding to initial state <math>q_0</math>,
* Graph has several marked vertices corresponding to the set of final states <math>F</math>,
* Set of graph [[Arc (graph theory)|arcs]] corresponds to the set of transitions <math>\delta</math>,
* Specifically, every transition <math display="inline">\delta(q_1, \sigma) = q_2</math> is represented by an arc from <math>q_1</math> to <math>q_2</math> marked with the character <math>\sigma</math>. This transition also may be denoted as <math display="inline">q_1 \begin{smallmatrix}{\sigma}\\[-5pt]{\longrightarrow}\end{smallmatrix} q_2</math>.
 
In terms of its diagram, automaton recognizes the word <math>\omega=\omega_1 \omega_2 \dots \omega_m</math> if and only if there exists a path from initial vertex <math>q_0</math> to some final vertex <math>q \in F</math> such that concatenation of characters on this path forms <math>\omega</math>. Set of words recognized by automaton forms up a language, which is set to be recognized by the automaton. In this terms language recognized by suffix automaton of <math>S</math> is the language of its (possibly empty) suffixes.<ref name=":1">{{harvp|Crochemore, Hancart|1997|pp=3—6|loc=}}</ref>
When combined with [[dynamic programming]], the automaton can be used to answer many interesting questions about {{mvar|S}}. For example, the number of distinct substrings of {{mvar|S}} is equal to the number of distinct paths in the automaton starting from the source, which can be counted using dynamic programming on the automaton. The number of occurrences of any substring {{mvar|Q}} of {{mvar|S}} can be counted by finding the state {{mvar|v}} such that a path from the source to {{mvar|v}} spells {{mvar|Q}} and then counting the number of paths from {{mvar|v}} to the sink state of the automaton.
 
=== Automaton states ===
== Relationships with other suffix structures ==
{{Main|DFA minimization}}
''Right context'' of the word <math>\omega</math> with respect to language <math>L</math> is a set <math>[\omega]_R=\{\alpha : \omega\alpha \in L\}</math>. That is, a set of words <math>\alpha</math> such that their concatenation with <math>\omega</math> forms a word from <math>L</math>. Right contexts induce a natural [[equivalence relation]] <math>[\alpha]_R = [\beta]_R</math> on the set of all words. If language <math>L</math> is recognized by some deterministic finite automaton then there exists unique up to [[isomorphism]] automaton which recognizes the same language and has minimum possible number of states. Such automaton is called [[minimal automaton]] for the given language <math>L</math>. [[Myhill–Nerode theorem]] allows to define it explicitly in terms of right contexts:''<ref>{{harvp|Рубцов|2019|loc=|pp=89—94}}</ref>''<ref>{{harvp|Hopcroft, Ullman|1979|loc=|pp=65—68}}</ref>
 
{{Math theorem
[[File:Suffix structures diamond.svg|thumb|Relationship of the suffix trie, suffix tree, DAWG and CDAWG.<ref name="jewels">{{citation | last1 = Crochemore | first1 = Maxime | first2 = Wojciech | last2 = Rytter | title = Jewels of stringology: text algorithms |title-link= Jewels of Stringology | year = 2003 | pmid = | pmc = }}</ref>]]
| math_statement = Minimal automaton recognizing language <math>L</math> over the alphabet <math>\Sigma</math> may be explicitly defined in the following way:
 
* Alphabet <math>\Sigma</math> stays the same,
* States <math>Q</math> correspond to right contexts <math>[\omega]_R</math> of all possible words <math>\omega \in \Sigma^*</math>,
* Initial state <math>q_0</math> corresponds to the right context of the empty word <math>[\varepsilon]_R</math>,
* Final states <math>F</math> correspond to right contexts <math>[\omega]_R</math> of words from <math>\omega \in L</math>,
* Transitions <math>\delta</math> are given by <math>[\omega]_R \begin{smallmatrix}{\sigma}\\[-5pt]{\longrightarrow}\end{smallmatrix} [\omega\sigma]_R</math>, where <math>\omega \in \Sigma^*</math> and <math>\sigma \in \Sigma</math>.
}}
 
In these terms, ''suffix automaton'' is the minimal deterministic finite automaton recognizing the language of suffixes of the word <math>S=s_1s_2\dots s_n</math>. Right context of the word <math>\omega</math> with respect to this language constitutes of words <math>\alpha</math> such that <math>\omega \alpha</math> is a suffix of <math>S</math>. It allows to formulate the following lemma defining [[bijection]] between right context of the word and the set of right positions of its occurrences in <math>S</math>:<ref name=":9">{{harvp|Blumer et al.|1984|loc=|pp=111—114}}</ref><ref name=":10">{{harvp|Crochemore, Hancart|1997|loc=|pp=27—31}}</ref>
 
{{Math theorem
| math_statement = Let <math>endpos(\omega)=\{r: \omega=s_l \dots s_r \}</math> be the set of right positions of occurrences of <math>\omega</math> in <math>S</math>.
 
There is a following bijection between <math>endpos(\omega)</math> and <math>[\omega]_R</math>:
* If <math>x \in endpos(\omega)</math>, then <math>s_{x+1}s_{x+2}\dots s_n \in [\omega]_R</math>;
* If <math>\alpha \in [\omega]_R</math>, then <math>n-\vert\alpha\vert \in endpos(\omega)</math>.
}}
 
For example, for the word <math>S=abacaba</math> and its subword <math>\omega = ab</math> it holds that <math>endpos(ab)=\{2,6\}</math> and <math>[ab]_R = \{a,acaba\}</math>. Informally, <math>[ab]_R</math> is formed by words which follow occurrences of <math>ab</math> to the end of <math>S</math> and <math>endpos(ab)</math> is formed by right positions of those occurrences. In this example, the element <math>x=2 \in endpos(ab)</math> corresponds to the word <math>s_3 s_4 s_5 s_6 s_7 = acaba \in [ab]_R</math> while the word <math>a \in [ab]_R</math> corresponds to the element <math>7-|a|=6 \in endpos(ab)</math>.
 
It implies several structure properties of suffix automaton states. Let <math>|\alpha| \leq |\beta|</math>, then:<ref name=":10" />
 
* If <math>[\alpha]_R</math> and <math>[\beta]_R</math> have at least one common element <math>x</math>, then <math>endpos(\alpha)</math> and <math>endpos(\beta)</math> have a common element as well. It implies that <math>\alpha</math> is a suffix of <math>\beta</math> and therefore <math>endpos(\beta) \subset endpos(\alpha)</math> and <math>[\beta]_R \subset [\alpha]_R</math>. In aforementioned example, <math>a \in [ab]_R \cap [cab]_R</math>, so <math>ab</math> is a suffix of <math>cab</math> and thus <math>[cab]_R=\{a\} \subset \{a,acaba\} = [ab]_R</math> and <math>endpos(cab) = \{6\} \subset \{2,6\} = endpos(ab)</math>;
* If <math>[\alpha]_R=[\beta]_R</math>, then <math>endpos(\alpha)=endpos(\beta)</math>, thus <math>\alpha</math> occurs in <math>S</math> only as a suffix of <math>\beta</math>. For example, for <math>\alpha=b</math> and <math>\beta = ab</math> it holds that <math>[b]_R = [ab]_R = \{a,acaba\}</math> and <math>endpos(b)=endpos(ab)=\{2,6\}</math>;
* If <math>[\alpha]_R=[\beta]_R</math> and <math>\gamma</math> is a suffix of <math>\beta</math> such that <math>|\alpha| \leq |\gamma| \leq |\beta|</math>, then <math>[\alpha]_R = [\gamma]_R = [\beta]_R</math>. In the example above <math>[c]_R = [bac]_R = \{aba\}</math> and it holds for "intermediate" suffix <math>\gamma = ac</math> that <math>[ac]_R = \{aba\}</math>.
 
That being said, any state <math>q = [\alpha]_R</math> of the suffix automaton recognizes some continuous [[Total order|chain]] of nested suffixes of the longest word recognized by this state.<ref name=":10" />
 
''Left extension'' <math>\overset{\scriptstyle{\leftarrow}}{\gamma}</math> of the string <math>\gamma</math> is the longest string <math>\omega</math> which has the same right context as <math>\gamma</math>. Length <math>|\overset{\scriptstyle{\leftarrow}}{\gamma}|</math> of the longest string recognized by <math>q=[\gamma]_R</math> is denoted by <math>len(q)</math>. It holds that:<ref name=":3">{{harvp|Inenaga et al.|2005|loc=|pp=159—162}}</ref>
 
{{Math theorem
| math_statement = Left extension of <math>\gamma</math> may be represented as <math>\overleftarrow{\gamma} = \beta \gamma</math>, where <math>\beta</math> is the longest word such that any occurrence of <math>\gamma</math> in <math>S</math> is preceded by <math>\beta</math>.
}}
 
''Suffix link'' <math>link(q)</math> of the state <math>q=[\alpha]_R</math> is the pointer to the state <math>p</math> which contains the largest suffix of <math>\alpha</math> which is not recognized by <math>q</math>.
 
In this terms it can be said that <math>q=[\alpha]_R</math> recognizes exactly all suffixes of <math>\overset{\scriptstyle{\leftarrow}}{\alpha}</math> which length is greater than <math>len(link(q))</math> and not greater than <math>len(q)</math>. Moreover, it holds that:<ref name=":3" />
 
{{Math theorem
| math_statement = Suffix links form a [[Tree (computer science)|tree]] <math>\mathcal T(V, E)</math> which may be defined explicitly in the following way:
# [[Vertex (graph theory)|Vertices]] <math>V</math> of the tree correspond to left extensions <math>\overleftarrow{\omega}</math> of all <math>S</math> substrings,
# [[Edge (graph theory)|Edges]] <math>E</math> of the tree connect pairs of vertices <math>(\overleftarrow{\omega},\overleftarrow{\alpha\omega})</math>, such that <math>\alpha \in \Sigma</math> and <math>\overleftarrow{\omega} \neq \overleftarrow{\alpha\omega}</math>.
}}
 
=== Connection with suffix trees ===
[[File:Suffix structures diamond.svg|thumb|Relationship of the suffix trie, suffix tree, DAWG and CDAWG]]
{{Main|Suffix tree}}
''[[Prefix tree]]'' (or ''trie'') is a rooted directed tree which arcs are marked by characters in such way that no vertex <math>v</math> of such tree has two out-going arcs marked with the same character. Some vertices in trie are marked as final. Trie is said to recognize a set of words defined by pathes from its root to final vertices. In this way prefix trees are special kind of deterministic finite automata if you perceive its root as an initial vertex.<ref>{{harvp|Rubinchik, Shur|2018|pp=1—2}}</ref> ''Suffix trie'' of the word <math>S</math> is a prefix tree recognizing set of its suffixes. ''[[Suffix tree]]'' is a tree obtained from suffix trie via the compaction procedure during which consequent edges are merged together if the the [[Degree (graph theory)|degree]] of the vertex between them is equal to 2.<ref name=":3" />
 
By its definition, suffix automaton can be obtained via [[DFA minimization|minimization]] of suffix trie. It may be shown that compacted suffix automaton is obtained by both minimization of suffix tree (if assume that each string on the edge of the suffix tree is a solid character from the alphabet) and compaction of suffix automaton.<ref name=":0">{{harvp|Inenaga et al.|2005|с=|loc=|pp=156—158}}</ref> Besides this connection between suffix tree and suffix automaton of the same string there is as well a connection between the suffix automaton of the string <math>S=s_1 s_2 \dots s_n</math> and the suffix tree of the reversed string <math>S^R = s_n s_{n-1} \dots s_1</math>.<ref name=":11">{{harvp|Fujishige et al.|2016|loc=|pp=1—3}}</ref>
 
Similarly to right contexts one may introduce ''left contexts'' <math>[\omega]_L = \{\beta \in \Sigma^* : \beta \omega \in L\}</math>, ''right extensions'' <math>\overset{\scriptstyle{\rightarrow}}{\omega~}</math> corresponding to the longest string having same left context as <math>\omega</math> and the equivalence relation <math>[\alpha]_L = [\beta]_L</math>. If one considers right extensions with respect to the language <math>L</math> of ''prefixes'' of the string <math>S</math> it may be obtained that:<ref name=":3" />
 
{{Math theorem
| math_statement = Suffix tree of the string <math>S</math> may be defined explicitly in the following way:
* Vertices <math>V</math> of the tree correspond to right extensions <math>\overrightarrow{\omega}</math> of all <math>S</math> substrings,
* Edges <math>E</math> correspond to triplets <math>(\overrightarrow{\omega},x\alpha, \overrightarrow{\omega x})</math> such that <math>x \in \Sigma</math> and <math>\overrightarrow{\omega x}=\overrightarrow{\omega} x \alpha</math>.
 
Here triplet <math>(v_1, \omega, v_2) \in E</math> means that there is an edge from <math>v_1</math> to <math>v_2</math> with the string <math>\omega</math> written on it.
}}
 
Which implies that the suffix link tree of the string <math>S</math> and the suffix tree of the string <math>S^R</math> are isomorphic:<ref name=":11" />
<center>
{| class="wikitable mw-collapsible mw-collapsed"
|-
! Suffix structures of words ''abbcbc'' and ''cbcbba''&nbsp;
|-
|<gallery widths="380" heights="320" class="center">
File:Suffix automaton for abbcbc.svg|Suffix automaton of the word ''abbcbc''
File:Suffix structures.svg|Suffix trie, suffix tree and CDAWG of the word ''abbcbc''
File:Suffix tree for cbcbba.svg|Suffix tree of the word ''cbcbba''<br>(Suffix link tree of the word ''abbcbc'')
</gallery>
|}
</center>
 
Similarly to the case of left extensions, the following lemma holds for right extensions:<ref name=":3" />
 
{{Math theorem
| math_statement = Right extension of the string <math>\gamma</math> may be represented as <math>\overrightarrow{\gamma} = \gamma\alpha</math>, where <math>\alpha</math> is the longest word such that every occurrence of <math>\gamma</math> in <math>S</math> is succeeded by <math>\beta</math>.
}}
 
=== Size ===
Suffix automaton of the string <math>S</math> of length <math>n>1</math> has at most <math>2n-1</math> states and at most <math>3n-4</math> transitions. These bounds are reached on strings <math>abb\dots bb=ab^{n-1}</math> and <math>abb\dots bc=ab^{n-2}c</math> correspondingly.<ref name=":9" /> This may as well be formulated in a stricter way as <math>|\delta| \leq |Q| + n - 2</math> where <math>|\delta|</math> and <math>|Q|</math> are the numbers of transitions and states in automaton correspondingly<ref name=":10" />.
<center>
{| class="wikitable mw-collapsible mw-collapsed"
|-
! Maximal suffix automata
|-
|<gallery widths="380" heights="320" class="center">
File:DAWG for abb...b.svg|Suffix automaton of <math>ab^{n-1}</math>
File:DAWG for abb...bc.svg|Suffix automaton of <math>ab^{n-2}c</math>
</gallery>
|}
</center>
 
== Construction ==
Initially automaton only consists of a single state corresponding to the empty word, then characters of the string are added one by one and the automaton is rebuilt on each step incrementally.<ref name=":12">{{harvp|Crochemore, Hancart|1997|loc=|pp=31—36}}</ref>
 
=== State updates ===
After new character is appended to the string, some equivalence classes are altered. Let <math>[\alpha]_{R_\omega}</math> be the right context of <math>\alpha</math> with respect to the language of <math>\omega</math> suffixes. Then the transition from <math>[\alpha]_{R_\omega}</math> to <math>[\alpha]_{R_{\omega x}}</math> after <math>x</math> is appended to <math>\omega</math> is defined by lemma:<ref name=":10" />
 
{{Math theorem
| math_statement = Let <math>\alpha, \omega \in \Sigma^*</math> be some words over <math>\Sigma</math> and <math>x \in \Sigma</math> be some character from this alphabet. Then there is a following correspondence between <math>[\alpha]_{R_\omega}</math> and <math>[\alpha]_{R_{\omega x} }</math>:
* <math>[\alpha]_{R_{\omega x} } = [\alpha]_{R_\omega}x \cup \{\varepsilon\}</math> if <math>\alpha</math> is a suffix of <math>\omega x</math>;
* <math>[\alpha]_{R_{\omega x} } = [\alpha]_{R_\omega}x</math> otherwise.
}}
 
That means that after adding <math>x</math> to the current word <math>\omega</math> the right context of <math>\alpha</math> may change significantly only if <math>\alpha</math> is a suffix of <math>\omega x</math>. It implies that equivalence relation <math>\equiv_{R_{\omega x} }</math> is a [[Partition of a set#Refinement of partitions|refinement]] of <math>\equiv_{R_\omega}</math>. In other words, if <math>[\alpha]_{R_{\omega x} } = [\beta]_{R_{\omega x} }</math>, then <math>[\alpha]_{R_{\omega} } = [\beta]_{R_{\omega} }</math>. Moreover after the addition of a new character at most two equivalence classes of <math>\equiv_{R_\omega}</math> will be split and each of them may split in at most two new classes. First of all, equivalence class corresponding to [[Empty set|empty]] right context is always split into two equivalence classes, one of them corresponding to <math>\omega x</math> itself and having <math>\{\varepsilon\}</math> as a right context. This new equivalence class contains exactly <math>\omega x</math> and all its suffixes which did not occur in <math>\omega</math>, as the right context of such words was empty before and contains only empty word now.<ref name=":10" />
 
Given the correspondence between states of suffix automaton and vertices of suffix tree, it is possible to find out the second state which may possibly split after new character is appended. The transition from <math>\omega</math> to <math>\omega x</math> corresponds to the transition from <math>\omega^R</math> to <math>x \omega^R</math> in the reversed string. In terms of suffix trees it corresponds to the insertion of new longest suffix <math>x\omega^R</math> into the suffix tree of <math>\omega^R</math>. At most two new vertices may be formed after this insertion: one of them corresponding to <math>x \omega^R</math>, while the other one corresponds to its direct ancestor if there was a branching. Returning to suffix automata it means that the first new state recognizes <math>\omega x</math> and the second one (if there is a second new state) is its suffix link. It may be stated as a lemma:<ref name=":10" />
 
{{Math theorem
| math_statement = Let <math>\omega \in \Sigma^*</math>, <math>x \in \Sigma</math> be some word and character over <math>\Sigma</math>. Also let <math>\alpha</math> be the longest suffix of <math>\omega x</math> which occurs in <math>\omega</math>, and let <math>\beta=\overset{\scriptstyle{\leftarrow} }{\alpha}</math>. Then for any substrings <math>u,v</math> of <math>\omega</math> it holds that:
* If <math>[u]_{R_\omega} = [v]_{R_\omega}</math> and <math>[u]_{R_\omega} \neq [\alpha]_{R_\omega}</math>, then <math>[u]_{R_{\omega x} } = [v]_{R_{\omega x} }</math>;
* If <math>[u]_{R_\omega} = [\alpha]_{R_\omega}</math> and <math>\vert u\vert \leq \vert \alpha\vert </math>, then <math>[u]_{R_{\omega x} } = [\alpha]_{R_{\omega x} }</math>;
* If <math>[u]_{R_\omega} = [\alpha]_{R_\omega}</math> and <math>\vert u\vert > \vert \alpha\vert </math>, then <math>[u]_{R_{\omega x} } = [\beta]_{R_{\omega x} }</math>.
}}
 
It implies that if <math>\alpha=\beta</math> (for example, when <math>x</math> didn't occur in <math>\omega</math> at all and <math>\alpha=\beta=\varepsilon</math>), then only the equivalence class corresponding to the empty right context is split.<ref name=":10" />
 
Besides suffix links it's also needed to define final states of the automaton. It follows from structure properties that all suffixes of a word <math>\alpha</math> recognized by <math>q=[\alpha]_R</math> are recognized by some vertex on ''suffix path'' <math>(q, link(q), link^2(q), \dots)</math> of <math>q</math>. Namely, suffixes with length greater than <math>len(link(q))</math> lie in <math>q</math>, suffixes with length greater than <math>len(link(link(q))</math> but not greater than <math>len(link(q))</math> lie in <math>link(q)</math> and so on. Thus if the state recognizing <math>\omega</math> is denoted by <math>last</math>, then all final states (that is, recognizing suffixes of <math>\omega</math>) form up the sequence <math>(last, link(last), link^2(last), \dots)</math>.<ref name=":12" />
 
=== Transitions and suffix links updates ===
After the character <math>x</math> is appended to <math>\omega</math> possible new states of suffix automaton are <math>[\omega x]_{R_{\omega x} }</math> and <math>[\alpha]_{R_{\omega x} }</math>. Suffix link from <math>[\omega x]_{R_{\omega x} }</math> goes to <math>[\alpha]_{R_{\omega x} }</math> and from <math>[\alpha]_{R_{\omega x} }</math> it goes to <math>link([\alpha]_{R_{\omega} })</math>. Words from <math>[\omega x]_{R_{\omega x} }</math> occur in <math>\omega x</math> only as its suffixes therefore there should be no transitions at all from <math>[\omega x]_{R_{\omega x}}</math> while transitions to it should go from suffixes of <math>\omega</math> having length at least <math>\alpha</math> and be marked with the character <math>x</math>. State <math>[\alpha]_{R_{\omega x}}</math> is formed by subset of <math>[\alpha]_{R_\omega}</math>, thus transitions from <math>[\alpha]_{R_{\omega x}}</math> should be same as from <math>[\alpha]_{R_\omega}</math>. Meanwhile transitions leading to <math>[\alpha]_{R_{\omega x}}</math> should go from suffixes of <math>\omega</math> having length less than <math>|\alpha|</math> and at least <math>len(link([\alpha]_{R_{\omega} }))</math>, as such transitions have lead to <math>[\alpha]_{R_\omega}</math> before and corresponded to seceded part of this state. States corresponding to these suffixes may be determined via traversal of suffix link path for <math>[\omega]_{R_\omega}</math>.<ref name=":12" />
 
<center>
{| class="wikitable mw-collapsible mw-collapsed"
|-
! colspan="2" | Construction of the suffix automaton for the word ''abbcbc''&nbsp;
|-
| style="vertical-align:top;width:50%" align="center" |
{|style="text-align:center;width:600px"
|+'''''∅ → a'''''
|style="background:#D8FBD8;width:50%"|[[File:single letter SA.svg|280x150px]]
|style="background:#FFF8CC;width:50%"|[[File:single letter SA.svg|280x150px]]
|-
|style="vertical-align:top;"|<small>After first character is appended, only one state is created in suffix automaton.</small>
|style="vertical-align:top;"|<small>Similarly, only one leaf is added to the suffix tree.</small>
|}
| style="vertical-align:top;width:50%" align="center" |
{|style="text-align:center;width:600px"
|+'''''a → ab'''''
|style="background:#D8FBD8;width:50%"|[[File:ab SA.svg|280x150px]]
|style="background:#FFF8CC;width:50%"|[[File:ba ST.svg|280x150px]]
|-
|style="vertical-align:top;"|<small>New transitions are drawn from all previous final states as ''b'' didn't appear before.</small>
|style="vertical-align:top;"|<small>For the same reason another leaf is added to the root of the suffix tree.</small>
|}
|-
| style="vertical-align:top;" align="center" |
{|style="text-align:center;width:600px"
|+'''''ab → abb'''''
|style="background:#D8FBD8;width:50%"|[[File:abb SA.svg|280x180px]]
|style="background:#FFF8CC;width:50%"|[[File:bba ST.svg|280x180px]]
|-
|style="vertical-align:top;"|<small>The state 2 recognizes words ''ab'' and ''b'', but only ''b'' is the new suffix, therefore this word is separated into state 4.</small>
|style="vertical-align:top;"|<small>In the suffix tree it corresponds to the split of the edge leading to the vertex 2.</small>
|}
| style="vertical-align:top;" align="center" |
{|style="text-align:center;width:600px"
|+'''''abb → abbc'''''
|style="background:#D8FBD8;width:50%"|[[File:abbc SA.svg|280x180px]]
|style="background:#FFF8CC;width:50%"|[[File:cbba ST.svg|280x180px]]
|-
|style="vertical-align:top;"|<small>Character ''c'' occurs for the first time, so transitions are drawn from all previous final states.</small>
|style="vertical-align:top;"|<small>Suffix tree of reverse string has another leaf added to the root.</small>
|}
|-
| style="vertical-align:top;" align="center" |
{|style="text-align:center;width:600px"
|+'''''abbc → abbcb'''''
|style="background:#D8FBD8;width:50%"|[[File:abbcb SA.svg|280x210px]]
|style="background:#FFF8CC;width:50%"|[[File:bcbba ST.svg|280x210px]]
|-
|style="vertical-align:top;"|<small>State 4 consists of the only word ''b'' which is suffix, thus the state is not split.</small>
|style="vertical-align:top;"|<small>Correspondingly, new leaf is hanged on the vertex 4 in the suffix tree.</small>
|}
| style="vertical-align:top;" align="center" |
{|style="text-align:center;width:600px"
|+'''''abbcb → abbcbc'''''
|style="background:#D8FBD8;width:50%"|[[File:Suffix Automaton extension.svg|280x210px]]
|style="background:#FFF8CC;width:50%"|[[File:Suffix Tree extension.svg|280x210px]]
|-
|style="vertical-align:top;"|<small>The state 5 recognizes words ''abbc'', ''bbc'', ''bc'' and ''c'', but only last two are suffixes of new word, so they're separated into new state 8.</small>
|style="vertical-align:top;"|<small>Correspondingly, edge leading to the vertex 5 is split and vertex 8 is put in the middle of the edge.</small>
|}
|}
</center>
 
 
=== Construction algorithm ===
Theoretical results above lead to the following algorithm which takes character <math>x</math> and rebuilds the suffix automaton of <math>\omega</math> into the suffix automaton of <math>\omega x</math>:<ref name=":12" />
 
# The state corresponding to the word <math>\omega</math> is kept as <math>last</math>;
# After <math>x</math> is appended, previous value of <math>last</math> is stored in the variable <math>p</math> and <math>last</math> itself is reassigned to the new state corresponding to <math>\omega x</math>;
# States corresponding to suffixes of <math>\omega</math> are updated with transitions to <math>last</math>. To do this one should go through <math>p, link(p), link^2(p),\dots</math>, until there is a state which already has a transition by <math>x</math>;
# Once the aforementioned loop is over, there are 3 cases:
## If none of states on the suffix path had a transition by <math>x</math>, then <math>x</math> never occurred in <math>\omega</math> before and the suffix link from <math>last</math> should lead to <math>q_0</math>;
## If the transition by <math>x</math> is found and leads from the state <math>p</math> to the state <math>q</math>, such that <math>len(p)+1=len(q)</math>, then <math>q</math> doesn't have to be split and it's a suffix link of <math>last</math>;
## If the transition is found but <math>len(q) > len(p)+1</math>, then words from <math>q</math> having length at most <math>len(p)+1</math> should be segregated into new "clone" state <math>cl</math>;
# If the previous step was concluded with the creation of <math>cl</math>, then transitions from it and its suffix link should copy those of <math>q</math>, at the same time <math>cl</math> is assigned to be common suffix link of both <math>q</math> and <math>last</math>;
# Transitions which have lead to <math>q</math> before but corresponded to words of the length at most <math>len(p)+1</math> are redirected to <math>cl</math>. To do this one continues going through the suffix path of <math>p</math> until the state is found such that transition by <math>x</math> from it doesn't lead to <math>q</math>.
 
The whole procedure is described by the following pseudo-code:
 
'''function''' {{mvar|add_letter(x)}}:
'''define''' {{mvar|p {{=}} last}}
'''assign''' {{mvar|last {{=}} new_state()}}
'''assign''' {{mvar|len(last) {{=}} len(p) + 1}}
'''while''' {{mvar|δ(p, x)}} is undefined:
'''assign''' {{mvar|δ(p, x) {{=}} last, p {{=}} link(p)}}
'''define''' {{mvar|q {{=}} δ(p, x)}}
'''if''' {{mvar|q {{=}} last}}:
'''assign''' {{mvar|link(last) {{=}} q<sub>0</sub>}}
'''else if''' {{mvar|len(q) {{=}} len(p) + 1}}:
'''assign''' {{mvar|link(last) {{=}} q}}
'''else''':
'''define''' {{mvar|cl {{=}} new_state()}}
'''assign''' {{mvar|len(cl) {{=}} len(p) + 1}}
'''assign''' {{mvar|δ(cl) {{=}} δ(q), link(cl) {{=}} link(q)}}
'''assign''' {{mvar|link(last) {{=}} link(q) {{=}} cl}}
'''while''' {{mvar|δ(p, x) {{=}} q}}:
'''assign''' {{mvar|δ(p, x) {{=}} cl, p {{=}} link(p)}}
 
Here <math>q_0</math> is the initial state of the automaton and <math>new\_state()</math> is a function creating new state for it. It is assumed that <math>last</math>, <math>len</math>, <math>link</math> and <math>\delta</math> are stored as global variables.
 
=== Complexity ===
Complexity of the algorithm may vary depending on the underlying structure used to store transitions of the automaton. It may be implemented in <math>O(n \log |\Sigma|)</math> with <math>O(n)</math> memory overhead or in <math>O(n)</math> with <math>O(n |\Sigma|)</math> memory overhead if one assumes that memory allocation is done in <math>O(1)</math>. To obtain such complexity one has to use the methods of [[amortized analysis]]. Specifically, the value of <math>len(p)</math> strictly reduces with each iteration of the cycle while it may only increase by as much as 1 after the first iteration of the cycle on the next ''add_letter'' call. Overall value of <math>len(p)</math> never exceeds <math>n</math> and it only increased by one between iterations of appending new letters which suggests that total complexity is at most linear as well. The linearity of the second cycle is shown in a similar way.<ref name=":12" />
 
== Generalizations ==
Suffix automaton is closely related to other suffix structures and [[Substring index|substring indices]]. Given a suffix automaton of a specific string one may construct its suffix tree via compacting and recursive traversal in linear time.<ref>{{harvp|Паращенко|2007|loc=|pp=19—22}}</ref> Similar transforms are possible in both directions to switch between the suffix automaton of <math>S</math> and the suffix tree of reversed string <math>S^R</math>.<ref name=":11" /> Other than this several generalizations were developed to construct automaton for the set of strings given by trie,<ref name=":8" /> compacted suffix automation (CDAWG),<ref name=":7" /> to maintain the structure of the automaton on the sliding window,<ref>{{harvp|Blumer|1987|loc=|p=451}}</ref> and to construct it in a bidirectional way, supporting the insertion of a characters to both the beginning and the end of the string.<ref>{{harvp|Inenaga|2003|loc=|p=1}}</ref>
 
=== Compacted suffix automaton ===
As was already mentioned above, compacted suffix automaton is obtained via both compaction of a regular suffix automaton (by removing states which are non-final and have exactly one out-going arc) and the minimization of a suffix tree. Similarly to regular suffix automaton, states of compacted suffix automaton may be defined in explicit manner. ''A two-way extension'' <math>\overset{\scriptstyle{\longleftrightarrow}}{\gamma}</math> of a word <math>\gamma</math> is the longest word <math>\omega = \beta \gamma \alpha</math>, such that every occurrence of <math>\gamma</math> in <math>S</math> is preceded by <math>\beta</math> and succeeded by <math>\alpha</math>. In terms of left and right extensions it means that two-way extension is the left extension of the right extension or, which is equivalent, the right extension of the left extension, that is <math display="inline">\overset{\scriptstyle\longleftrightarrow}{\gamma} = \overset{\scriptstyle\leftarrow}{\overset{\rightarrow}{\gamma}} = \overset{\rightarrow}{\overset{\scriptstyle\leftarrow}{\gamma}}</math>. In terms of two-way extensions compacted automaton is defined as follows:<ref name=":3" />
 
{{Math theorem
| math_statement = Compacted suffix automaton of the word <math>S</math> is defined by a pair <math>(V, E)</math>, where:
 
* <math>V = \{\overleftrightarrow \omega : \omega \in \Sigma^*\}</math> is a set of automaton states;
* <math>E = \{(\overleftrightarrow \omega, x \alpha, \overleftrightarrow {\omega x}) : x \in \Sigma, \alpha \in \Sigma^*, \overleftrightarrow{\omega x} = \overleftrightarrow{\omega} x\alpha\}</math> is a set of automaton transitions.
}}
 
Two-way extensions induce an equivalence relation <math display="inline">\overset{\scriptstyle\longleftrightarrow}{\alpha} = \overset{\scriptstyle\longleftrightarrow}{\beta}</math> which defines the set of words recognized by the same state of compacted automaton. This equivalence relation is a [[transitive closure]] of the relation defined by <math display="inline">(\overset{\scriptstyle{\rightarrow}}{\alpha\,}=\overset{\scriptstyle{\rightarrow}}{\beta\,}) \vee (\overset{\scriptstyle{\leftarrow}}{\alpha} = \overset{\scriptstyle{\leftarrow}}{\beta})</math>, which highlights the fact that compacted automaton may be obtained by both gluing suffix tree vertices equivalent via <math>\overset{\scriptstyle{\leftarrow}}{\alpha} = \overset{\scriptstyle{\leftarrow}}{\beta}</math> relation (minimization of the suffix tree) and gluing suffix automaton states equivalent via <math>\overset{\scriptstyle{\rightarrow}}{\alpha\,}=\overset{\scriptstyle{\rightarrow}}{\beta\,}</math> relation (compaction of suffix automaton).<ref name=":13">{{harvp|Blumer et al.|1987|loc=|p=|pp=585—588}}</ref> If words <math>\alpha</math> and <math>\beta</math> have same right extensions, and words <math>\beta</math> and <math>\gamma</math> have same left extensions, then cumulatively all strings <math>\alpha</math>, <math>\beta</math> and <math>\gamma</math> have same two-way extensions. At the same time it may happen that neither left nor right extensions of <math>\alpha</math> and <math>\gamma</math> coincide. As an example one may take <math>S=\beta=ab</math>, <math>\alpha=a</math> and <math>\gamma=b</math>, for which left and right extensions are as follows: <math>\overset{\scriptstyle{\rightarrow}}{\alpha\,}=\overset{\scriptstyle{\rightarrow}}{\beta\,}=ab=\overset{\scriptstyle{\leftarrow}}{\beta} = \overset{\scriptstyle{\leftarrow}}{\gamma}</math>, but <math>\overset{\scriptstyle{\rightarrow}}{\gamma\,}=b</math> and <math>\overset{\scriptstyle{\leftarrow}}{\alpha}=a</math>. That being said, while equivalence relations of one-way extensions were formed by some continuous chain of nested prefixes or suffixes, bidirectional extensions equivalence relations are more complex and the only thing one may conclude for sure is that strings with the same two-way extension are ''substrings'' of the longest string having the same two-way extension, but it may even happen that they don't have any non-empty substring in common. Total number of equivalence classes for this relation does not exceed <math>n+1</math> which implies that compacted suffix automaton of the string having length <math>n</math> has at most <math>n+1</math> states. The amount of transitions in such automaton is at most <math>2n-2</math>.<ref name=":3" />
 
=== Suffix automaton of several strings ===
Consider a set of waords <math>T=\{S_1, S_2, \dots, S_k\}</math>. It is possbile to construct a generalization of suffix automaton which would recognize the language formed up by suffixes of all words from the set. Constraints for the number of states and transitions in such automaton would stay the same as for a single-word automaton if you put <math>n = |S_1| + |S_2| + \dots + |S_k|</math>.<ref name=":13" /> The algorithms itself is similar to the construction of single-word automaton except that, instead of <math>last</math> state, function ''add_letter'' would work with the state corresponding to the word <math>\omega_i</math> assuming the transition from the set of words <math>\{\omega_1, \dots, \omega_i, \dots, \omega_k\}</math> to the set <math>\{\omega_1, \dots, \omega_i x, \dots, \omega_k\}</math>.<ref>{{harvp|Blumer et al.|1987|loc=|p=|pp=588—589}}</ref><ref>{{harvp|Blumer et al.|1987|loc=|p=593|pp=}}</ref>
 
This idea is further generalized to the case when <math>T</math> is not given explicitly, but instead is given by a [[prefix tree]] with <math>Q</math> vertices. Mohri et al. showed that such automaton would have at most <math>2Q-2</math> and may be constructed in linear time from its size. At the same time the amount of transitions in such automaton may reach <math>O(Q|\Sigma|)</math>, for example for the set of words <math>T=\{\sigma_1, a\sigma_1, a^2\sigma_1, \dots, a^n \sigma_1, a^n \sigma_2, \dots, a^n \sigma_k \}</math> over the alphabet <math>\Sigma=\{a,\sigma_1,\dots,\sigma_k\}</math> the total length of workds is equal to <math display="inline">O(n^2+nk)</math>, the number of vertices in corresponding suffix trie is equal to <math>O(n+k)</math> and corresponding suffix automaton is formed of <math>O(n+k)</math> states and <math>O(nk)</math> transitions. Algorithm suggested by Mohri mainly repeats the generic algorithm for building automaton of several strings, but instead of growing words one by one it traverses the trie in a [[breadth-first search]] order and append new characters as it meet them in the traversal, which guarantees amortized linear complexity.<ref>{{harvp|Mohri et al.|2009|loc=|страницы=|p=|pp=3558—3560}}</ref>
 
=== Sliding window ===
Some [[compression algorithms]], such as [[LZ77]] and [[Run-length encoding|RLE]] may benefit from storing suffix automaton or similar structure not for the whole string but for only last <math>k</math> its characters while the string is updated. This is due to the fact that compressing data is usually expressively large and using <math>O(n)</math> memory is undesirable. In 1985 Janet Blumer developed an algorithm to maintain a suffix automaton on a sliding window of size <math>k</math> in <math>O(nk)</math> worst-case and <math>O(n \log k)</math> on average, assuming that characters are distributed independently and [[Uniform distribution (discrete)|uniformly]]. She also showed that <math>O(nk)</math> complexity can't be improved: if one considers words construed as a concatenation of several <math>(ab)^m c (ab)^m d</math> words, where <math>k=6m+2</math>, then the number of states for the window of size <math>k</math> would frequently change with jumps of order <math>m</math>, which renders even theoretical improvement of <math>O(nk)</math> for regular suffix automata impossible.<ref>{{harvp|Blumer|1987|loc=|p=|pp=461—465}}</ref>
 
Same should be true for suffix tree as its vertices correspond to states of suffix automaton of reversed string, but this issue may be resolved by not storing explicitly every vertex corresponding to the suffix of the whole string, thus only storing vertices with at least 2 out-going edges. Variation of McCreight's suffix tree construction algorithm for this task was suggested in 1989 by Edward Fiala and Daniel Greene,<ref>{{harvp|Fiala, Greene|1989|p=490}}</ref> several years later similar result was obtained with the variation of Ukkonen's algorithm by Jesper Larsson<ref>{{harvp|Larsson|1996}}</ref><ref>{{harvp|Brodnik, Jekovec|2018|p=1}}</ref>. The existence of such algorithm for compacted suffix automaton which absorbs some properties of both suffix trees and suffix automata was an open question for a long time until it was found out by Martin Senft and Tomasz Dvorak in 2008 that it's impossible if the alphabet's size is at least 2.<ref>{{harvp|Senft, Dvořák|2008|p=109}}</ref>
 
One way to overcome this obstacle is to allow window width to variate a bit while staying <math>O(k)</math>. It may be achieved by an approximate algorithm suggested by Inenaga et al. in 2004. The window for which suffix automaton is built in this algorithm is not guaranteed to be of length <math>k</math>, but it's guaranteed to be at least <math>k</math> and at most <math>2k+1</math>while providing linear overall complexity of the algorithm.<ref>{{harvp|Inenaga et al.|2004}}</ref>
 
== Applications ==
Suffix automaton of the string <math>S</math> may be used to solve such problems as:<ref name=":4">{{harvp|Crochemore, Hancart|1997|страницы=|loc=|pp=39—41}}</ref><ref>{{harvp|Crochemore, Hancart|1997|pp=36—39|loc=}}</ref>
 
* Counting the number of distinct substrings of <math>S</math> in <math>O(|S|)</math> on-line,
The directed acyclic word graph (DAWG) of {{mvar|S}} can be obtained by taking the [[trie]] of suffixes of {{mvar|S}} and merging all [[graph isomorphism|isomorphic]] subtrees. The DAWG also has another deep connection with the suffix tree of {{mvar|S}}: The suffix pointers of the DAWG of {{mvar|S}} form a tree which is identical to the [[suffix tree]] of the reverse of {{mvar|S}}.
* Finding the longest substring of <math>S</math> occurring at least twice in <math>O(|S|)</math>,
* Finding the longest common substring of <math>S</math> and <math>T</math> in <math>O(|T|)</math>,
* Counting the number of occurrences of <math>T</math> in <math>S</math> in <math>O(|T|)</math>,
* Finding all occurrences of <math>T</math> in <math>S</math> in <math>O(|T|+k)</math>, where <math>k</math> is the number of occurrences.
 
It is assumed here that <math>T</math> is given on the input after suffix automaton of <math>S</math> is constructed.
Replacing all non-branching paths of the DAWG with a single edge gives a data structure that is called the compact DAWG, or the CDAWG of {{mvar|S}}. The CDAWG is identical to the [[directed acyclic graph|DAG]] obtained by merging all isomorphic subtrees of the suffix tree of {{mvar|S}}.
 
Suffix automata are also used in data compression,<ref>{{harvp|Yamamoto et al.|2014|loc=|p=675}}</ref> music retrieval<ref>{{harvp|Crochemore et al.|2003|loc=|p=211}}</ref><ref>{{harvp|Mohri et al.|2009|loc=|страницы=|p=3553}}</ref> and matching on genome sequences.<ref>{{harvp|Faro|2016|loc=|p=145}}</ref>
==See also==
* [[GADDAG]]
* [[Suffix array]]
* [[Suffix tree]]
 
==References==
Line 40 ⟶ 329:
 
==Further reading==
{{refbegin|2}}
* {{Cite Q|Q29541479|ref={{harvid|Weiner|1973}}}}
*{{citation | doi=10.1109/SPIRE.2001.989743|last1= Inenaga|first1= S.|last2= Hoshino|first2=H.|last3= Shinohara|first3= A. |last4= Takeda|first4= M. |last5= Arikawa|first5= S. |contribution= On-line construction of symmetric compact directed acyclic word graphs|title=Proc. 8th Int. Symp. String Processing and Information Retrieval, 2001. SPIRE 2001|year=2001|pages=96–110|isbn= 978-0-7695-1192-4|citeseerx= 10.1.1.799.9933}}.
* {{Cite Q|Q90300966|ref={{harvid|Pratt|1973}}}}
*{{citation | first1=Maxime | last1=Crochemore | first2=Renaud | last2=Vérin | contribution=Direct construction of compact directed acyclic word graphs | series=[[Lecture Notes in Computer Science]] | publisher=Springer-Verlag | title=Combinatorial Pattern Matching | volume=1264 | year=1997 | pages=116–129 | doi=10.1007/3-540-63220-4_55 | isbn=978-3-540-63220-7 | citeseerx=10.1.1.53.6273 }}.
* {{Cite Q|Q90305414|ref={{harvid|Slisenko|1983}}}}
*{{citation | last1=Epifanio | first1=Chiara | last2=Mignosi | first2=Filippo | last3=Shallit | first3=Jeffrey | last4=Venturini | first4=Ilaria | chapter=Sturmian graphs and a conjecture of Moser | pages=175–187 | editor1-last=Calude | editor1-first=Cristian S. | editor2-last=Calude | editor2-first=Elena | editor3-last=Dineen | editor3-first=Michael J. | title=Developments in language theory. Proceedings, 8th international conference (DLT 2004), Auckland, New Zealand, December 2004 | year=2004 | publisher=[[Springer-Verlag]] | series=Lecture Notes in Computer Science | volume=3340 | isbn=978-3-540-24014-3 | zbl=1117.68454 }}
* {{Cite Q|Q90309073|ref={{harvid|Blumer et al.|1984}}}}
*{{citation | first1=H.H. | last1=Do | first2=W.K. | last2=Sung | contribution=Compressed Directed Acyclic Word Graph with Application in Local Alignment | series=Lecture Notes in Computer Science | publisher=[[Springer-Verlag]] | title=Computing and Combinatorics | volume=6842 | pages=503–518 | doi=10.1007/978-3-642-22685-4_44 | year=2011 | isbn=978-3-642-22684-7 }}
* {{Cite Q|Q90311855|ref={{harvid|Blumer et al.|1987}}}}
* {{Cite Q|Q90327976|ref={{harvid|Blumer|1987}}}}
* {{Cite Q|Q90329833|ref={{harvid|Chen, Seiferas|1985}}}}
* {{Cite Q|Q90335534|ref={{harvid|Inenaga|2003}}}}
* {{Cite Q|Q57518591|ref={{harvid|Inenaga et al.|2005}}}}
* {{Cite Q|Q90341606|ref={{harvid|Inenaga et al.|2001}}}}
* {{Cite Q|Q90345535|ref={{harvid|Inenaga et al.|2004}}}}
* {{Cite Q|Q90348192|ref={{harvid|Yamamoto et al.|2014}}}}
* {{Cite Q|Q90410044|ref={{harvid|Fujishige et al.|2016}}}}
* {{Cite Q|Q90410808|ref={{harvid|Mohri et al.|2009}}}}
* {{Cite Q|Q90412338|ref={{harvid|Faro|2016}}}}
* {{Cite Q|Q90413384|ref={{harvid|Crochemore, Hancart|1997}}}}
* {{Cite Q|Q90413885|ref={{harvid|Crochemore, Vérin|1997}}}}
* {{Cite Q|Q90414195|ref={{harvid|Crochemore et al.|2003}}}}
* {{Cite Q|Q90418603|ref={{harvid|Hopcroft, Ullman|1979}}}}
* {{Cite Q|Q90425560|ref={{harvid|Fiala, Greene|1989}}}}
* {{Cite Q|Q90426624|ref={{harvid|Senft, Dvořák|2008}}}}
* {{Cite Q|Q90427112|ref={{harvid|Larsson|1996}}}}
* {{Cite Q|Q90431196|ref={{harvid|Brodnik, Jekovec|2018}}}}
* {{Cite Q|Q90726647|ref={{harvid|Rubinchik, Shur|2018}}}}
* {{Cite Q|Q90432456|ref={{harvid|Серебряков и др.|2006}}}}
* {{Cite Q|Q90435728|ref={{harvid|Рубцов|2019}}}}
* {{Cite Q|Q90436837|ref={{harvid|Паращенко|2007}}}}
{{refend}}
 
==External links==
*{{Commonscatinline}}
*[https://cp-algorithms.com/string/suffix-automaton.html Suffix automaton] article on E-Maxx Algorithms in English
 
{{Strings |state=collapsed}}