Suffix array: Difference between revisions

Content deleted Content added
i starts at 1 in the examples
Bender the Bot (talk | contribs)
m External links: HTTP to HTTPS for SourceForge
 
(23 intermediate revisions by 13 users not shown)
Line 1:
{{Short description|Data structure for a string}}
{{Infobox
| above = Suffix array
Line 20 ⟶ 21:
}}
}}
In [[computer science]], a '''suffix array''' is a sorted [[Array data structure|array]] of all [[Suffix (computer science)|suffixes]] of a [[String (computer science)|string]]. It is a data structure used in, among others, full -text indices, data -compression algorithms, and the field of [[bibliometrics]].
 
Suffix arrays were introduced by {{harvtxt|Manber|Myers|1990}} as a simple, space efficient alternative to [[suffix tree]]s. They had independently been discovered by [[Gaston Gonnet]] in 1987 under the name ''PAT array'' {{harv|Gonnet|Baeza-Yates|Snider|1992}}.
Line 26 ⟶ 27:
{{harvtxt|Li|Li|Huo|2016}} gave the first in-place <math>\mathcal{O}(n)</math> time suffix array construction algorithm that is optimal both in time and space, where ''in-place'' means that the algorithm only needs <math>\mathcal{O}(1)</math> additional space beyond the input string and the output suffix array.
 
[[Enhanced suffix array]]sarrays (ESAs) are suffix arrays with additional tables that reproduce the full functionality of suffix trees preserving the same time and memory complexity.{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}}
TheA suffixsorted array forof aonly subsetsome of(rather than all) suffixes of a string is called a [[sparse suffix array]].{{sfn|I|Kärkkäinen|Kempa|2014}} Multiple probabilistic algorithms have been developed to minimize the additional memory usage including an optimal time and memory algorithm.{{sfn|Gawrychowski|Kociumaka|2017}}
 
== Definition ==
Line 162 ⟶ 163:
* fast in practice
 
One of the first algorithms to achieve all goals is the SA-IS algorithm of {{harvtxt|Nong|Zhang|Chan|2009}}. The algorithm is also rather simple (&lt; 100 [[Source lines of code|LOC]]) and can be enhanced to simultaneously construct the [[LCP array]].{{sfn|Fischer|2011}} The SA-IS algorithm is one of the fastest known suffix array construction algorithms. A careful [implementation by Yuta Mori<ref>{{Cite web |last=Mori |first=Yuta |title=sais |url=https://sites.google.com/site/yuta256/sais implementation|url-status=dead by|archive-url=https://web.archive.org/web/20230309123010/https://sites.google.com/site/yuta256/sais Yuta|archive-date=9 Mori]Mar 2023 |access-date=31 Aug 2023}}</ref> outperforms most other linear or super-linear construction approaches.
 
Beside time and space requirements, suffix array construction algorithms are also differentiated by their supported [[Alphabet (computer science)|alphabet]]: ''constant alphabets'' where the alphabet size is bound by a constant, ''integer alphabets'' where characters are integers in a range depending on <math>n</math> and ''general alphabets'' where only character comparisons are allowed.{{sfn|Burkhardt|Kärkkäinen|2003}}
Line 175 ⟶ 176:
Recent work by {{harvtxt|Salson|Lecroq|Léonard|Mouchard|2010}} proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is <math>\mathcal{O}(n \log n)</math>, it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text.
 
In practical [[open source]] work, a commonly used routine for suffix array construction was qsufsort, based on the 1999 Larsson-Sadakane algorithm.<ref>{{cite journal |last1=Larsson |first1=N. Jesper |last2=Sadakane |first2=Kunihiko |title=Faster suffix sorting |journal=Theoretical Computer Science |date=22 November 2007 |volume=387 |issue=3 |pages=258–272 |doi=10.1016/j.tcs.2007.07.017 |language=en |issn=0304-3975|doi-access=free }}</ref> This routine has been superseded by Yuta Mori's DivSufSort, "the fastest known suffix sorting algorithm in main memory" as of 2017. It too can be modified to compute an LCP array. It uses a induced copying combined with Itoh-Tanaka.<ref>{{cite journal |last1=Fischer |first1=Johannes |last2=Kurpicz |first2=Florian |title=Dismantling DivSufSort |arxiv=1710.01896 |date=5 October 2017 |journal=Proceedings of the Prague Stringology Conference 2017}}</ref> In 2021 a faster implementation of the algorithm was presented by Ilya Grebnov <ref>{{Cite web|title=New saca and bwt library (libsais)|url=https://encode.su/threads/3579-New-saca-and-bwt-library-(libsais)|access-date=2021-10-03|website=encode.su}}</ref> which in average showed 65% performance improvement over DivSufSort implementation on the [[Silesia Corpuscorpus]].<ref>{{Citation|last=Grebnov|first=Ilya|title=libsais|date=2021-09-22|url=https://github.com/IlyaGrebnov/libsais|access-date=2021-10-02}}</ref>
 
== Generalized Suffixsuffix Arrayarray ==
 
The concept of a suffix array can be extended to more than one string. This is called a generalized suffix array (or GSA), a suffix array that contains all suffixes for a set of strings (for example, <math>S=S_1,S_2,S_3,...,S_k</math> and is lexicographically sorted with all suffixes of each string.{{sfn|Shi|1996}}
Line 189 ⟶ 190:
<syntaxhighlight lang="python">
n = len(S)
 
def search(P: str) -> Tupletuple[int, int]:
"""
Return indices (s, r) such that the interval A[s:r] (including the end
Line 216 ⟶ 218:
return (s, r)
</syntaxhighlight>
Finding the substring pattern <math>P</math> of length <math>m</math> in the string <math>S</math> of length <math>n</math> takes <math>\mathcal{O}(m \log n)</math> time, given that a single suffix comparison needs to compare <math>m</math> characters. {{harvtxt|Manber|Myers|1990}} describe how this bound can be improved to <math>\mathcal{O}(m + \log n)</math> time using [[LCP array|LCP]] information. The idea is that a pattern comparison does not need to re-compare certain characters, when it is already known that these are part of the longest common prefix of the pattern and the current search interval. {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} improve the bound even further and achieve a search time of <math>\mathcal{O}(m)</math> for constant alphabet size, as known from [[suffix tree]]s.
 
Suffix sorting algorithms can be used to compute the [[Burrows–Wheeler transform|Burrows–Wheeler transform (BWT)]]. The [[Burrows–Wheeler transform|BWT]] requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other character (i.e., $), then the order of the sorted rotated [[Burrows–Wheeler transform|BWT]] matrix corresponds to the order of suffixes in a suffix array. The [[Burrows–Wheeler transform|BWT]] can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the [[Burrows–Wheeler transform|BWT]] string: <math>BWT[i] = S[A[i]-1]</math>.
Line 224 ⟶ 226:
Many additional applications of the suffix array require the [[LCP array]]. Some of these are detailed in the [[LCP array#Applications|application section]] of the latter.
 
== Enhanced Suffixsuffix Arraysarrays ==
Suffix trees are powerful data structures that have wide application in areas of pattern and string matching, indexing and textual statistics. However, it occupies a significant amount of space and thus has a drawback in many real-time applications that requiresrequire processing a considerably large amount of data like genome analysis. To overcome this drawback, Enhanced Suffix Arrays were developed that are data structures consisting of suffix arrays and an additional table called the child table that contains the information about the parent-child relationship between the nodes in the suffix tree. The node branching data structure for this tree is a linked list. Enhanced suffix treesarrays are superior in terms of both space efficiency and time complexity and are easy to implement. Moreover, they can be applied to any algorithm that uses a suffix tree by using an abstract concept lcp-interval trees. The time complexity for searching a pattern in an enhanced suffix array is O(m|Σ|).
 
The suffix array of the string is an array of n integers in the range of 0 to n that represents the n+1 suffixes of the string including the special character #.
Line 236 ⟶ 238:
== Constructing the lcp-interval ==
For a suffix array of S, the lcp-interval associated with the corresponding node of suffix tree of S can be defined as:
{{quote|1=
''Interval [i,..j], 0 ≤ i ≤ j ≤ n is an lcp-interval of lcp-value, if''
 
''1.# lcptab[i] < l,''
''Interval [i,..j], 0 ≤ i ≤ j ≤ n is an lcp-interval of lcp-value, if''
''2.# lcptab[k] ≥ l for all i + 1 ≤ k ≤ j,''
 
''3.# lcptab[k] = l for some i + 1 ≤ k ≤ j if i ≠ j and l = n − i + 1 if i = j,''
''1. lcptab[i] < l,''
''4.# lcptab[j + 1] < l.''
 
}}
''2. lcptab[k] ≥ l for all i + 1 ≤ k ≤ j,''
 
''3. lcptab[k] = l for some i + 1 ≤ k ≤ j if i ≠ j and l = n − i + 1 if i = j,''
 
''4. lcptab[j + 1] < l.''
 
The length of the longest common prefix of pos[i − 1] and pos[i] is stored in lcp[i],where 2 ≤ i ≤ n. The lcp-interval portrays the same parent-child relationship as that among the associated nodes in the suffix tree of S.This shows that if the corresponding node of [i..j] is a child of the corresponding node of [k..l], a lcp-interval [i..j] is a child interval of another lcp-interval [k..l]. If [k..l] is a child interval of [i..j], a lcp-interval [i..j] is the parent interval of a lcp-interval [k..l].
 
== Constructing a Childchild Tabletable ==
The child table ''cldtab'' is composed of three n arrays, ''up'', ''down'' and ''nextlIndex''. The information about the edges of the corresponding suffix tree is stored inand maintained by the ''up'' and ''down'' arrayarrays. The ''nextlIndexarraynextlIndex'' array stores the links in the linked list used for node branching the suffix tree.
 
The ''up'', ''down'' and ''nextlIndex'' array are defined as follows:
 
# The element ''up[i]'' records the starting index of the longest lcp-second interval’s child interval, which ends at index ''i-1''.
# The initial index of the second child interval of the longest lcp-interval, starting at index ''i'' is stored in the element ''down[i]''.
# If and only if the interval is neither the first child nor the final child of its parent, the element ''nextlIndex[i]'' contains the first index of the next sibling interval of the longest lcp-interval, starting at index ''i''.
Line 260 ⟶ 259:
By performing a bottom-up traversal of the lcp-interval of the tree, the child table can be constructed in linear time. The ''up/down'' values and the ''nextlIndex'' values can be computed separately by using two distinct algorithms.
 
== Constructing a Suffixsuffix Linklink Tabletable ==
The suffix links for an enhanced suffix array can be computed by generating the suffix link interval [''1,..,r''] for each [i,..j] interval during the preprocessing. The left and right elements l and r of the interval are maintained in the first index of [i,..,j]. The table for this interval ranges from 0 to n. The suffix link table is constructed by the left-to-right breadth-first traversal of the lcp-interval tree. Every time an ''l''-interval is computed, it is added to the list of l-intervals, which is referred to as the l-list. When the lcp-value > 0, for every ''l''-interval[i,..,j] in the list, link[i] is calculated. The interval [''l'',..,''r''] is computed by a binary search in(''l''-1)-list, where ''l'' is the largest left boundary amongst all the ''l''-1 intervals. The suffix link interval of [i,..j] is represented by this interval[''l,..,r'']. The values ''l'' and ''r'' are ultimately stored in the first index of [i,..,j].
 
Line 273 ⟶ 272:
| pages = 319–327
| url = http://dl.acm.org/citation.cfm?id=320176.320218
| last1 = Manber | first1 = Udi | author1-link = Udi_ManberUdi Manber
| last2 = Myers | first2 = Gene | author2-link = Gene_MyersGene Myers
}}
* {{cite journal
Line 285 ⟶ 284:
| doi = 10.1137/0222058
| url = http://dl.acm.org/citation.cfm?id=320176.320218
| last1 = Manber | first1 = Udi | author1-link = Udi_ManberUdi Manber
| last2 = Myers | first2 = Gene | s2cid = 5074629
| author2-link = Gene_MyersGene Myers
}}
* {{Cite journal|last1=Gawrychowski|first1=Paweł|last2=Kociumaka|first2=Tomasz|date=January 2017 |title=Sparse Suffix Tree Construction in Optimal Time and Space|journal=Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms|___location=Philadelphia, PA|publisher=Society for Industrial and Applied Mathematics|pages=425–439|isbn=9781611974782|arxiv=1608.00865|doi=10.1137/1.9781611974782.27 |s2cid=6608776}}
* {{cite conference |last1=Li|first1=Zhize|last2=Li|first2=Jian|last3=Huo|first3=Hongwei
|doi = 10.1007/978-3-030-00479-8_22
Line 302 ⟶ 300:
|pages = 268–284
}}
* {{cite conference|last1=Shi|first1=Fei|title=Concurrency and Parallelism, Programming, Networking, and Security |chapter=Suffix arrays for multiple strings: A method for on-line multiple string searches|date=1996|series=Lecture Notes in Computer Science|volume=1179 |publisher=Springer Berlin Heidelberg |pages=11–22|doi=10.1007/BFb0027775|isbn=978-3-540-62031-0}}
* {{cite conference
| doi = 10.1007/3-540-45784-4_35|title=The Enhanced Suffix Array and Its Applications to Genome Analysis |conference=Algorithms in Bioinformatics|series=[[Lecture Notes in Computer Science]]|year=2002 |last1=Abouelhoda|first1=Mohamed Ibrahim|last2=Kurtz|first2=Stefan|last3=Ohlebusch|first3=Enno|isbn=978-3-540-44211-0|volume=2452}}
Line 322 ⟶ 320:
| year = 1992
| journal = Information Retrieval: Data Structures and Algorithms
| last1 = Gonnet | first1 = G.H.
| last2 = Baeza-Yates | first2 = R.A.
| last3 = Snider | first3 = T.
| url = http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap05.htm
}}
Line 346 ⟶ 344:
}}
* {{cite conference
| doi = 10.1007/978-3-642-22300-6_32|title=Inducing the LCP-Array|conference=Algorithms and Data Structures|series=Lecture Notes in Computer Science|year=2011|last1=Fischer|first1=Johannes|isbn=978-3-642-22299-3|volume=6844|pages=374374–385
|arxiv=1101.3448}}
* {{cite journal
Line 352 ⟶ 350:
|doi-access=free}}
* {{cite conference
| doi = 10.1007/3-540-44888-8_5|title=Fast Lightweight Suffix Array Construction and Checking|conference=Combinatorial Pattern Matching|series=Lecture Notes in Computer Science|year=2003|last1=Burkhardt|first1=Stefan|last2=Kärkkäinen|first2=Juha|isbn=978-3-540-40311-1|volume=2676|pages=5555–69
}}
* {{cite conference
| doi=10.1145/800152.804905|title=Rapid identification of repeated patterns in strings, trees and arrays |conference=Proceedings of the fourth annual ACM symposium on Theory of computing - STOC '72|year=1972 |last1=Karp|first1=Richard M.|last2=Miller|first2=Raymond E.|last3=Rosenberg|first3=Arnold L.|pages=125125–136
}}
* {{cite conference
| doi = 10.1109/SFCS.1997.646102|title=Optimal suffix tree construction with large alphabets |conference=Proceedings 38th Annual Symposium on Foundations of Computer Science|year=1997 |last1=Farach|first1=M.|isbn=0-8186-8197-7
}}
* {{cite conference|last1=I|first1=Tomohiro|last2=Kärkkäinen|first2=Juha|last3=Kempa|first3=Dominik|title=Faster Sparse Suffix Sorting |date=2014|series=Leibniz International Proceedings in Informatics (LIPIcs) |volume=25 |pages=386-396386–396 |publisher=Schloss Dagstuhl – Leibniz-Zentrum fuer Informatik |doi=10.4230/LIPIcs.STACS.2014.386 |doi-access=free |isbn=978-3-939897-65-1
}}
* {{cite conference
Line 369 ⟶ 367:
* {{cite journal
| doi = 10.1016/j.parco.2007.06.004|title=Scalable parallel suffix array construction|year=2007 |last1=Kulla|first1=Fabian|last2=Sanders|first2=Peter|author2-link=Peter Sanders (computer scientist) |journal=Parallel Computing |volume=33|issue=9
|pages=605–612 }}
}}
*Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. "Replacing suffix trees with enhanced suffix arrays." ''Journal of Discrete Algorithms'', 2(1):53–86, 2004.
*Dong Kyue Kim, Jeong Eun Jeon, and Heejin Park. "An efficient index data structure with the capabilities of suffix trees and suffix arrays for alphabets of non-negligible size." ''String Processing and Information Retrieval Lecture Notes in Computer Science'', page138–149, 2004.
Line 375 ⟶ 373:
==External links==
{{Commons category}}
* [http://algs4.cs.princeton.edu/63suffix/SuffixArray.java.html Suffix Array in Java]
* [http://code.google.com/p/compression-code/downloads/list Suffix sorting module for BWT in C code]
* [http://www.codeodor.com/index.cfm/2007/12/24/The-Suffix-Array/1845 Suffix Array Implementation in Ruby]
* [httphttps://sary.sourceforge.net/index.html.en Suffix array library and tools]
* [http://pizzachili.dcc.uchile.cl/ Project containing various Suffix Array c/c++ Implementations with a unified interface]
* [httphttps://code.googlegithub.com/py-256/libdivsufsort/ A fast, lightweight, and robust C API library to construct the suffix array]
* [http://code.google.com/p/pysuffix/ Suffix Array implementation in Python]
* [http://www.geeksforgeeks.org/suffix-tree-application-4-build-linear-time-suffix-array/ Linear Time Suffix Array implementation in C using suffix tree]
 
{{Strings}}
 
[[Category:Articles with example Python (programming language) code]]
[[Category:Arrays]]
[[Category:Substring indices]]