Suffix array: Difference between revisions

Content deleted Content added
Applications: important clarification on alphabet size
MOS:HEAD
Line 177:
In practical [[open source]] work, a commonly used routine for suffix array construction was qsufsort, based on the 1999 Larsson-Sadakane algorithm.<ref>{{cite journal |last1=Larsson |first1=N. Jesper |last2=Sadakane |first2=Kunihiko |title=Faster suffix sorting |journal=Theoretical Computer Science |date=22 November 2007 |volume=387 |issue=3 |pages=258–272 |doi=10.1016/j.tcs.2007.07.017 |language=en |issn=0304-3975|doi-access=free }}</ref> This routine has been superseded by Yuta Mori's DivSufSort, "the fastest known suffix sorting algorithm in main memory" as of 2017. It too can be modified to compute an LCP array. It uses a induced copying combined with Itoh-Tanaka.<ref>{{cite journal |last1=Fischer |first1=Johannes |last2=Kurpicz |first2=Florian |title=Dismantling DivSufSort |arxiv=1710.01896 |date=5 October 2017 |journal=Proceedings of the Prague Stringology Conference 2017}}</ref> In 2021 a faster implementation of the algorithm was presented by Ilya Grebnov <ref>{{Cite web|title=New saca and bwt library (libsais)|url=https://encode.su/threads/3579-New-saca-and-bwt-library-(libsais)|access-date=2021-10-03|website=encode.su}}</ref> which in average showed 65% performance improvement over DivSufSort implementation on Silesia Corpus.<ref>{{Citation|last=Grebnov|first=Ilya|title=libsais|date=2021-09-22|url=https://github.com/IlyaGrebnov/libsais|access-date=2021-10-02}}</ref>
 
== Generalized Suffixsuffix Arrayarray ==
 
The concept of a suffix array can be extended to more than one string. This is called a generalized suffix array (or GSA), a suffix array that contains all suffixes for a set of strings (for example, <math>S=S_1,S_2,S_3,...,S_k</math> and is lexicographically sorted with all suffixes of each string.{{sfn|Shi|1996}}
Line 224:
Many additional applications of the suffix array require the [[LCP array]]. Some of these are detailed in the [[LCP array#Applications|application section]] of the latter.
 
== Enhanced Suffixsuffix Arraysarrays ==
Suffix trees are powerful data structures that have wide application in areas of pattern and string matching, indexing and textual statistics. However, it occupies a significant amount of space and thus has a drawback in many real-time applications that require processing a considerably large amount of data like genome analysis. To overcome this drawback, Enhanced Suffix Arrays were developed that are data structures consisting of suffix arrays and an additional table called the child table that contains the information about the parent-child relationship between the nodes in the suffix tree. The node branching data structure for this tree is a linked list. Enhanced suffix arrays are superior in terms of both space efficiency and time complexity and are easy to implement. Moreover, they can be applied to any algorithm that uses a suffix tree by using an abstract concept lcp-interval trees. The time complexity for searching a pattern in an enhanced suffix array is O(m|Σ|).
 
Line 249:
The length of the longest common prefix of pos[i − 1] and pos[i] is stored in lcp[i],where 2 ≤ i ≤ n. The lcp-interval portrays the same parent-child relationship as that among the associated nodes in the suffix tree of S.This shows that if the corresponding node of [i..j] is a child of the corresponding node of [k..l], a lcp-interval [i..j] is a child interval of another lcp-interval [k..l]. If [k..l] is a child interval of [i..j], a lcp-interval [i..j] is the parent interval of a lcp-interval [k..l].
 
== Constructing a Childchild Tabletable ==
The child table ''cldtab'' is composed of three n arrays, ''up'', ''down'' and ''nextlIndex''. The information about the edges of the corresponding suffix tree is stored and maintained by the ''up'' and ''down'' arrays. The ''nextlIndex'' array stores the links in the linked list used for node branching the suffix tree.
 
Line 260:
By performing a bottom-up traversal of the lcp-interval of the tree, the child table can be constructed in linear time. The ''up/down'' values and the ''nextlIndex'' values can be computed separately by using two distinct algorithms.
 
== Constructing a Suffixsuffix Linklink Tabletable ==
The suffix links for an enhanced suffix array can be computed by generating the suffix link interval [''1,..,r''] for each [i,..j] interval during the preprocessing. The left and right elements l and r of the interval are maintained in the first index of [i,..,j]. The table for this interval ranges from 0 to n. The suffix link table is constructed by the left-to-right breadth-first traversal of the lcp-interval tree. Every time an ''l''-interval is computed, it is added to the list of l-intervals, which is referred to as the l-list. When the lcp-value > 0, for every ''l''-interval[i,..,j] in the list, link[i] is calculated. The interval [''l'',..,''r''] is computed by a binary search in(''l''-1)-list, where ''l'' is the largest left boundary amongst all the ''l''-1 intervals. The suffix link interval of [i,..j] is represented by this interval[''l,..,r'']. The values ''l'' and ''r'' are ultimately stored in the first index of [i,..,j].