Suffix array: Difference between revisions

Content deleted Content added
Citation bot (talk | contribs)
Alter: author1-link, author2-link. Added doi-access. | Use this bot. Report bugs. | Suggested by Headbomb | Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox2 | #UCB_webform_linked 85/188
Bender the Bot (talk | contribs)
m External links: HTTP to HTTPS for SourceForge
 
(7 intermediate revisions by 4 users not shown)
Line 28:
 
Enhanced suffix arrays (ESAs) are suffix arrays with additional tables that reproduce the full functionality of suffix trees preserving the same time and memory complexity.{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}}
TheA suffixsorted array forof aonly subsetsome of(rather than all) suffixes of a string is called a [[sparse suffix array]].{{sfn|I|Kärkkäinen|Kempa|2014}} Multiple probabilistic algorithms have been developed to minimize the additional memory usage including an optimal time and memory algorithm.{{sfn|Gawrychowski|Kociumaka|2017}}
 
== Definition ==
Line 176:
Recent work by {{harvtxt|Salson|Lecroq|Léonard|Mouchard|2010}} proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is <math>\mathcal{O}(n \log n)</math>, it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text.
 
In practical [[open source]] work, a commonly used routine for suffix array construction was qsufsort, based on the 1999 Larsson-Sadakane algorithm.<ref>{{cite journal |last1=Larsson |first1=N. Jesper |last2=Sadakane |first2=Kunihiko |title=Faster suffix sorting |journal=Theoretical Computer Science |date=22 November 2007 |volume=387 |issue=3 |pages=258–272 |doi=10.1016/j.tcs.2007.07.017 |language=en |issn=0304-3975|doi-access=free }}</ref> This routine has been superseded by Yuta Mori's DivSufSort, "the fastest known suffix sorting algorithm in main memory" as of 2017. It too can be modified to compute an LCP array. It uses a induced copying combined with Itoh-Tanaka.<ref>{{cite journal |last1=Fischer |first1=Johannes |last2=Kurpicz |first2=Florian |title=Dismantling DivSufSort |arxiv=1710.01896 |date=5 October 2017 |journal=Proceedings of the Prague Stringology Conference 2017}}</ref> In 2021 a faster implementation of the algorithm was presented by Ilya Grebnov <ref>{{Cite web|title=New saca and bwt library (libsais)|url=https://encode.su/threads/3579-New-saca-and-bwt-library-(libsais)|access-date=2021-10-03|website=encode.su}}</ref> which in average showed 65% performance improvement over DivSufSort implementation on the [[Silesia Corpuscorpus]].<ref>{{Citation|last=Grebnov|first=Ilya|title=libsais|date=2021-09-22|url=https://github.com/IlyaGrebnov/libsais|access-date=2021-10-02}}</ref>
 
== Generalized suffix array ==
Line 190:
<syntaxhighlight lang="python">
n = len(S)
 
def search(P: str) -> Tupletuple[int, int]:
"""
Return indices (s, r) such that the interval A[s:r] (including the end
Line 237 ⟶ 238:
== Constructing the lcp-interval ==
For a suffix array of S, the lcp-interval associated with the corresponding node of suffix tree of S can be defined as:
{{quote|1=
''Interval [i,..j], 0 ≤ i ≤ j ≤ n is an lcp-interval of lcp-value, if''
 
''1.# lcptab[i] < l,''
''Interval [i,..j], 0 ≤ i ≤ j ≤ n is an lcp-interval of lcp-value, if''
''2.# lcptab[k] ≥ l for all i + 1 ≤ k ≤ j,''
 
''3.# lcptab[k] = l for some i + 1 ≤ k ≤ j if i ≠ j and l = n − i + 1 if i = j,''
''1. lcptab[i] < l,''
''4.# lcptab[j + 1] < l.''
 
}}
''2. lcptab[k] ≥ l for all i + 1 ≤ k ≤ j,''
 
''3. lcptab[k] = l for some i + 1 ≤ k ≤ j if i ≠ j and l = n − i + 1 if i = j,''
 
''4. lcptab[j + 1] < l.''
 
The length of the longest common prefix of pos[i − 1] and pos[i] is stored in lcp[i],where 2 ≤ i ≤ n. The lcp-interval portrays the same parent-child relationship as that among the associated nodes in the suffix tree of S.This shows that if the corresponding node of [i..j] is a child of the corresponding node of [k..l], a lcp-interval [i..j] is a child interval of another lcp-interval [k..l]. If [k..l] is a child interval of [i..j], a lcp-interval [i..j] is the parent interval of a lcp-interval [k..l].
 
Line 290 ⟶ 288:
| author2-link = Gene Myers
}}
* {{Cite journal|last1=Gawrychowski|first1=Paweł|last2=Kociumaka|first2=Tomasz|date=January 2017 |title=Sparse Suffix Tree Construction in Optimal Time and Space|journal=Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms|___location=Philadelphia, PA|publisher=Society for Industrial and Applied Mathematics|pages=425–439|isbn=9781611974782|arxiv=1608.00865|doi=10.1137/1.9781611974782.27 |s2cid=6608776}}
* {{cite conference |last1=Li|first1=Zhize|last2=Li|first2=Jian|last3=Huo|first3=Hongwei
|doi = 10.1007/978-3-030-00479-8_22
Line 376 ⟶ 373:
==External links==
{{Commons category}}
* [http://algs4.cs.princeton.edu/63suffix/SuffixArray.java.html Suffix Array in Java]
* [http://code.google.com/p/compression-code/downloads/list Suffix sorting module for BWT in C code]
* [http://www.codeodor.com/index.cfm/2007/12/24/The-Suffix-Array/1845 Suffix Array Implementation in Ruby]
* [httphttps://sary.sourceforge.net/index.html.en Suffix array library and tools]
* [http://pizzachili.dcc.uchile.cl/ Project containing various Suffix Array c/c++ Implementations with a unified interface]
* [https://github.com/y-256/libdivsufsort A fast, lightweight, and robust C API library to construct the suffix array]
* [http://code.google.com/p/pysuffix/ Suffix Array implementation in Python]
* [http://www.geeksforgeeks.org/suffix-tree-application-4-build-linear-time-suffix-array/ Linear Time Suffix Array implementation in C using suffix tree]
 
{{Strings}}
 
[[Category:Articles with example Python (programming language) code]]
[[Category:Arrays]]
[[Category:Substring indices]]