Content deleted Content added
MOS:HEAD |
m →External links: HTTP to HTTPS for SourceForge |
||
(12 intermediate revisions by 9 users not shown) | |||
Line 1:
{{Short description|Data structure for a string}}
{{Infobox
| above = Suffix array
Line 26 ⟶ 27:
{{harvtxt|Li|Li|Huo|2016}} gave the first in-place <math>\mathcal{O}(n)</math> time suffix array construction algorithm that is optimal both in time and space, where ''in-place'' means that the algorithm only needs <math>\mathcal{O}(1)</math> additional space beyond the input string and the output suffix array.
== Definition ==
Line 162 ⟶ 163:
* fast in practice
One of the first algorithms to achieve all goals is the SA-IS algorithm of {{harvtxt|Nong|Zhang|Chan|2009}}. The algorithm is also rather simple (< 100 [[Source lines of code|LOC]]) and can be enhanced to simultaneously construct the [[LCP array]].{{sfn|Fischer|2011}} The SA-IS algorithm is one of the fastest known suffix array construction algorithms. A careful
Beside time and space requirements, suffix array construction algorithms are also differentiated by their supported [[Alphabet (computer science)|alphabet]]: ''constant alphabets'' where the alphabet size is bound by a constant, ''integer alphabets'' where characters are integers in a range depending on <math>n</math> and ''general alphabets'' where only character comparisons are allowed.{{sfn|Burkhardt|Kärkkäinen|2003}}
Line 175 ⟶ 176:
Recent work by {{harvtxt|Salson|Lecroq|Léonard|Mouchard|2010}} proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is <math>\mathcal{O}(n \log n)</math>, it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text.
In practical [[open source]] work, a commonly used routine for suffix array construction was qsufsort, based on the 1999 Larsson-Sadakane algorithm.<ref>{{cite journal |last1=Larsson |first1=N. Jesper |last2=Sadakane |first2=Kunihiko |title=Faster suffix sorting |journal=Theoretical Computer Science |date=22 November 2007 |volume=387 |issue=3 |pages=258–272 |doi=10.1016/j.tcs.2007.07.017 |language=en |issn=0304-3975|doi-access=free }}</ref> This routine has been superseded by Yuta Mori's DivSufSort, "the fastest known suffix sorting algorithm in main memory" as of 2017. It too can be modified to compute an LCP array. It uses a induced copying combined with Itoh-Tanaka.<ref>{{cite journal |last1=Fischer |first1=Johannes |last2=Kurpicz |first2=Florian |title=Dismantling DivSufSort |arxiv=1710.01896 |date=5 October 2017 |journal=Proceedings of the Prague Stringology Conference 2017}}</ref> In 2021 a faster implementation of the algorithm was presented by Ilya Grebnov <ref>{{Cite web|title=New saca and bwt library (libsais)|url=https://encode.su/threads/3579-New-saca-and-bwt-library-(libsais)|access-date=2021-10-03|website=encode.su}}</ref> which in average showed 65% performance improvement over DivSufSort implementation on the [[Silesia
== Generalized suffix array ==
Line 189 ⟶ 190:
<syntaxhighlight lang="python">
n = len(S)
def search(P: str) ->
"""
Return indices (s, r) such that the interval A[s:r] (including the end
Line 236 ⟶ 238:
== Constructing the lcp-interval ==
For a suffix array of S, the lcp-interval associated with the corresponding node of suffix tree of S can be defined as:
{{quote|1=
▲''Interval [i,..j], 0 ≤ i ≤ j ≤ n is an lcp-interval of lcp-value, if''
▲''1. lcptab[i] < l,''
}}
▲''2. lcptab[k] ≥ l for all i + 1 ≤ k ≤ j,''
▲''3. lcptab[k] = l for some i + 1 ≤ k ≤ j if i ≠ j and l = n − i + 1 if i = j,''
▲''4. lcptab[j + 1] < l.''
The length of the longest common prefix of pos[i − 1] and pos[i] is stored in lcp[i],where 2 ≤ i ≤ n. The lcp-interval portrays the same parent-child relationship as that among the associated nodes in the suffix tree of S.This shows that if the corresponding node of [i..j] is a child of the corresponding node of [k..l], a lcp-interval [i..j] is a child interval of another lcp-interval [k..l]. If [k..l] is a child interval of [i..j], a lcp-interval [i..j] is the parent interval of a lcp-interval [k..l].
Line 273 ⟶ 272:
| pages = 319–327
| url = http://dl.acm.org/citation.cfm?id=320176.320218
| last1 = Manber | first1 = Udi | author1-link =
| last2 = Myers | first2 = Gene | author2-link =
}}
* {{cite journal
Line 285 ⟶ 284:
| doi = 10.1137/0222058
| url = http://dl.acm.org/citation.cfm?id=320176.320218
| last1 = Manber | first1 = Udi | author1-link =
| last2 = Myers | first2 = Gene | s2cid = 5074629
| author2-link =
}}
* {{cite conference |last1=Li|first1=Zhize|last2=Li|first2=Jian|last3=Huo|first3=Hongwei
|doi = 10.1007/978-3-030-00479-8_22
Line 360 ⟶ 358:
| doi = 10.1109/SFCS.1997.646102|title=Optimal suffix tree construction with large alphabets |conference=Proceedings 38th Annual Symposium on Foundations of Computer Science|year=1997 |last1=Farach|first1=M.|isbn=0-8186-8197-7
}}
* {{cite conference|last1=I|first1=Tomohiro|last2=Kärkkäinen|first2=Juha|last3=Kempa|first3=Dominik|title=Faster Sparse Suffix Sorting |date=2014|series=Leibniz International Proceedings in Informatics (LIPIcs) |volume=25 |pages=386–396 |publisher=Schloss Dagstuhl – Leibniz-Zentrum fuer Informatik |doi=10.4230/LIPIcs.STACS.2014.386 |doi-access=free |isbn=978-3-939897-65-1
}}
* {{cite conference
Line 375 ⟶ 373:
==External links==
{{Commons category}}
* [http://algs4.cs.princeton.edu/63suffix/SuffixArray.java.html Suffix Array in Java]
* [http://code.google.com/p/compression-code/downloads/list Suffix sorting module for BWT in C code]
* [http://www.codeodor.com/index.cfm/2007/12/24/The-Suffix-Array/1845 Suffix Array Implementation in Ruby]
* [
* [http://pizzachili.dcc.uchile.cl/ Project containing various Suffix Array c/c++ Implementations with a unified interface]
* [https://github.com/y-256/libdivsufsort A fast, lightweight, and robust C API library to construct the suffix array]
* [http://code.google.com/p/pysuffix/ Suffix Array implementation in Python]
* [http://www.geeksforgeeks.org/suffix-tree-application-4-build-linear-time-suffix-array/ Linear Time Suffix Array implementation in C using suffix tree]
{{Strings}}
[[Category:Articles with example Python (programming language) code]]
[[Category:Arrays]]
[[Category:Substring indices]]
|