Revision as of 17:42, 31 May 2005 edit 204.108.8.5 (talk) Forgot the 3/racadabra in the sorted list ← Previous edit		Revision as of 21:56, 21 July 2005 edit undo Nils Grimsmo (talk \| contribs) 567 edits Added more "to the point" introduction. Example formatting. You search for something that _might_ be a substring. Added refecences. Next edit →
Line 1: A '''suffix array''' ~~(invented~~is byan [[~~Gene~~index ~~Myers~~(information technology)\|index structure]] ~~and~~for locating [[~~Udi Manber~~substring]]s in a larger [[string (computer science)\|string]]. It is an ~~indexing~~array ~~structure~~of indexes giving the [[lexicographical order]] of the [[suffix]]es. A suffix array for ~~locating~~a ~~substrings~~string ''S'' of length ''n'' can be built in Θ(''n'') time, and lets you find all ''occ'' occurrences of a ~~larger~~pattern ~~string~~''P'' of length ''m'' in ''S'' in O(''m'' + log ''n'' + occ) time. Suffix ~~Originally~~arrays where originally developed by [[Gene Myers]] and [[Udi Manber]] to reduce memory consumption compared to a [[suffix tree]],. ~~this structure~~This began the trend towards compressed suffix arrays and [[BWT]]-based compressed full-text indices. The ''~~kth suffix~~k''th suffix of a string ''S'' is ''S'' with the first ''k''-1 characters removed, for some starting position ''k''. A string of ''n'' characters has ''n'' suffixes, denoted by their starting positions 1..''n''. For instance, the string "abracadabra" has the following suffixes: ~~For~~ ~~instance,~~ ~~the~~1 ~~string~~ "abracadabra~~" has the following set of suffixes:~~ 2 bracadabra▼ 3 racadabra▼ 4 acadabra▼ 5 cadabra▼ 6 adabra▼ 7 dabra▼ 8 abra▼ 9 bra▼ 10 ra▼ 11 a▼ The first step in building a suffix array is to sort the suffixes lexicographically, ~~yielding~~giving:▼ ~~<code><pre>~~ 1 abracadabra▼ ▲2 bracadabra ▲3 racadabra ▲4 acadabra ▲5 cadabra ▲6 adabra ▲7 dabra ▲8 abra ▲9 bra ▲10 ra ▲11 a ~~</pre></code>~~ 1 11 a ▲The first step in building a suffix array is to sort the suffixes lexicographically, yielding: 2 8 abra▼ ▲ 3 1 abracadabra 4 4 acadabra▼ 5 6 adabra▼ 6 9 bra▼ 7 2 bracadabra▼ 8 5 cadabra▼ 9 7 dabra▼ 10 10 ra▼ 11 3 racadabra▼ To find where a given pattern ''P'' is a substring of ''S'', two binary searches are used to find the range of suffixes [[prefix]]ed by ''P''. The output of the search is the list of starting positions in this range. A search for "br" in the given example would give a left border 6 and a right border 7, giving suffixes 9 and 2. This means that "br" is a substring of "abracadabra" both at position 2 and 9. If implemented straight forward, this binary search takes O(''m'' log ''n'') time, as most of the pattern P is compared at every step in the binary search in the wost case. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCP's) of suffixes are constructed, giving O(''m'' + log ''n'') search time. ~~<code><pre>~~ ~~11 a~~ ▲8 abra ~~1 abracadabra~~ ▲4 acadabra ▲6 adabra ▲9 bra ▲2 bracadabra ▲5 cadabra ▲7 dabra ▲10 ra ▲3 racadabra ~~</pre></code>~~ The key insight of the suffix array is to denote each suffix by its starting position only (the ~~left~~second column above). The ~~resultant~~resulting array of numbers, combined with the original string, is a compact representation of the sorted suffix list, consuming one character and one integer for each character in the string.▼ To find a given substring X, two binary searches are used to find the range of suffixes prefixed by X. To gain efficiency, the string comparisons in the binary searches are modified to take advantage of longest common prefixes (LCP's) as the search narrows. The output of the search is the list of starting positions for the suffixes prefixed by X. There are many suffix array construction algorithms, with different properties. Some O(''n''<sup>2</sup>) construction algorithms are faster than the Θ(n) algorithms in practise. ▲The key insight of the suffix array is to denote each suffix by its starting position only (the left column above). The resultant array of numbers, combined with the original string, is a compact representation of the sorted suffix list, consuming one character and one integer for each character in the string. ==References== In 2004, Ko et al presented an algorithm for constructing a suffix array ''de novo'' in time proportional to the length of the source string. Suffix arrays may be converted to and from [[Suffix tree]]s in linear time as well. * Udi Manber and Gene Myers (1991). "Suffix arrays: a new method for on-line string searches". ''SIAM Journal on Computing'', Volume 22, Issue 5 (October 1993), pp. 935-948. * Juha Kärkkäinen and Peter Sanders (2003). "Simple linear work suffix array construction." In ''Proc. 30th International Colloquium on Automata, Languages and Programming (ICALP '03)''. LNCS 2719, Springer, 2003, pp. 943-955. * Klaus-Bernd Schürmann and Jens Stoye (2005). "An incomplex algorithm for fast suffix array construction". In ''Proceedings of ALENEX'', 2005. [[Category:Data structure]] [[Category:Algorithms on strings]]

Suffix array: Difference between revisions