Suffix array: Difference between revisions

Content deleted Content added
s/wost/worst/
Rewrite for clarity; remove "context" tag
Line 1:
In [[computer science]], a '''suffix array''' is an [[array]] giving the [[suffix (computer science)|suffixes]] of a [[string (computer science)|string]] in [[lexicographical order]].
{{context}}
A '''suffix array''' is an [[index (information technology)|index structure]] for locating [[substring]]s in a larger [[string (computer science)|string]]. It is an array of indexes giving the [[lexicographical order]] of the [[suffix]]es. A suffix array for a string ''S'' of length ''n'' can be built in Θ(''n'') time, and lets you find all ''occ'' occurrences of a pattern ''P'' of length ''m'' in ''S'' in O(''m'' + log ''n'' + occ) time. Suffix arrays were originally developed by [[Gene Myers]] and [[Udi Manber]] to reduce memory consumption compared to a [[suffix tree]]. This began the trend towards compressed suffix arrays and [[BWT]]-based compressed full-text indices.
 
==Details==
The ''k''th suffix of a string ''S'' is ''S'' with the first ''k''-1 characters removed, for some starting position ''k''. A string of ''n'' characters has ''n'' suffixes, denoted by their starting positions 1..''n''. For instance, the string "abracadabra" has the following suffixes:
 
Consider the string "abracadabra", of length 11. It has eleven suffixes: "abracadabra", "bracadabra", "racadabra", and so on down to "a". Sorted into lexicographical order, these suffixes are
1 abracadabra
2 bracadabra
3 racadabra
4 acadabra
5 cadabra
6 adabra
7 dabra
8 abra
9 bra
10 ra
11 a
 
a
The first step in building a suffix array is to sort the suffixes lexicographically, giving:
8 abra
1 abracadabra
4 acadabra
6 adabra
bra
2 bracadabra
5 cadabra
7 dabra
10 ra
3 racadabra
 
If the original string is available, each suffix can be completely specified by the [[zero-based]] or [[one-based]] index of its first character. The suffix array is the array of these indices in lexicographical order. For the string "abracadabra", using one-based indexing, the suffix array is {11,8,1,4,6,9,2,5,7,10,3}, because the suffix "a" begins at the 11th character, "abra" begins at the 8th character, and so forth.
1 11 a
2 8 abra
3 1 abracadabra
4 4 acadabra
5 6 adabra
6 9 bra
7 2 bracadabra
8 5 cadabra
9 7 dabra
10 10 ra
11 3 racadabra
 
One may also treat the string "abracadabra" as having a twelfth suffix, the empty string (with index 12). But since the empty string will always sort lexicographically before every other suffix, we lose no information by omitting it.
To find where a given pattern ''P'' is a substring of ''S'', two [[Binary_search_algorithm|binary searches]] are used to find the range of suffixes [[Prefix (linguistics)|prefix]]ed by ''P''. The output of the search is the list of starting positions in this range. A search for "br" in the given example would give a left border 6 and a right border 7, giving suffixes 9 and 2. This means that "br" is a substring of "abracadabra" both at position 2 and 9. If implemented straightforward, this binary search takes O(''m'' log ''n'') time, as most of the pattern P is compared at every step in the binary search in the worst case. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCP's) of suffixes are constructed, giving O(''m'' + log ''n'') search time.
 
==Algorithms==
The key insight of the suffix array is to denote each suffix by its starting position only (the second column above). The resulting array of numbers, combined with the original string, is a compact representation of the sorted suffix list, consuming one character and one integer for each character in the string.
 
The easiest way to construct a suffix array is to use an efficient [[comparison sort]] algorithm. This requires <math>O(n \log n)</math> suffix comparisons, but a suffix comparison requires <math>O(n)</math> time, so the overall runtime of this approach is <math>O(n^2 \log n)</math>. More sophisticated algorithms improve this to <math>O(n \log n)</math> by exploiting the results of partial sorts to avoid redundant comparisons. Several <math>\Theta(n)</math> algorithms are known, but are rarely used in practice because of complexity, high memory requirements, and inefficiency (high constant factors).
There are many suffix array construction algorithms, with different properties. Some O(''n''<sup>2</sup>) construction algorithms are faster than the &Theta;(n) algorithms in practise.
 
==Applications==
 
The suffix array of a string can be used as an [[index (information technology)|index]] to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a [[binary search]]. If implemented straightforwardly, this binary search takes <math>O(m \log n)</math> time, where <math>m</math> is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving <math>O(m + \log n)</math> search time.
 
Suffix sorting algorithms can be used to perform the [[Burrows-Wheeler transform]] (BWT). Technically the BWT requires sorting [[cyclic permutation]]s of a string, not suffixes. We can fix this by appending to the string a special end-of-string character which sorts lexicographically before every other character. Sorting cyclic permutations is then equivalent to sorting suffixes.
 
==History==
 
Suffix arrays were originally developed by [[Gene Myers]] and [[Udi Manber]] to reduce memory consumption compared to a [[suffix tree]]. This began the trend towards compressed suffix arrays and [[BWT]]-based compressed full-text indices.
 
==References==