Suffix array: Difference between revisions

Content deleted Content added
m Reverted edits by 2607:9880:4257:FFA8:3D65:F532:BCFB:36C0 (talk) to last version by JCW-CleanerBot
Added Enhanced Suffix Array
Line 223:
 
Many additional applications of the suffix array require the [[LCP array]]. Some of these are detailed in the [[LCP array#Applications|application section]] of the latter.
 
== Enhanced Suffix Arrays ==
Suffix trees are powerful data structures that have wide application in areas of pattern and string matching, indexing and textual statistics. However, it occupies a significant amount of space and thus has a drawback in many real-time applications that requires processing a considerably large amount of data like genome analysis. To overcome this drawback, Enhanced Suffix Arrays were developed that are data structures consisting of suffix arrays and an additional table called the child table that contains the information about the parent-child relationship between the nodes in the suffix tree. The node branching data structure for this tree is a linked list. Enhanced suffix trees are superior in terms of both space efficiency and time complexity and are easy to implement. Moreover, they can be applied to any algorithm that uses a suffix tree by using an abstract concept lcp-interval trees. The time complexity for searching a pattern in an enhanced suffix array is O(m|Σ|).
 
The suffix array of the string is an array of n integers in the range of 0 to n that represents the n+1 suffixes of the string including the special character #.
 
The suffix array is comprised of two arrays:
 
# pos array pos[1,...n]: It represents a sorted list of all S suffixes. Only the initial positions of the suffixes are stored in the array to reduce the space complexity since the suffixes are too large.
# lcp array lcp[1,...n]: It is an array of n integers that maintains the lengths of the longest common prefix of two consecutive suffixes stored in the pos array.
 
== Constructing the lcp-interval ==
For a suffix array of S, the lcp-interval associated with the corresponding node of suffix tree of S can be defined as:
 
''Interval [i,..j], 0 ≤ i ≤ j ≤ n is an lcp-interval of lcp-value, if''
 
''1. lcptab[i] < l,''
 
''2. lcptab[k] ≥ l for all i + 1 ≤ k ≤ j,''
 
''3. lcptab[k] = l for some i + 1 ≤ k ≤ j if i ≠ j and l = n − i + 1 if i = j,''
 
''4. lcptab[j + 1] < l.''
 
The length of the longest common prefix of pos[i − 1] and pos[i] is stored in lcp[i],where 2 ≤ i ≤ n. The lcp-interval portrays the same parent-child relationship as that among the associated nodes in the suffix tree of S.This shows that if the corresponding node of [i..j] is a child of the corresponding node of [k..l], a lcp-interval [i..j] is a child interval of another lcp-interval [k..l]. If [k..l] is a child interval of [i..j], a lcp-interval [i..j] is the parent interval of a lcp-interval [k..l].
 
== Constructing a Child Table ==
The child table ''cldtab'' is composed of three n arrays, ''up'', ''down'' and ''nextlIndex''.The information about the edges of the corresponding suffix tree is stored in maintained by the ''up'' and ''down'' array. The ''nextlIndexarray'' stores the links in the linked list used for node branching the suffix tree.
 
The ''up'', ''down'' and ''nextlIndex'' array are defined as follows:
 
# The element ''up[i]''records the starting index of the longest lcp-second interval’s child interval, which ends at index ''i-1''.
# The initial index of the second child interval of the longest lcp-interval, starting at index ''i'' is stored in the element ''down[i]''.
# If and only if the interval is neither the first child nor the final child of its parent, the element ''nextlIndex[i]'' contains the first index of the next sibling interval of the longest lcp-interval, starting at index ''i''.
 
By performing a bottom-up traversal of the lcp-interval of the tree, the child table can be constructed in linear time. The ''up/down'' values and the ''nextlIndex'' values can be computed separately by using two distinct algorithms.
 
= Constructing a Suffix Link Table =
The suffix links for an enhanced suffix array can be computed by generating the suffix link interval [''1,..,r''] for each [i,..j] interval during the preprocessing. The left and right elements l and r of the interval are maintained in the first index of [i,..,j]. The table for this interval ranges from 0 to n. The suffix link table is constructed by the left-to-right breadth-first traversal of the lcp-interval tree. Every time an ''l''-interval is computed, it is added to the list of l-intervals, which is referred to as the l-list. When the lcp-value > 0, for every ''l''-interval[i,..,j] in the list, link[i] is calculated. The interval [''l'',..,''r''] is computed by a binary search in(''l''-1)-list, where ''l'' is the largest left boundary amongst all the ''l''-1 intervals. The suffix link interval of [i,..j] is represented by this interval[''l,..,r'']. The values ''l'' and ''r'' are ultimately stored in the first index of [i,..,j].
 
== Notes ==
Line 330 ⟶ 369:
| doi = 10.1016/j.parco.2007.06.004|title=Scalable parallel suffix array construction|year=2007 |last1=Kulla|first1=Fabian|last2=Sanders|first2=Peter|author2-link=Peter Sanders (computer scientist) |journal=Parallel Computing |volume=33|issue=9
}}
*Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. "Replacing suffix trees with enhanced suffix arrays." ''Journal of Discrete Algorithms'', 2(1):53–86, 2004.
*Dong Kyue Kim, Jeong Eun Jeon, and Heejin Park. "An efficient index data structure with the capabilities of suffix trees and suffix arrays for alphabets of non-negligible size." ''String Processing and Information Retrieval Lecture Notes in Computer Science'', page138–149, 2004.
 
==External links==