LCP array: Difference between revisions

Content deleted Content added
Revised the lead section somewhat
m clean up, typo(s) fixed: Therefore → Therefore, (4), above mentioned → above-mentioned using AWB
Line 136:
 
* Case 1: k < lcp(M,M'), i.e. P has fewer prefix characters in common with M than M has in common with M'. This means the (k+1)-th character of M' is the same as that of M, and since P is lexicographically larger than M, it must be lexicographically larger than M', too. So we continue in the right half (M',...,R).
 
* Case 2: k > lcp(M,M'), i.e. P has more prefix characters in common with M than M has in common with M'. Consequently, if we were to compare P to M', the common prefix would be smaller than k, and M' would be lexicographically larger than P, so, without actually making the comparison, we continue in the left half (M,...,M').
 
* Case 3: k == lcp(M,M'). So M and M' are both identical with P in the first k characters. To decide whether we continue in the left or right half, it suffices to compare P to M' starting from the (k+1)-th character.
 
* We continue recursively.
 
Line 154 ⟶ 151:
 
* It is possible to compute LCP-LR in O(N) time and O(2*N)=O(N) space from LCP.
 
* Using LCP-LR during binary search helps accelerate the search procedure from O(M*log N) to O(M+log N).
 
* You can use two binary searches to determine the left and right end of the match range for P, and the length of the match range corresponds with the number of occurrences for P.
 
Line 175 ⟶ 170:
algorithm is a large space occupancy of <math>13n</math> bytes, while the
original output (text, suffix array, LCP array) only occupies <math>9n</math>
bytes. Therefore, {{harvtxt|Manzini|2004}} created a refined version of the algorithm of {{harvtxt|Kasai|Lee|Arimura|Arikawa|2001}} (lcp9) and reduced the
space occupancy to <math>9n</math> bytes. {{harvtxt|Kärkkäinen|Manzini|Puglisi|2009}} provide another refinement of
Kasai's algorithm (<math>\Phi</math>-algorithm) that improves the running
Line 182 ⟶ 177:
 
{{harvtxt|Gog|Ohlebusch|2011}} provide two algorithms that although being theoretically slow
(<math>O(n^2)</math>) were faster than the above -mentioned algorithms in
practice.
 
Line 198 ⟶ 193:
Deciding if a pattern <math>P</math> of length <math>m</math> is a substring of a string <math>S</math> of length <math>n</math> takes <math> O(m \log n)</math> time if only the suffix array is used. By additionally using the LCP information, this bound can be improved to <math>O(m + \log n)</math> time.{{sfn|Manber|Myers|1993}} {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} show how to improve this running time even further to achieve optimal <math>O(m)</math> time. Thus, using suffix array and LCP array information, the decision query can be answered as fast as using the [[suffix tree]].
 
The LCP array is also an essential part of compressed suffix trees which provide full suffix tree functionality like suffix links and [[lowest common ancestor]] queries.{{sfn|Sadakane|2007}}{{sfn|Fischer|Mäkinen|Navarro|2009}} Furthermore, it can be used together with the suffix array to compute the Lempel-Ziv [[LZ77 and LZ78|LZ77]] factorization in <math>O(n)</math> time. {{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}}{{sfn|Crochemore|Ilie|2008}}{{sfn|Crochemore|Ilie|Smyth|2008}}{{sfn|Chen|Puglisi|Smyth|2008}}
 
The [[longest repeated substring problem]] for a string <math>S</math> of length <math>n</math> can be solved in <math>\Theta(n)</math> time using both the suffix array <math>A</math> and the LCP array. It is sufficient to perform a linear scan through the LCP array in order to find its maximum value <math>v_{max}</math> and the corresponding index <math>i</math> where <math>v_{max}</math> is stored. The longest substring that occurs at least twice is then given by <math>S[A[i],A[i]+v_{max}-1]</math>.
Line 213 ⟶ 208:
We need to distinguish two cases:
* <math>d(v)=H[i+1]</math>: This means that the concatenation of the labels on the root-to-<math>v</math> path equals the longest common prefix of suffixes <math>A[i]</math> and <math>A[i+1]</math>. <br /> In this case, insert <math>A[i+1]</math> as a new leaf <math>x</math> of node <math>v</math> and label the edge <math>(v,x)</math> with <math>S[A[i+1]+H[i+1],n]</math>. Thus the edge label consists of the remaining characters of suffix <math>A[i+1]</math> that are not already represented by the concatenation of the labels of the root-to-<math>v</math> path. <br />This creates the partial suffix tree <math>ST_{i+1}</math>. [[File:Constructing the suffix tree of banana based on the suffix array and the LCP array - Case 2.pdf|thumb|400px|Case 2 (<math>d(v) < H[i+1]</math>): In order to add suffix <math>nana$</math>, the edge to the previously inserted suffix <math>na$</math> has to be split up. The new edge to the new internal node is labeled with the longest common prefix of the suffixes <math>na$</math> and <math>nana$</math>. The edges connecting the two leaves are labeled with the ''remaining'' suffix characters that are not part of the prefix.]]
*<math>d(v) < H[i+1]</math>: This means that the concatenation of the labels on the root-to-<math>v</math> path displays less characters than the longest common prefix of suffixes <math>A[i]</math> and <math>A[i+1]</math> and the ''missing'' characters are contained in the edge label of <math>v</math>'s ''rightmost'' edge. Therefore, we have to ''split up'' that edge as follows: <br />Let <math>w</math> be the child of <math>v</math> on <math>ST_i</math>'s rightmost path.
 
*<math>d(v) < H[i+1]</math>: This means that the concatenation of the labels on the root-to-<math>v</math> path displays less characters than the longest common prefix of suffixes <math>A[i]</math> and <math>A[i+1]</math> and the ''missing'' characters are contained in the edge label of <math>v</math>'s ''rightmost'' edge. Therefore we have to ''split up'' that edge as follows: <br />Let <math>w</math> be the child of <math>v</math> on <math>ST_i</math>'s rightmost path.
# Delete the edge <math>(v,w)</math>.
# Add a new internal node <math>y</math> and a new edge <math>(v,y)</math> with label <math>S[A[i]+d(v),A[i]+H[i+1]-1]</math>. The new label consists of the ''missing'' characters of the longest common prefix of <math>A[i]</math> and <math>A[i+1]</math>. Thus, the concatenation of the labels of the root-to-<math>y</math> path now displays the longest common prefix of <math>A[i]</math> and <math>A[i+1]</math>.
Line 228 ⟶ 222:
The LCP array <math>H</math> only contains the length of the longest common prefix of every pair of consecutive suffixes in the suffix array <math>A</math>. However, with the help of the inverse suffix array <math>A^{-1}</math> (<math> A[i]= j \Leftrightarrow A^{-1}[j]= i </math>, i.e. the suffix <math>S[j,n]</math> that starts at position <math>j</math> in <math>S</math> is stored in position <math>A^{-1}[j]</math> in <math>A</math>) and constant-time [[Range Minimum Query|range minimum queries]] on <math>H</math>, it is possible to determine the length of the longest common prefix of arbitrary suffixes in <math>O(1)</math> time.
 
Because of the lexicographic order of the suffix array, every common prefix of the suffixes <math>S[i,n]</math> and <math>S[j,n]</math> has to be a common prefix of all suffixes between <math>i</math>'s position in the suffix array <math> A^{-1}[i]</math> and <math>j</math>'s position in the suffix array <math> A^{-1}[j] </math>. Therefore, the length of the longest prefix that is shared by ''all'' of these suffixes is the minimum value in the interval <math>H[A^{-1}[i]+1,A^{-1}[j]]</math>. This value can be found in constant time if <math>H</math> is preprocessed for range minimum queries.
 
Thus given a string <math>S</math> of length <math>n</math> and two arbitrary positions <math>i,j</math> in the string <math>S</math> with <math> A^{-1}[i] < A^{-1}[j] </math>, the length of the longest common prefix of the suffixes <math>S[i,n]</math> and <math>S[j,n]</math> can be computed as follows: <math>\operatorname{LCP}(i,j)=H[\operatorname{RMQ}_H(A^{-1}[i]+1,A^{-1}[j])]</math>.