LCP array: Difference between revisions

Content deleted Content added
Revised the lead section somewhat
Dr.Saxena (talk | contribs)
 
(39 intermediate revisions by 29 users not shown)
Line 1:
{{more footnotes needed|date=September 2012}}
{{Infobox
 
| above = LCP array
{| class="infobox" style="width: 22em"
| label1 = [[List of data structures|Type]]
! colspan="3" style="font-size: 125%; text-align: center;" | LCP array
| data1 = [[Array data structure|Array]]
|-
| label2 = Invented by
! [[List of data structures|Type]]
| data2 = {{harvtxt|Manber|Myers|1993}}
| colspan="2" | [[Array data structure|Array]]
| header3 = [[Time complexity]] and {{nowrap|[[space complexity]]}} {{nowrap|in [[big O notation]]}}
|-
| headerstyle = background:lavender
! Invented by
| data4 =
{{!}} colspan="2" {{!}} {{harvtxt|Manber|Myers|1990}}
{{aligned table|cols=3|row1header=y|col1header=y|fullwidth=y
|-
| | Average | Worst case
! colspan="3" class="navbox-abovebelow" | [[Time complexity]] and [[space complexity]]<br />in [[big O notation]]
<!-- -->
|-
| Space
|
| Average
| Worst case
|-
! Space
| <math>\mathcal{O}(n)</math>
| <math>\mathcal{O}(n)</math>
<!-- -->
|-
!| Construction
| <math>\mathcal{O}(n)</math>
| <math>\mathcal{O}(n)</math>
|}}
}}
In [[computer science]], the '''longest common prefix array''' ('''LCP [[Array data structure|array]]''') is an auxiliary [[data structure]] to the [[suffix array]]. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.
 
For example, if ''A'' := [<samp>aab</samp>, <samp>ab</samp>, <samp>abaab</samp>, <samp>b</samp>, <samp>baab</samp>] is a suffix array, the longest common prefix between ''A''[1] = <samp>aab</samp> and ''A''[2] = <samp>ab</samp> is <samp>''a''</samp> which has length 1, so ''H''[2] = 1 in the LCP array ''H''. Likewise, the LCP of ''A''[2] = <samp>ab</samp> and ''A''[3] = <samp>abaab</samp> is <samp>ab</samp>, so ''H''[3] = 2.
In [[computer science]], the '''longest common prefix array''' (LCP [[Array data structure|array]]) is an auxiliary [[data structure]] to the [[suffix array]]. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.
 
For example, if ''A'' := [<tt>aab</tt>, <tt>ab</tt>, <tt>abaab</tt>, <tt>b</tt>, <tt>baab</tt>] is a suffix array, the longest common prefix between ''A''[1] = <tt>aab</tt> and ''A''[2] = <tt>ab</tt> is <tt>a</tt> which has length 1, so ''H''[2] = 1 in the LCP array ''H''. Likewise, the LCP of ''A''[2] = <tt>ab</tt> and ''A''[3] = <tt>abaab</tt> is <tt>ab</tt>, so ''H''[3] = 2.
 
Augmenting the suffix array with the LCP array allows one to efficiently simulate top-down and bottom-up [[Tree traversal|traversals]] of the [[suffix tree]],{{sfn|Kasai|Lee|Arimura|Arikawa|2001}}{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}} speeds up pattern matching on the suffix array{{sfn|Manber|Myers|1993}} and is a prerequisite for compressed suffix trees.{{sfn|Ohlebusch|Fischer|Gog|2010}}
 
== History ==
The LCP array was introduced in 1993, by [[Udi Manber]] and [[Gene Myers]] alongside the suffix array in order to improve the running time of their string search algorithm.{{sfn|Manber|Myers|1993}}
 
== Definition ==
The LCP array was introduced in 1993 by [[Udi Manber]] and [[Gene Myers]] alongside the suffix array in order to improve the running time of their string search algorithm.{{sfn|Manber|Myers|1993}}
Let <math>A</math> be the [[suffix array]] of the string <math>S=s_1,s_2,\ldots s_{n-1}\$</math> of length <math>n</math>, where <math>\$</math> is a sentinel letter that is unique and [[Lexicographic order|lexicographically]] smaller than any other character. Let <math>S[i,j]</math> denote the substring of <math>S</math> ranging from <math>i</math> to <math>j</math>. Thus, <math>S[A[i], n]</math> is the <math>i</math>th smallest suffix of <math>S</math>.
 
Let <math>\operatorname{lcp}(v,w)</math> denote the length of the longest common prefix between two strings <math>v</math> and <math>w</math>. Then the LCP array <math>H[1,n]</math> is an integer array of size <math>n</math> such that <math>H[1]</math> is undefined and <math>H[i]=\operatorname{lcp}(S[A[i-1],n],S[A[i],n])</math> for every <math>1<i\leq n</math>. Thus <math>H[i]</math> stores the length of longest common prefix of the lexicographically <math>i</math>th smallest suffix and its predecessor in the suffix array.
[[Gene Myers]] was the former Vice president of Informatics Research at Celera Genomics, and [[Udi Manber]] was vice president of engineering at Google.
 
Difference between LCP array and suffix array:
== Definition ==
 
* Suffix array: Represents the lexicographic rank of each suffix of an array.
Let <math>A</math> be the suffix array of the string <math>S=s_1,s_2,...s_n$</math> and let <math>\operatorname{lcp}(v,w)</math> denote the length of the longest common prefix between two strings <math>v</math> and <math>w</math>. Let further denote <math>S[i,j]</math> the substring of <math>S</math> ranging from <math>i</math> to <math>j</math>.
* LCP array: Contains the maximum length prefix match between two consecutive suffixes, after they are sorted lexicographically.
 
Then the LCP array <math>H[1,n]</math> is an integer array of size <math>n</math> such that <math>H[1]</math> is undefined and <math>H[i]=\operatorname{lcp}(S[A[i-1],n],S[A[i],n])</math> for every <math>1<i\leq n</math>. Thus <math>H[i]</math> stores the length of longest common prefix of the [[Lexicographical order|lexicographically]] <math>i</math>'th smallest suffix and its predecessor in the suffix array.
 
== Example ==
Consider the string <math>S = \textrm{banana\$}</math>:
 
{| class="wikitable"
|-
Line 53 ⟶ 51:
| b || a || n || a || n || a || $
|}
and its corresponding sorted suffix array <math>A</math> :
{| class="wikitable"
! {{left header}} | i
Line 62 ⟶ 60:
|}
 
Complete suffixSuffix array with suffixes itselfwritten out underneath vertically:
 
{| class="wikitable"
Line 71 ⟶ 69:
| 7 || 6 || 4 || 2 || 1 || 5 || 3
|-
! {{left header}} | S[A[i], n][1]
| $ || a || a || a || b || n || n
|-
! {{left header}} | S[A[i], n][2]
| || $ || n || n || a || a || a
|-
! {{left header}} | S[A[i], n][3]
| || || a || a || n || $ || n
|-
! {{left header}} | S[A[i], n][4]
| || || $ || n || a || || a
|-
! {{left header}} | S[A[i], n][5]
| || || || a || n || || $
|-
! {{left header}} | S[A[i], n][6]
| || || || $ || a || ||
|-
! {{left header}} | S[A[i], n][7]
| || || || || $ || ||
 
Line 95 ⟶ 93:
 
Then the LCP array <math>H</math> is constructed by comparing lexicographically consecutive suffixes to determine their longest common prefix:
 
{| class="wikitable"
|-
Line 101 ⟶ 100:
|-
! {{left header}} | H[i]
| <math>\bot</math>undefined || 0 || 1 || 3 || 0 || 0 || 2
|}
 
So, for example, <math>H[4]=3</math> is the length of the longest common prefix <math>\text{ana}</math> shared by the suffixes <math> A[3] = S[4,7] = \textrm{ana\$} </math> and <math>A[4] = S[2,7] = \textrm{anana\$}</math>. Note that <math>H[1]=\bot</math> is undefined, since there is no lexicographically smaller suffix.
 
== Efficient construction algorithms ==
== Difference between suffix array and LCP array ==
LCP array construction algorithms can be divided into two different categories: algorithms that compute the LCP array as a byproduct to the suffix array and algorithms that use an already constructed suffix array in order to compute the LCP values.
 
{{harvtxt|Manber|Myers|1993}} provide an algorithm to compute the LCP array alongside the suffix array in <math>O(n \log n)</math> time. {{harvtxt|Kärkkäinen|Sanders|2003}} show that it is also possible to modify their <math>O(n)</math> time algorithm such that it computes the LCP array as well. {{harvtxt|Kasai|Lee|Arimura|Arikawa|2001}} present the first <math>O(n)</math> time algorithm (FLAAP) that computes the LCP array given the text and the suffix array.
Suffix array: Represents the lexicographic rank of each suffix of an array.
 
Assuming that each text symbol takes one byte and each entry of the suffix or LCP array takes 4 bytes, the major drawback of their algorithm is a large space occupancy of <math>13n</math> bytes, while the original output (text, suffix array, LCP array) only occupies <math>9n</math> bytes. Therefore, {{harvtxt|Manzini|2004}} created a refined version of the algorithm of {{harvtxt|Kasai|Lee|Arimura|Arikawa|2001}} (lcp9) and reduced the space occupancy to <math>9n</math> bytes. {{harvtxt|Kärkkäinen|Manzini|Puglisi|2009}} provide another refinement of Kasai's algorithm (<math>\Phi</math>-algorithm) that improves the running time. Rather than the actual LCP array, this algorithm builds the ''permuted'' LCP (PLCP) array, in which the values appear in text order rather than lexicographical order.
LCP array: Contains the maximum length prefix match between two consecutive suffixes, after they are sorted lexicographically.
 
{{harvtxt|Gog|Ohlebusch|2011}} provide two algorithms that although being theoretically slow (<math>O(n^2)</math>) were faster than the above-mentioned algorithms in practice.
== LCP array usage in finding the number of occurrences of a pattern ==
 
{{As of|2012}}, the currently fastest linear-time LCP array construction algorithm is due to {{harvtxt|Fischer|2011}}, which in turn is based on one of the fastest suffix array construction algorithms (SA-IS) by {{harvtxt|Nong|Zhang|Chan|2009}}. {{harvtxt|Fischer|Kurpicz|2017}} based on Yuta Mori's DivSufSort is even faster.
In order to find the number of occurrences of a given string P (length m) in a text T (length N),
 
== Applications ==
* You must use binary search against the suffix array of T.
As noted by {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} several string processing problems can be solved by the following kinds of [[tree traversal]]s:
* You should speed up the LCP array usage as an auxiliary data structure. More specifically, you generate a special version of the LCP array (LCP-LR below) and use that.
 
* bottom-up traversal of the complete suffix tree
The issue with using standard binary search (without the LCP information) is that in each of the O(log N) comparisons you need to make, you compare P to the current entry of the suffix array, which means a full string comparison of up to m characters. So the complexity is O(m*log N).
* top-down traversal of a subtree of the suffix tree
* suffix tree traversal using the suffix links.
 
{{harvtxt|Kasai|Lee|Arimura|Arikawa|2001}} show how to simulate a bottom-up traversal of the [[suffix tree]] using only the [[suffix array]] and LCP array. {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} enhance the suffix array with the LCP array and additional data structures and describe how this ''enhanced suffix array'' can be used to simulate ''all three kinds'' of suffix tree traversals. {{harvtxt|Fischer|Heun|2007}} reduce the space requirements of the enhanced suffix array by preprocessing the LCP array for [[Range Minimum Query|range minimum queries]]. Thus, ''every'' problem that can be solved by suffix tree algorithms can also be solved using the ''enhanced suffix array''.{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}}
The LCP-LR array helps improve this to O(m+log N), in the following way:
 
Deciding if a pattern <math>P</math> of length <math>m</math> is a substring of a string <math>S</math> of length <math>n</math> takes <math> O(m \log n)</math> time if only the suffix array is used. By additionally using the LCP information, this bound can be improved to <math>O(m + \log n)</math> time.{{sfn|Manber|Myers|1993}} {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} show how to improve this running time even further to achieve optimal <math>O(m)</math> time. Thus, using suffix array and LCP array information, the decision query can be answered as fast as using the [[suffix tree]].
At any point during the binary search algorithm, you consider, as usual, a range (L,...,R) of the suffix array and its central point M, and decide whether you continue your search in the left sub-range (L,...,M) or in the right sub-range (M,...,R).
In order to make the decision, you compare P to the string at M. If P is identical to M, you are done, but if not, you will have compared the first k characters of P and then decided whether P is lexicographically smaller or larger than M. Let's assume the outcome is that P is larger than M.
So, in the next step, you consider (M,...,R) and a new central point M' in the middle:
 
The LCP array is also an essential part of compressed suffix trees which provide full suffix tree functionality like suffix links and [[lowest common ancestor]] queries.{{sfn|Sadakane|2007}}{{sfn|Fischer|Mäkinen|Navarro|2009}} Furthermore, it can be used together with the suffix array to compute the Lempel-Ziv [[LZ77 and LZ78|LZ77]] factorization in <math>O(n)</math> time.{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}}{{sfn|Crochemore|Ilie|2008}}{{sfn|Crochemore|Ilie|Smyth|2008}}{{sfn|Chen|Puglisi|Smyth|2008}}
M ...... M' ...... R
|
we know:
lcp(P,M)==k
The trick now is that LCP-LR is precomputed such that an O(1)-lookup tells you the longest common prefix of M and M', lcp(M,M').
 
The [[longest repeated substring problem]] for a string <math>S</math> of length <math>n</math> can be solved in <math>\Theta(n)</math> time using both the suffix array <math>A</math> and the LCP array. It is sufficient to perform a linear scan through the LCP array in order to find its maximum value <math>v_{max}</math> and the corresponding index <math>i</math> where <math>v_{max}</math> is stored. The longest substring that occurs at least twice is then given by <math>S[A[i],A[i]+v_{max}-1]</math>.
You know already (from the previous step) that M itself has a prefix of k characters in common with P: lcp(P,M)=k. Now there are three possibilities:
 
The remainder of this section explains two applications of the LCP array in more detail: How the suffix array and the LCP array of a string can be used to construct the corresponding suffix tree and how it is possible to answer LCP queries for arbitrary suffixes using range minimum queries on the LCP array.
* Case 1: k < lcp(M,M'), i.e. P has fewer prefix characters in common with M than M has in common with M'. This means the (k+1)-th character of M' is the same as that of M, and since P is lexicographically larger than M, it must be lexicographically larger than M', too. So we continue in the right half (M',...,R).
 
=== Find the number of occurrences of a pattern ===
* Case 2: k > lcp(M,M'), i.e. P has more prefix characters in common with M than M has in common with M'. Consequently, if we were to compare P to M', the common prefix would be smaller than k, and M' would be lexicographically larger than P, so, without actually making the comparison, we continue in the left half (M,...,M').
{{Cleanup|reason=this section is a straight-up copy of a [https://stackoverflow.com/a/11374737 StackOverflow answer] so it has the form of a reply to a question.|date=June 2016}}
 
In order to find the number of occurrences of a given string <math>P</math> (length <math>m</math>) in a text <math>T</math> (length <math>N</math>),{{sfn|Manber|Myers|1993}}
* Case 3: k == lcp(M,M'). So M and M' are both identical with P in the first k characters. To decide whether we continue in the left or right half, it suffices to compare P to M' starting from the (k+1)-th character.
 
* We use binary search against the suffix array of <math>T</math> to find the starting and end position of all occurrences of <math>P</math>.
* We continue recursively.
* Now to speed up the search, we use LCP array, specifically a special version of the LCP array (LCP-LR below).
 
The overallissue effectwith using standard binary search (without the LCP information) is that noin charactereach of Pthe is<math>O(\log comparedN)</math> comparisons needed to anybe charactermade, ofwe compare P to the textcurrent moreentry thanof once.the Thesuffix totalarray, numberwhich ofmeans charactera comparisonsfull isstring boundedcomparison byof up to m, socharacters. So the total complexity is indeed <math>O(m+\log N)</math>.
 
The LCP-LR array helps improve this to <math>O(m+\log N)</math>, in the following way:
Obviously, the key remaining question is how did we precompute LCP-LR so it is able to tell us in O(1) time the lcp between any two entries of the suffix array? As you said, the standard LCP array tells you the lcp of consecutive entries only, i.e. lcp(x-1,x) for any x. But M and M' in the description above are not necessarily consecutive entries, so how is that done?
 
At any point during the binary search algorithm, we consider, as usual, a range <math>(L,\dots,R)</math> of the suffix array and its central point <math>M</math>, and decide whether we continue our search in the left sub-range <math>(L,\dots,M)</math> or in the right sub-range <math>(M,\dots,R)</math>. In order to make the decision, we compare <math>P</math> to the string at <math>M</math>. If <math>P</math> is identical to <math>M</math>, our search is complete. But if not, we have already compared the first <math>k</math> characters of <math>P</math> and then decided whether <math>P</math> is lexicographically smaller or larger than <math>M</math>. Let's assume the outcome is that <math>P</math> is larger than <math>M</math>. So, in the next step, we consider <math>(M,\dots,R)</math> and a new central point <math>M'</math> in the middle:
The key to this is to realize that only certain ranges (L,...,R) will ever occur during the binary search: It always starts with (0,...,N) and divides that at the center, and then continues either left or right and divide that half again and so forth. If you think of it: Every entry of the suffix array occurs as central point of exactly one possible range during binary search. So there are exactly N distinct ranges (L...M...R) that can possibly play a role during binary search, and it suffices to precompute lcp(L,M) and lcp(M,R) for those N possible ranges. So that is 2*N distinct precomputed values, hence LCP-LR is O(N) in size.
 
M ...... M' ...... R
Moreover, there is a straightforward recursive algorithm to compute the 2*N values of LCP-LR in O(N) time from the standard LCP array – I'd suggest posting a separate question if you need a detailed description of that.
|
we know:
lcp(P,M)==k
 
The trick now is that LCP-LR is precomputed such that an <math>O(1)</math>-lookup tells us the longest common prefix of <math>M</math> and <math>M'</math>, <math>\mathrm{lcp}(M,M')</math>.
To sum up:
 
We already know (from the previous step) that <math>M</math> itself has a prefix of <math>k</math> characters in common with <math>P</math>: <math>\mathrm{lcp}(P,M)=k</math>. Now there are three possibilities:
* It is possible to compute LCP-LR in O(N) time and O(2*N)=O(N) space from LCP.
 
* Case 1: <math>k < \mathrm{lcp}(M,M')</math>, i.e. <math>P</math> has fewer prefix characters in common with M than M has in common with M'. This means the (k+1)-th character of M' is the same as that of M, and since P is lexicographically larger than M, it must be lexicographically larger than M', too. So we continue in the right half (M',...,R).
* Using LCP-LR during binary search helps accelerate the search procedure from O(M*log N) to O(M+log N).
* Case 2: <math>k > \mathrm{lcp}(M,M')</math>, i.e. <math>P</math> has more prefix characters in common with <math>M</math> than <math>M</math> has in common with <math>M'</math>. Consequently, if we were to compare <math>P</math> to <math>M'</math>, the common prefix would be smaller than <math>k</math>, and <math>M'</math> would be lexicographically larger than <math>P</math>, so, without actually making the comparison, we continue in the left half <math>(M,\dots,M')</math>.
* Case 3: <math>k = \mathrm{lcp}(M,M')</math>. So M and M' are both identical with <math>P</math> in the first <math>k</math> characters. To decide whether we continue in the left or right half, it suffices to compare <math>P</math> to <math>M'</math> starting from the <math>(k+1)</math>th character.
* We continue recursively.
 
The overall effect is that no character of <math>P</math> is compared to any character of the text more than once (for details see {{sfn|Manber|Myers|1993}}). The total number of character comparisons is bounded by <math>m</math>, so the total complexity is indeed <math>O(m+\log N)</math>.
* You can use two binary searches to determine the left and right end of the match range for P, and the length of the match range corresponds with the number of occurrences for P.
 
We still need to precompute LCP-LR so it is able to tell us in <math>O(1)</math> time the lcp between any two entries of the suffix array. We know the standard LCP array gives us the lcp of consecutive entries only, i.e. <math>\mathrm{lcp}(i-1,i)</math> for any <math>i</math>. However, <math>M</math> and <math>M'</math> in the description above are not necessarily consecutive entries.
== Efficient construction algorithms ==
LCP array construction algorithms can be divided into two different
categories: algorithms that compute the LCP array as a byproduct to the suffix
array and algorithms that use an already constructed suffix array in order to
compute the LCP values.
 
The key to this is to realize that only certain ranges <math>(L,\dots,R)</math> will ever occur during the binary search: It always starts with <math>(0, \dots,N)</math> and divides that at the center, and then continues either left or right and divide that half again and so forth. Another way of looking at it is : every entry of the suffix array occurs as central point of exactly one possible range during binary search. So there are exactly N distinct ranges <math>(L\dots M\dots R)</math> that can possibly play a role during binary search, and it suffices to precompute <math>\mathrm{lcp}(L,M)</math> and <math>\mathrm{lcp}(M,R)</math> for those <math>N</math> possible ranges. So that is <math>2N</math> distinct precomputed values, hence LCP-LR is <math>O(N)</math> in size.
{{harvtxt|Manber|Myers|1993}} provide an algorithm to compute the LCP array alongside the suffix array
in <math>O(n \log n)</math> time. {{harvtxt|Kärkkäinen|Sanders|2003}} show that it is also possible to modify their
<math>O(n)</math> time algorithm such that it computes the LCP array as well.
{{harvtxt|Kasai|Lee|Arimura|Arikawa|2001}} present the first <math>O(n)</math> time algorithm (FLAAP) that computes the LCP
array given the text and the suffix array.
 
Moreover, there is a straightforward recursive algorithm to compute the <math>2N</math> values of LCP-LR in <math>O(N)</math> time from the standard LCP array.
Assuming that each text symbol takes
one byte and each entry of the suffix or LCP array takes 4 bytes, the major drawback of their
algorithm is a large space occupancy of <math>13n</math> bytes, while the
original output (text, suffix array, LCP array) only occupies <math>9n</math>
bytes. Therefore {{harvtxt|Manzini|2004}} created a refined version of the algorithm of {{harvtxt|Kasai|Lee|Arimura|Arikawa|2001}} (lcp9) and reduced the
space occupancy to <math>9n</math> bytes. {{harvtxt|Kärkkäinen|Manzini|Puglisi|2009}} provide another refinement of
Kasai's algorithm (<math>\Phi</math>-algorithm) that improves the running
time. Rather than the actual LCP array, this algorithm builds the ''permuted''
LCP (PLCP) array, in which the values appear in text order rather than lexicographical order.
 
To sum up:
{{harvtxt|Gog|Ohlebusch|2011}} provide two algorithms that although being theoretically slow
(<math>O(n^2)</math>) were faster than the above mentioned algorithms in
practice.
 
* It is possible to compute LCP-LR in <math>O(N)</math> time and <math>O(2N)=O(N)</math> space from LCP.
As of 2012, the currently fastest linear-time LCP array construction algorithm is due to {{harvtxt|Fischer|2011}},
* Using LCP-LR during binary search helps accelerate the search procedure from <math>O(M\log N)</math> to <math>O(M+\log N)</math>.
which in turn is based on one of the fastest suffix array construction algorithms by {{harvtxt|Nong|Zhang|Chan|2009}}.
* We can use two binary searches to determine the left and right end of the match range for <math>P</math>, and the length of the match range corresponds with the number of occurrences for P.
 
=== Suffix tree construction ===
== Applications ==
Given the suffix array <math>A</math> and the LCP array <math>H</math> of a string <math>S=s_1,s_2,\ldots s_n\$</math> of length <math>n+1</math>, its suffix tree <math>ST</math> can be constructed in <math>O(n)</math> time based on the following idea: Start with the partial suffix tree for the lexicographically smallest suffix and repeatedly insert the other suffixes in the order given by the suffix array.
As noted by {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} several string processing problems can be solved by the following kinds of [[tree traversal]]s:
* bottom-up traversal of the complete suffix tree
* top-down traversal of a subtree of the suffix tree
* suffix tree traversal using the suffix links.
 
Let <math>ST_{i}</math> be the partial suffix tree for <math>0\leq i \leq n</math>. Further let <math>d(v)</math> be the length of the concatenation of all path labels from the root of <math>ST_i</math> to node <math>v</math>.
{{harvtxt|Kasai|Lee|Arimura|Arikawa|2001}} show how to simulate a bottom-up traversal of the [[suffix tree]] using only the [[suffix array]] and LCP array. {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} enhance the suffix array with the LCP array and additional data structures and describe how this ''enhanced suffix array'' can be used to simulate ''all three kinds'' of suffix tree traversals. {{harvtxt|Fischer|Heun|2007}} reduce the space requirements of the enhanced suffix array by preprocessing the LCP array for [[Range Minimum Query|range minimum queries]]. Thus, ''every'' problem that can be solved by suffix tree algorithms can also be solved using the ''enhanced suffix array''.{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}}
 
[[File:Constructing the suffix tree of banana based on the suffix array and the LCP array - Case 1.pdf|thumb|400px|Case 1 (<math>d(v)=H[i+1]</math>): Suppose the suffixes <math>a\$</math>, <math>ana\$</math>, <math>anana\$</math> and <math>banana$</math> of the string <math>S=banana$</math> are already added to the suffix tree. Then the suffix <math>na\$</math> is added to the tree as shown in the picture. The ''rightmost'' path is highlighted in red.]]
Deciding if a pattern <math>P</math> of length <math>m</math> is a substring of a string <math>S</math> of length <math>n</math> takes <math> O(m \log n)</math> time if only the suffix array is used. By additionally using the LCP information, this bound can be improved to <math>O(m + \log n)</math> time.{{sfn|Manber|Myers|1993}} {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} show how to improve this running time even further to achieve optimal <math>O(m)</math> time. Thus, using suffix array and LCP array information, the decision query can be answered as fast as using the [[suffix tree]].
 
The LCP array is also an essential part of compressed suffix trees which provide full suffix tree functionality like suffix links and [[lowest common ancestor]] queries.{{sfn|Sadakane|2007}}{{sfn|Fischer|Mäkinen|Navarro|2009}} Furthermore it can be used together with the suffix array to compute the Lempel-Ziv [[LZ77 and LZ78|LZ77]] factorization in <math>O(n)</math> time. {{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}}{{sfn|Crochemore|Ilie|2008}}{{sfn|Crochemore|Ilie|Smyth|2008}}{{sfn|Chen|Puglisi|Smyth|2008}}
 
The [[longest repeated substring problem]] for a string <math>S</math> of length <math>n</math> can be solved in <math>\Theta(n)</math> time using both the suffix array <math>A</math> and the LCP array. It is sufficient to perform a linear scan through the LCP array in order to find its maximum value <math>v_{max}</math> and the corresponding index <math>i</math> where <math>v_{max}</math> is stored. The longest substring that occurs at least twice is then given by <math>S[A[i],A[i]+v_{max}-1]</math>.
 
The remainder of this section explains two applications of the LCP array in more detail: How the suffix array and the LCP array of a string can be used to construct the corresponding suffix tree and how it is possible to answer LCP queries for arbitrary suffixes using range minimum queries on the LCP array.
 
=== Suffix tree construction ===
Given the suffix array <math>A</math> and the LCP array <math>H</math> of a string <math>S=s_1,s_2,...s_n$</math> of length <math>n+1</math>, its suffix tree <math>ST</math> can be constructed in <math>O(n)</math> time based on the following idea: Start with the partial suffix tree for the lexicographically smallest suffix and repeatedly insert the other suffixes in the order given by the suffix array.
 
Let <math>ST_{i}</math> be the partial suffix tree for <math>0\leq i \leq n</math>. Further let <math>d(v)</math> be the length of the concatenation of all path labels from the root of <math>ST_i</math> to node <math>v</math>.
[[File:Constructing the suffix tree of banana based on the suffix array and the LCP array - Case 1.pdf|thumb|400px|Case 1 (<math>d(v)=H[i+1]</math>): Suppose the suffixes <math>a$</math>, <math>ana$</math>, <math>anana$</math> and <math>banana$</math> of the string <math>S=banana$</math> are already added to the suffix tree. Then the suffix <math>na$</math> is added to the tree as shown in the picture. The ''rightmost'' path is highlighted in red.]]
Start with <math>ST_0</math>, the tree consisting only of the root. To insert <math>A[i+1]</math> into <math>ST_i</math>, walk up the ''rightmost'' path beginning at the recently inserted leaf <math>A[i]</math> to the root, until the deepest node <math>v</math> with <math>d(v) \leq H[i+1]</math> is reached.
 
We need to distinguish two cases:
* <math>d(v)=H[i+1]</math>: This means that the concatenation of the labels on the root-to-<math>v</math> path equals the longest common prefix of suffixes <math>A[i]</math> and <math>A[i+1]</math>. <br /> In this case, insert <math>A[i+1]</math> as a new leaf <math>x</math> of node <math>v</math> and label the edge <math>(v,x)</math> with <math>S[A[i+1]+H[i+1],n]</math>. Thus the edge label consists of the remaining characters of suffix <math>A[i+1]</math> that are not already represented by the concatenation of the labels of the root-to-<math>v</math> path. <br />This creates the partial suffix tree <math>ST_{i+1}</math>. [[File:Constructing the suffix tree of banana based on the suffix array and the LCP array - Case 2.pdf|thumb|400px|Case 2 (<math>d(v) < H[i+1]</math>): In order to add suffix <math>nana$</math>, the edge to the previously inserted suffix <math>na$</math> has to be split up. The new edge to the new internal node is labeled with the longest common prefix of the suffixes <math>na$</math> and <math>nana$</math>. The edges connecting the two leaves are labeled with the ''remaining'' suffix characters that are not part of the prefix.]]
 
* <math>d(v) < =H[i+1]</math>: This means that the concatenation of the labels on the root-to-<math>v</math> path displays less characters thanequals the longest common prefix of suffixes <math>A[i]</math> and <math>A[i+1]</math>. <br /> In this case, insert <math>A[i+1]</math> as a new leaf <math>x</math> of node <math>v</math> and label the ''missing''edge characters<math>(v,x)</math> arewith contained<math>S[A[i+1]+H[i+1],n]</math>. inThus the edge label consists of the remaining characters of suffix <math>vA[i+1]</math>'s ''rightmost''that edge.are Thereforenot wealready haverepresented by the concatenation of the labels of the root-to-<math>v</math> path. ''split<br up''/>This thatcreates edgethe aspartial followssuffix tree <math>ST_{i+1}</math>. [[File:Constructing the suffix tree of banana based on the suffix array and the LCP array - Case 2.pdf|thumb|400px|Case 2 (<brmath>d(v) < H[i+1]</math>Let): In order to add suffix <math>wnana\$</math>, the edge to the previously inserted suffix <math>na\$</math> has to be split up. The new edge to the childnew internal node is labeled with the longest common prefix of the suffixes <math>vna$</math> onand <math>ST_inana\$</math>. The edges connecting the two leaves are labeled with the ''remaining''s rightmostsuffix pathcharacters that are not part of the prefix.]]
* <math>d(v) < H[i+1]</math>: This means that the concatenation of the labels on the root-to-<math>v</math> path displays less characters than the longest common prefix of suffixes <math>A[i]</math> and <math>A[i+1]</math> and the ''missing'' characters are contained in the edge label of <math>v</math>'s ''rightmost'' edge. Therefore, we have to ''split up'' that edge as follows: <br />Let <math>w</math> be the child of <math>v</math> on <math>ST_i</math>'s rightmost path.
 
# Delete the edge <math>(v,w)</math>.
# Add a new internal node <math>y</math> and a new edge <math>(v,y)</math> with label <math>S[A[i]+d(v),A[i]+H[i+1]-1]</math>. The new label consists of the ''missing'' characters of the longest common prefix of <math>A[i]</math> and <math>A[i+1]</math>. Thus, the concatenation of the labels of the root-to-<math>y</math> path now displays the longest common prefix of <math>A[i]</math> and <math>A[i+1]</math>.
Line 228 ⟶ 202:
The LCP array <math>H</math> only contains the length of the longest common prefix of every pair of consecutive suffixes in the suffix array <math>A</math>. However, with the help of the inverse suffix array <math>A^{-1}</math> (<math> A[i]= j \Leftrightarrow A^{-1}[j]= i </math>, i.e. the suffix <math>S[j,n]</math> that starts at position <math>j</math> in <math>S</math> is stored in position <math>A^{-1}[j]</math> in <math>A</math>) and constant-time [[Range Minimum Query|range minimum queries]] on <math>H</math>, it is possible to determine the length of the longest common prefix of arbitrary suffixes in <math>O(1)</math> time.
 
Because of the lexicographic order of the suffix array, every common prefix of the suffixes <math>S[i,n]</math> and <math>S[j,n]</math> has to be a common prefix of all suffixes between <math>i</math>'s position in the suffix array <math> A^{-1}[i]</math> and <math>j</math>'s position in the suffix array <math> A^{-1}[j] </math>. Therefore, the length of the longest prefix that is shared by ''all'' of these suffixes is the minimum value in the interval <math>H[A^{-1}[i]+1,A^{-1}[j]]</math>. This value can be found in constant time if <math>H</math> is preprocessed for range minimum queries.
 
Thus given a string <math>S</math> of length <math>n</math> and two arbitrary positions <math>i,j</math> in the string <math>S</math> with <math> A^{-1}[i] < A^{-1}[j] </math>, the length of the longest common prefix of the suffixes <math>S[i,n]</math> and <math>S[j,n]</math> can be computed as follows: <math>\operatorname{LCP}(i,j)=H[\operatorname{RMQ}_H(A^{-1}[i]+1,A^{-1}[j])]</math>.
 
== Notes ==
{{Reflist}}
 
== References ==
*{{cite journal| doi=10.1016/S1570-8667(03)00065-0 |ref=harv| title=Replacing suffix trees with enhanced suffix arrays| year=2004| last1=Abouelhoda| first1=Mohamed Ibrahim| last2=Kurtz| first2=Stefan| last3=Ohlebusch| first3=Enno| journal=Journal of Discrete Algorithms| volume=2| pages=5353–86 | doi-access=free}}
*{{cite journal| doi=10.1137/0222058 | ref=harv| title=Suffix Arrays: A New Method for On-Line String Searches| year=1993| last1=Manber| first1=Udi| last2=Myers| first2=Gene| journal=SIAM Journal on Computing| volume=22| issue=5| pages=935| citeseerx=10.1.1.105.6571| s2cid=5074629}}
*{{cite conference |doi = 10.1007/3-540-48194-X_17 |ref=harv| conference = Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching | year = 2001 | last1 = Kasai | first1 = T. | last2 = Lee | first2 = G. | last3 = Arimura | first3 = H. | last4 = Arikawa | first4 = S. | last5 = Park | first5 = K. | title = Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications | series = Lecture Notes in Computer Science | volume = 2089 | pages = 181–192 | isbn = 978-3-540-42271-6
}}
*{{cite conference| doi=10.1007/978-3-642-16321-0_34 |ref=harv| title=CST++| conference=String Processing and Information Retrieval| series=Lecture Notes in Computer Science| year=2010| last1=Ohlebusch| first1=Enno| last2=Fischer| first2=Johannes| last3=Gog| first3=Simon| isbn=978-3-642-16320-3| volume=6393| pages=322}}
*{{cite conference|ref title =harv Simple linear work suffix array construction
| title = Simple linear work suffix array construction
| url = http://dl.acm.org/citation.cfm?id=1759210.1759301
| year = 2003
Line 249 ⟶ 222:
| last1 = Kärkkäinen | first1 = Juha
| last2 = Sanders | first2 = Peter | author2-link = Peter Sanders (computer scientist)
| accessdateaccess-date = 2012-08-28 }}
* {{cite conference| doi = 10.1007/978-3-642-22300-6_32 |ref=harv| last1 = Fischer | first1 = Johannes | title = Inducing the LCP-Array | conference = Algorithms and Data Structures | series = Lecture Notes in Computer Science | volume = 6844 | pages = 374–385 | year = 2011 | isbn = 978-3-642-22299-3 | arxiv = 1101.3448 }}
*{{cite conference| doi=10.1007/978-3-540-27810-8_32 |ref=harv| title=Two Space Saving Tricks for Linear Time LCP Array Computation| conference=Algorithm Theory - SWAT 2004| series=Lecture Notes in Computer Science| year=2004| last1=Manzini| first1=Giovanni| isbn=978-3-540-22339-9| volume=3111| pages=372}}
*{{cite conference| doi=10.1007/978-3-642-02441-2_17 |ref=harv| title=Permuted Longest-Common-Prefix Array| conference=Combinatorial Pattern Matching| series=Lecture Notes in Computer Science| year=2009| last1=Kärkkäinen| first1=Juha| last2=Manzini| first2=Giovanni| last3=Puglisi| first3=Simon J.| isbn=978-3-642-02440-5| volume=5577| pages=181}}
*{{cite conference| doi=10.1007/978-3-540-92182-0_14 |ref=harv| title=Space-Time Tradeoffs for Longest-Common-Prefix Array Computation| conference=Algorithms and Computation| series=Lecture Notes in Computer Science| year=2008| last1=Puglisi| first1=Simon J.| last2=Turpin| first2=Andrew| isbn=978-3-540-92181-3| volume=5369| pages=124}}
*{{cite conference|ref title =harv Fast and Lightweight LCP-Array Construction Algorithms
| title = Fast and Lightweight LCP-Array Construction Algorithms
| url = http://www.siam.org/proceedings/alenex/2011/alx11_03_gogs.pdf
| year = 2011
Line 262 ⟶ 234:
| last1 = Gog | first1 = Simon
| last2 = Ohlebusch | first2 = Enno
| accessdateaccess-date = 2012-08-28 }}
*{{cite conference| doi=10.1109/DCC.2009.42 | ref=harv| title=Linear Suffix Array Construction by Almost Pure Induced-Sorting| conference=2009 Data Compression Conference| year=2009| last1=Nong| first1=Ge| last2=Zhang| first2=Sen| last3=Chan| first3=Wai Hong| isbn=978-0-7695-3592-0| pages=193}}
*{{cite conference | doi = 10.1007/978-3-540-74450-4_41 | ref=harv| last1 = Fischer | first1 = Johannes
| last2 = Heun | first2 = Volker
| title = A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array
Line 274 ⟶ 246:
| isbn = 978-3-540-74449-8
}}
*{{Cite journal | doi = 10.1007/s11786-007-0024-4 | ref= harv | last1 = Chen | first1 = G. | last2 = Puglisi | first2 = S. J. | last3 = Smyth | first3 = W. F. | title = Lempel–Ziv Factorization Using Less Time & Space | journal = Mathematics in Computer Science | volume = 1 | issue = 4 | pages = 605 | year = 2008 | s2cid = 1721891 }}
*{{Cite journal | doi = 10.1016/j.ipl.2007.10.006 | ref=harv | last1 = Crochemore | first1 = M. | last2 = Ilie | first2 = L. | title = Computing Longest Previous Factor in linear time and applications | journal = Information Processing Letters | volume = 106 | issue = 2 | pages = 75 | year = 2008 | citeseerx=10.1.1.70.5720 | s2cid = 5492217 }}
*{{Cite conference | doi = 10.1109/DCC.2008.36 | ref=harv | last1 = Crochemore | first1 = M. | last2 = Ilie | first2 = L. | last3 = Smyth | first3 = W. F. | title = A Simple Algorithm for Computing the Lempel Ziv Factorization | conference = Data Compression Conference (dcc 2008) | pages = 482 | year = 2008 | isbn = 978-0-7695-3121-2 | hdl = 20.500.11937/5907 | hdl-access = free }}
*{{Cite journal| doi = 10.1007/s00224-006-1198-x |ref=harv| last1 = Sadakane | first1 = K. | title = Compressed Suffix Trees with Full Functionality | journal = Theory of Computing Systems | volume = 41 | issue = 4 | pages = 589–607 | year = 2007 |citeseerx=10.1.1.224.4152| s2cid = 263130 }}
*{{Cite journal | doi = 10.1016/j.tcs.2009.09.012 | ref=harv | last1 = Fischer | first1 = Johannes | last2 = Mäkinen | first2 = Veli | last3 = Navarro | first3 = Gonzalo | title = Faster entropy-bounded compressed suffix trees | journal = Theoretical Computer Science | volume = 410 | issue = 51 | pages = 5354 | year = 2009 | doi-access = free }}
*{{cite journal |last1=Fischer |first1=Johannes |last2=Kurpicz |first2=Florian |title=Dismantling DivSufSort |arxiv=1710.01896 |date=5 October 2017 |journal=Proceedings of the Prague Stringology Conference 2017}}
 
== External links ==
{{Commons category}}
*[https://github.com/elventear/sais-lite-lcp Mirror of the ad-hoc-implementation of the code described in {{harvtxt|Fischer|2011}}]
*[https://github.com/elventear/sais-lite-lcp Mirror of the ad-hoc-implementation of the code] described in {{harvtxt|Fischer|2011}}
*[https://github.com/simongog/sdsl/ SDSL: Succinct Data Structure Library - Provides various LCP array implementations, Range Minimum Query (RMQ) support structures and many more succinct data structures ]
*[https://github.com/carrotsearch/jsuffixarrays/blob/master/src/main/java/org/jsuffixarrays/Traversals.java Bottom-up suffix tree traversal emulated using suffix array and LCP array (Java)]
*[httphttps://code.google.com/p/text-indexing/ Text-Indexing project] (linear-time construction of suffix trees, suffix arrays, LCP array and [[Burrows-WheelerBurrows–Wheeler Transform]])
 
[[Category:Arrays]]