Content deleted Content added
remove the problem |
Undid revision 1292155026 by Daniel.Cardenas (talk): fits in the schem described at Dynamic_programming#Computer_science |
||
(29 intermediate revisions by 17 users not shown) | |||
Line 1:
{{Short description|Computer science problem}}
{{Wikibooks |Algorithm Implementation/Strings/Longest common substring}}
{{Distinguish|longest common subsequence}}
In [[computer science]], a '''longest common substring''' of two or more strings is
==Examples==
[[File:LongestSubstring svg.svg|thumb|The strings "BADANAT" and "CANADAS" share the maximal-length substrings "ADA" and "ANA".]]
The picture shows two strings where the problem has multiple solutions. Although the substring occurrences always overlap,
The strings "ABABC", "BABCA"
ABABC
Line 18 ⟶ 19:
Given two strings, <math>S</math> of length <math>m</math> and <math>T</math> of length <math>n</math>, find a longest string which is substring of both <math>S</math> and <math>T</math>.
A generalization is the
==Algorithms==
One can find the lengths and starting positions of the longest common substrings of <math>S</math> and <math>T</math> in [[Big O notation#Family of Bachmann–Landau notations|<math>\Theta</math>]]<math>(n+m)</math> time with the help of a [[generalized suffix tree]]. A faster algorithm can be achieved in the [[word RAM]] model of computation if the size <math>\sigma</math> of the input alphabet is in <math>2^{o \left( \sqrt{\log(n+m)} \right) }</math>. In particular, this algorithm runs in <math display=inline>O\left( (n+m) \log \sigma/\sqrt{\log (n+m)} \right) </math> time using <math>O\left((n+m)\log\sigma/\log (n+m) \right)</math> space.<ref name="CKPR21">{{cite conference | last1 = Charalampopoulos | first1 = Panagiotis | last2 = Kociumaka | first2 = Tomasz | last3 = Pissis | first3 = Solon P. | last4 = Radoszewski | first4 = Jakub | date = Aug 2021 | title = Faster Algorithms for Longest Common Substring | conference = European Symposium on Algorithms | editor1-last=Mutzel|editor1-first=Petra|editor2-last=Pagh|editor2-first=Rasmus|editor3-last=Herman|editor3-first=Grzegorz | publisher=Schloss Dagstuhl | series=Leibniz International Proceedings in Informatics (LIPIcs) | volume=204 | doi = 10.4230/LIPIcs.ESA.2021.30 | doi-access = free }} Here: Theorem 1, p.30:2.</ref> Solving the problem by [[dynamic programming]] costs <math>\Theta(nm)</math>. The solutions to the generalized problem take <math>\Theta(n_1 +
===Suffix tree===
Line 30 ⟶ 31:
===Dynamic programming===
The following pseudocode finds the set of longest common substrings between two strings with [[Dynamic_programming#Computer_science|dynamic programming]]:
'''function'''
L := '''array'''(1..r, 1..n)
z := 0 # length of the longest common substring found so far
ret := {}
Line 46 ⟶ 47:
'''if''' L[i, j] > z
z := L[i, j]
ret := {S[(i − z + 1)..i]}
'''else''' '''if''' L[i, j] = z
ret := ret ∪ {S[(i − z + 1)..i]}
'''else'''
L[i, j] := 0
'''return''' ret
This algorithm runs in <math>O(n r)</math> time. The array <code>L</code> stores the length of the longest common
The following tricks can be used to reduce the memory usage of an implementation:
* Keep only the last and current row of the DP table to save memory (<math>O(\min(r, n))</math> instead of <math>O(n r)</math>)
** The last and current row can be stored on the same 1D array by traversing the inner loop backwards
* Store only non-zero values in the rows. This can be done using hash-tables instead of arrays. This is useful for large alphabets.
|