When a [[nucleotide]] sequence is written as text using a four-letter alphabet, the repetitiveness of the text, that is, the repetition of its [[N-gram]]s (words), can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. Subsequent work improved the original algorithm described in [[Edward Trifonov|Trifonov]] (1990),<ref name=Trifonov1990 /> without changing the essence of the linguistic complexity approach.<ref name=Gabrielian1999>{{Cite journal | last1 = Gabrielian | first1 = A. | title = Sequence complexity and DNA curvature | doi = 10.1016/S0097-8485(99)00007-8 | journal = Computers & Chemistry | volume = 23 | issue = 3–4 | pages = 263–274| year = 1999 | pmid = 10404619}}</ref><ref name=Orlov2004>{{Cite journal | last1 = Orlov | first1 = Y. L. | last2 = Potapov | first2 = V. N. | doi = 10.1093/nar/gkh466 | title = Complexity: An internet resource for analysis of DNA sequence complexity | journal = Nucleic Acids Research | volume = 32 | issue = Web Server issue | pages = W628–W633 | year = 2004 | pmid = 15215465| pmc =441604 }}</ref><ref name=Janson2004>{{Cite journal | last1 = Janson | first1 = S. | last2 = Lonardi | first2 = S. | last3 = Szpankowski | first3 = W. | author3-link = Wojciech Szpankowski | title = On average sequence complexity | doi = 10.1016/j.tcs.2004.06.023 | journal = Theoretical Computer Science | volume = 326 | pages = 213–227 | year = 2004 | issue = 1–3 | doi-access = free }}</ref>
The meaning of LC may be better understood by representing a sequence as a [[Tree (data structure)|tree]] of all its subsequences. The most complex sequences have maximally balanced trees, while the measure of imbalance or tree asymmetry serves as a [[Computer linguistics|complexity measure]]. The number of nodes at tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words of length {{math|<var>i</var>}} in a given sequence; the number of nodes at tree level {{math|<var>i</var>}} in the most balanced tree, which corresponds to the most complex sequence of length {{math|<var>N</var>}}, is either {{math|4<sup><var>i</var></sup>}} or {{math|<var>N</var> − <var>i</var> + 1}}, whichever is smaller.
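{{nb5}} <math>C = U_1 U_2 \cdots U_i \cdots U_w </math>
where {{math|<var>U</var><sub><var>i</var></sub>}} is the vocabulary usage for words of length {{math|<var>i</var>}}, that is, the ratio of the actual vocabulary size of the sequence to the maximum possible vocabulary size for a sequence of that length.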
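The product formula can be illustrated with a minimal Python sketch. It assumes that vocabulary usage for word length {{math|<var>i</var>}} is the number of distinct words of that length divided by the smaller of {{math|4<sup><var>i</var></sup>}} and {{math|<var>N</var> − <var>i</var> + 1}}. The function names <code>vocabulary_usage</code> and <code>complexity</code> and the parameter <code>w</code> (the largest word length considered) are hypothetical and are not taken from the cited works.

```python
def vocabulary_usage(seq: str, i: int, alphabet_size: int = 4) -> float:
    """Ratio of distinct words of length i to the maximum possible number.

    Assumption: the maximum is min(alphabet_size**i, N - i + 1).
    """
    n = len(seq)
    if i < 1 or i > n:
        raise ValueError("word length must be between 1 and len(seq)")
    # Collect every distinct word (substring) of length i.
    observed = len({seq[k:k + i] for k in range(n - i + 1)})
    maximum = min(alphabet_size ** i, n - i + 1)
    return observed / maximum


def complexity(seq: str, w: int) -> float:
    """Product U_1 * U_2 * ... * U_w of vocabulary-usage values."""
    c = 1.0
    for i in range(1, w + 1):
        c *= vocabulary_usage(seq, i)
    return c
```

Under these assumptions, a homopolymer such as <code>"AAAAAAAA"</code> scores far lower than a sequence in which every short word is distinct, for which every factor equals 1.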
Another modified version was used by Troyanskaya et al. (2002),<ref name=TAKLB01>{{Cite journal | doi = 10.1093/bioinformatics/18.5.679| title = Sequence complexity profiles of prokaryotic genomic sequences: A fast algorithm for calculating linguistic complexity| journal = Bioinformatics| volume = 18| issue = 5| pages = 679–88| year = 2002| last1 = Troyanskaya | first1 = O. G.| last2 = Arbell | first2 = O.| last3 = Koren | first3 = Y.| last4 = Landau | first4 = G. M.| last5 = Bolshoy | first5 = A. | pmid=12050064| doi-access = free}}</ref> wherein linguistic complexity (LC) is defined as the ratio of the number of distinct substrings of any length present in the string to the maximum possible number of substrings. The maximum vocabulary over word sizes 1 to {{math|<var>m</var>}} can be calculated according to a simple formula.<ref name=TAKLB01 />
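A brute-force Python sketch of this ratio is given below as an illustration only. It enumerates the substrings directly and is not the fast algorithm of the cited paper. The per-length maximum, taken as the smaller of {{math|4<sup><var>i</var></sup>}} and {{math|<var>N</var> − <var>i</var> + 1}}, is an assumption of the sketch, and the function name <code>linguistic_complexity</code> is hypothetical.

```python
def linguistic_complexity(seq: str, alphabet_size: int = 4) -> float:
    """Distinct substrings of all lengths divided by the maximum possible.

    Assumption: for word length i the maximum is
    min(alphabet_size**i, N - i + 1), summed over i = 1..N.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    observed = 0
    maximum = 0
    for i in range(1, n + 1):
        # Distinct words of length i actually present in the sequence.
        observed += len({seq[k:k + i] for k in range(n - i + 1)})
        maximum += min(alphabet_size ** i, n - i + 1)
    return observed / maximum
```

The enumeration touches every substring, so it is practical only for short sequences or windows.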
This complexity calculation can be used in sequence analysis to search for conserved regions between compared sequences and to detect low-complexity regions, including simple sequence repeats, imperfect [[Direct repeat|direct]] or [[inverted repeat]]s, polypurine and polypyrimidine [[Triple-stranded DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex]]es).<ref name=Kalendar2011>{{Cite journal | last1 = Kalendar | first1 = R. | last2 = Lee | first2 = D. | last3 = Schulman | first3 = A. H. | doi = 10.1016/j.ygeno.2011.04.009 | title = Java web tools for PCR, in silico PCR, and oligonucleotide assembly and analysis | journal = Genomics | volume = 98 | issue = 2 | pages = 137–144 | year = 2011 | pmid = 21569836 | doi-access = }}</ref>
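One possible way to flag low-complexity regions is to score overlapping windows and report those that fall below a cutoff. The Python sketch below is a hedged illustration, not the method of the cited tools. The names <code>distinct_ratio</code> and <code>low_complexity_windows</code> are hypothetical, and the window size, maximum word length and threshold are arbitrary example values.

```python
def distinct_ratio(seq: str, max_word: int, alphabet_size: int = 4) -> float:
    """Distinct words of length 1..max_word over the maximum possible count."""
    n = len(seq)
    observed = 0
    maximum = 0
    for i in range(1, min(max_word, n) + 1):
        observed += len({seq[k:k + i] for k in range(n - i + 1)})
        maximum += min(alphabet_size ** i, n - i + 1)
    return observed / maximum


def low_complexity_windows(seq: str, window: int = 12, max_word: int = 3,
                           threshold: float = 0.5) -> list[tuple[int, float]]:
    """Return (start, score) for each window scoring below the threshold."""
    hits = []
    for start in range(len(seq) - window + 1):
        score = distinct_ratio(seq[start:start + window], max_word)
        if score < threshold:
            hits.append((start, score))
    return hits
```

With these example settings, a run of twelve identical nucleotides is reported, while a window in which nearly every short word is distinct is not.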
== References ==
{{Reflist}}
[[Category:Nucleic acids]]
[[Category:Bioinformatics]]