Linguistic sequence complexity: Difference between revisions

Content deleted Content added
Rkalendar (talk | contribs)
mNo edit summary
OAbot (talk | contribs)
m Open access bot: doi updated in citation with #oabot.
 
(43 intermediate revisions by 23 users not shown)
Line 1:
The'''Linguistic linguisticsequence complexity''' (LC) is a measure of the 'vocabulary richness' of a genetic text in [[gene sequence]]s.<ref name=Trifonov1990>{{cite book| author=[http://evolutionEdward N.haifa.ac.il/index.php/people/item/40 Trifonov| author-edward-n-trifonov-phd link=Edward N. Trifonov] |year=1990| booktitle=Structure &and Methods|, title=StructureVol. and1 Methods| series= Human Genome Initiative and DNA Recombination|; volume=1|Proceedings of the Sixth Conversation in the Discipline Biomolecular Stereodynamics |pages=69–77 |chapter=Making sense of the human genome|publisher=Adenine Press |___location=Albany, New York}}</ref> was introduced as a measure of the ‘vocabulary richness’ of a text.
When a [[nucleotide]] sequence is studiedwritten as a text writtenusing in thea four-letter alphabet, the repetitiveness of such athe text, that is, the extensive repetition of someits [[N-gram|N-grams]]s (words)]], can be calculated, and servedserves as a measure of sequence complexity. Thus, the more complex a [[DNA_sequence|DNA sequence]], the richer is its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. WeSubsequent have recentlywork improved the original algorithm described in ([[Edward Trifonov|Trifonov]] (1990),<ref name=Trifonov1990 /> without changing the essence of the linguistic complexity approach.<ref name=Gabrielian1999>{{Cite journal | last1 = Gabrielian | first1 = A. | title = Sequence complexity and DNA curvature | doi = 10.1016/S0097-8485(99)00007-8 | journal = Computers & Chemistry | volume = 23 | issue = 3–4 | pages = 263–274| year = 1999 | pmid = 10404619}}</ref><ref name=Orlov2004>{{Cite journal | last1 = Orlov | first1 = Y. L. | last2 = Potapov | first2 = V. N. | doi = 10.1093/nar/gkh466 | title = Complexity: An internet resource for analysis of DNA sequence complexity | journal = Nucleic Acids Research | volume = 32 | issue = Web Server issue | pages = W628–W633 | year = 2004 | pmid = 15215465| pmc =441604 }}</ref><ref name=Janson2004>{{Cite journal | last1 = Janson | first1 = S. | last2 = Lonardi | first2 = S. | last3 = Szpankowski | first3 = W. | author3-link = Wojciech Szpankowski | title = On average sequence complexity | doi = 10.1016/j.tcs.2004.06.023 | journal = Theoretical Computer Science | volume = 326 | pages = 213–227 | year = 2004 | issue = 1–3 | doi-access = free }}</ref>
 
The meaning of LC may be better understood by regarding the presentation of a sequence as a [[Tree (data structure)|tree]] of all subsequences of the given sequence. The most complex sequences have maximally balanced trees, while the measure of imbalance or tree asymmetry serves as a [[Computer linguistics|complexity measure]]. The number of nodes at the tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words with the length {{math|<var>i</var>}} in a given sequence; the number of nodes in the most balanced tree, which corresponds to the most complex sequence of length N, at the tree level {{math|<var>i</var>}} is either 4<sup>i</sup> or N-ji+1, whichever is smaller. Complexity ({{math|<var>C</var>}}) of a sequence fragment (with a length RW) can be directly calculated as the product of vocabulary-usage measures (U<sub>i</sub>):<ref name=Gabrielian1999 />
 
{{nb5}} <math>C = U_1 U_2...U_i....U_w </math>
 
Vocabulary usage for oligomers of a given size {{math|<var>i</var>}} can be defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, U<sub>2</sub> for the sequence ACGGGAAGCTGATTCCA = 14/16, as it contains 14 of 16 possible different dinucleotides; U<sub>3</sub> for the same sequence = 15/15, and U<sub>4</sub>=14/14. For the sequence ACACACACACACACACA, U<sub>1</sub>=1/2;U<sub>2</sub>=2/16=0.125, as it has a simple vocabulary of only two dinucleotides; U<sub>3</sub> for this sequence = 2/15. k-tuples with k from two to W considered, while W depends on RW. For RW values less than 18, W is equal to 3; for RW less than 67, W is equal to 4; for RW<260,W=5; for RW<1029,W=6, and so on. The value of {{math|<var>C</var>}} provides a measure of sequence complexity in the convenient range 0<C<1 for various DNA sequence fragments of a given length. This novel formula is different from the previous LC measure in two respects: in the way vocabulary usage U<sub>i</sub> is calculated, and because {{math|<var>i</var>}} is not in the range of 2 to N-1 but only up to W. This new limitation on the range of U<sub>i</sub> makes the algorithm substantially more effective without loss of power.
 
 
The sequence analysis complexity calculation method can be used to search for conserved regions between compared sequences for the detection of low-complexity regions including simple sequence repeats, imperfect [[Direct_repeat|direct]] or [[Inverted_repeat|inverted repeats]], polypurine and polypyrimidine [[Triple-stranded_DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex|G-quadruplexes]]) <ref>{{cite journal| author=Andrei Gabrielian, Alexander Bolshoy|year=1999| journal=Computer & Chemistry| title=Sequence complexity and DNA curvature| volume=23| pages=263-274| doi=10.1016/S0097-8485(99)00007-8}}</ref>, <ref>{{cite journal| author=Orlov Yuriy Lvovich, Potapov Vladimir Nikilaevich |year=2004| journal=Nucleic Acids Research| title=Complexity: an internet resource for analysis of DNA sequence complexity| volume=32| pages=W628–W633| doi=10.1093/nar/gkh466}}</ref>, <ref>{{cite journal| author=Svante Janson, Stefano Lonardi, Wojciech Szpankowski|year=2004| journal=Theoretical Computer Science| title=On average sequence complexity | volume=326| pages=213–227| doi=10.1016/j.tcs.2004.06.023}}</ref>, <ref>{{cite journal| author=Kalendar R, Lee D, Schulman AH |year=2011| journal=Genomics| title=Java web tools for PCR, <i>in silico</i> PCR, and oligonucleotide assembly and analysis|pmid=21569836|volume=98| issue=2| pages=137-144| doi=10.1016/j.ygeno.2011.04.009}}</ref>.
 
Vocabulary usage for [[oligomers]] of a given size {{math|<var>i</var>}} can be defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, U<sub>2</sub> for the sequence ACGGGAAGCTGATTCCA = 14/16, as it contains 14 of 16 possible different dinucleotides; U<sub>3</sub> for the same sequence = 15/15, and U<sub>4</sub>=14/14. For the sequence ACACACACACACACACA, U<sub>1</sub>=1/2; U<sub>2</sub>=2/16=0.125, as it has a simple vocabulary of only two dinucleotides; U<sub>3</sub> for this sequence = 2/15. k-tuples with k from two to W considered, while W depends on RW. For RW values less than 18, W is equal to 3; for RW less than 67, W is equal to 4; for RW<260, W=5; for RW<1029, W=6, and so on. The value of {{math|<var>C</var>}} provides a measure of sequence complexity in the convenient range 0<C<1 for various DNA sequence fragments of a given length. This novel formula is different from the previous LC measure in two respects: in the way vocabulary usage U<sub>i</sub>ref isname=Gabrielian1999 calculated, and because {{math|<var>i</var>}} is not in the range of 2 to N-1 but only up to W. This new limitation on the range of U<sub>i</sub> makes the algorithm substantially more effective without loss of power.
This formula is different from the original LC measure<ref name=Trifonov1990 /> in two respects: in the way vocabulary usage U<sub>i</sub> is calculated, and because {{math|<var>i</var>}} is not in the range of 2 to N-1 but only up to W. This limitation on the range of U<sub>i</sub> makes the algorithm substantially more efficient without loss of power.<ref name=Gabrielian1999 />
In <ref name=TAKLB01>{{Cite journal | doi = 10.1093/bioinformatics/18.5.679| title = Sequence complexity profiles of prokaryotic genomic sequences: A fast algorithm for calculating linguistic complexity| journal = Bioinformatics| volume = 18| issue = 5| pages = 679–88| year = 2002| last1 = Troyanskaya | first1 = O. G.| last2 = Arbell | first2 = O.| last3 = Koren | first3 = Y.| last4 = Landau | first4 = G. M.| last5 = Bolshoy | first5 = A. | pmid=12050064| doi-access = free}}</ref> {{what|date=July 2023}} was used another modified version, wherein linguistic complexity (LC) is defined as the ratio of the number of substrings of any length present in the string to the maximum possible number of substrings. Maximum vocabulary over word sizes 1 to m can be calculated according to the simple formula .<ref name=TAKLB01 />
This sequence analysis complexity calculation can be used to search for conserved regions between compared sequences for the detection of low-complexity regions including simple sequence repeats, imperfect [[Direct repeat|direct]] or [[inverted repeat]]s, polypurine and polypyrimidine [[Triple-stranded DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex]]es).<ref name=Kalendar2011>{{Cite journal | last1 = Kalendar | first1 = R. | last2 = Lee | first2 = D. | last3 = Schulman | first3 = A. H. | doi = 10.1016/j.ygeno.2011.04.009 | title = Java web tools for PCR, in silico PCR, and oligonucleotide assembly and analysis | journal = Genomics | volume = 98 | issue = 2 | pages = 137–144 | year = 2011 | pmid = 21569836 | doi-access = }}</ref>
 
== References ==
{{reflistReflist}}
 
[[Category:Nucleic acids]]
[[Category:Genetics]]
[[Category:Bioinformatics]]