Linguistic sequence complexity: Difference between revisions

Rkalendar (talk | contribs)
mNo edit summary
Refs go after the punctuation. Added cite doi templates.
Line 1:
The linguistic complexity (LC) measure<ref>{{cite book| author=[http://evolution.haifa.ac.il/index.php/people/item/40-edward-n-trifonov-phd Edward N. Trifonov] |year=1990| title=Structure and Methods| series= Human Genome Initiative and DNA Recombination| volume=1| pages=69–77|chapter=Making sense of the human genome|publisher=Adenine Press, New York}}</ref> was introduced as a measure of the ‘vocabulary richness’ of a text.
When a [[nucleotide]] sequence is studied as a text written in a four-letter alphabet, the repetitiveness of that text, that is, the extensive repetition of some [[N-gram|N-grams (words)]], can be calculated and used as a measure of sequence complexity. Thus, the more complex a [[DNA_sequence|DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. The original algorithm described by Trifonov (1990) was later improved without changing the essence of the linguistic complexity approach.<ref name=Gabrielian1999>{{cite journaldoi| author=Andrei Gabrielian, Alexander Bolshoy|year=1999| journal=Computers & Chemistry| title=Sequence complexity and DNA curvature| volume=23| pages=263–274| doi=10.1016/S0097-8485(99)00007-8|noedit}}</ref><ref name=Orlov2004>{{cite journaldoi| author=Orlov Yuriy Lvovich, Potapov Vladimir Nikolaevich |year=2004| journal=Nucleic Acids Research| title=Complexity: an internet resource for analysis of DNA sequence complexity| volume=32| pages=W628–W633| doi=10.1093/nar/gkh466|noedit}}</ref><ref name=Janson2004>{{cite journaldoi| author=Svante Janson, Stefano Lonardi, Wojciech Szpankowski|year=2004| journal=Theoretical Computer Science| title=On average sequence complexity | volume=326| pages=213–227| doi=10.1016/j.tcs.2004.06.023|noedit}}</ref>
 
The meaning of LC may be better understood by representing a sequence as a tree of all of its subsequences. The most complex sequences have maximally balanced trees, while the degree of imbalance, or tree asymmetry, serves as a complexity measure. The number of nodes at tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words of length {{math|<var>i</var>}} in a given sequence; the number of nodes at tree level {{math|<var>i</var>}} in the most balanced tree, which corresponds to the most complex sequence of length N, is either 4<sup>i</sup> or N-i+1, whichever is smaller. Complexity ({{math|<var>C</var>}}) of a sequence fragment (of length RW) can be directly calculated as the product of vocabulary-usage measures (U<sub>i</sub>):
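As an illustration only, the product of vocabulary-usage measures can be sketched in Python. The sketch assumes that U<sub>i</sub> is the ratio of the observed number of distinct words of length i to the maximum possible number, min(4<sup>i</sup>, N-i+1), as the tree description above suggests. The function names and the <code>max_word_len</code> cut-off are hypothetical and are not taken from the cited papers.

```python
def vocabulary_usage(seq, i, alphabet_size=4):
    """Assumed U_i: distinct words of length i divided by the maximum possible."""
    n = len(seq)
    if i < 1 or i > n:
        raise ValueError("word length must be between 1 and len(seq)")
    # Actual vocabulary: the set of distinct substrings of length i.
    observed = len({seq[k:k + i] for k in range(n - i + 1)})
    # Largest vocabulary a sequence of length n could have at this word length.
    max_possible = min(alphabet_size ** i, n - i + 1)
    return observed / max_possible


def linguistic_complexity(seq, max_word_len):
    """Product of U_i for i = 1 .. max_word_len (hypothetical cut-off)."""
    c = 1.0
    for i in range(1, max_word_len + 1):
        c *= vocabulary_usage(seq, i)
    return c


if __name__ == "__main__":
    for s in ("ACGT", "AAAA", "ACGTACGT"):
        print(s, linguistic_complexity(s, 2))
```

Under these assumptions a sequence that uses every possible word at each length scores 1, while a homopolymer such as AAAA scores far lower, because only one word is observed at every length.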
Line 9:
 
 
The sequence complexity calculation can be used to search for conserved regions between compared sequences and to detect low-complexity regions, including simple sequence repeats, imperfect [[Direct_repeat|direct]] or [[Inverted_repeat|inverted repeats]], polypurine and polypyrimidine [[Triple-stranded_DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex|G-quadruplexes]]).<ref name=Kalendar2011>{{cite journaldoi| author=Kalendar R, Lee D, Schulman AH |year=2011| journal=Genomics| title=Java web tools for PCR, <i>in silico</i> PCR, and oligonucleotide assembly and analysis|pmid=21569836|volume=98| issue=2| pages=137–144| doi=10.1016/j.ygeno.2011.04.009|noedit}}</ref>
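One possible way to locate low-complexity regions is to slide a fixed-length window along the sequence and flag windows whose complexity falls below a cut-off. The following Python sketch is an assumption-laden illustration and not the method of the cited tools: the window length, step, threshold, maximum word length and all function names are hypothetical, and U<sub>i</sub> is assumed to be the observed vocabulary size divided by min(4<sup>i</sup>, N-i+1).

```python
def linguistic_complexity(seq, max_word_len=3, alphabet_size=4):
    """Product of assumed usage ratios U_i for word lengths 1 .. max_word_len."""
    n = len(seq)
    c = 1.0
    for i in range(1, min(max_word_len, n) + 1):
        observed = len({seq[k:k + i] for k in range(n - i + 1)})
        c *= observed / min(alphabet_size ** i, n - i + 1)
    return c


def low_complexity_windows(seq, window=12, step=1, threshold=0.5, max_word_len=3):
    """Return (start, complexity) for each window scoring below the threshold.

    All default values are illustrative choices, not published settings.
    """
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        c = linguistic_complexity(seq[start:start + window], max_word_len)
        if c < threshold:
            hits.append((start, c))
    return hits


if __name__ == "__main__":
    # A mixed 12-mer followed by a poly-A run of the same length.
    example = "ACGTTGCAGTCA" + "A" * 12
    for start, c in low_complexity_windows(example, window=12, step=12):
        print(start, round(c, 4))
```

A smaller step gives finer localisation at the cost of more windows to score; overlapping hits would normally be merged into a single region, which this sketch leaves out.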
 
 
== References ==