Linguistic sequence complexity: Difference between revisions

added Category:Genetics; removed {{uncategorized}} using HotCat
Rkalendar (talk | contribs)
mNo edit summary
Line 1:
The linguistic complexity (LC) measure<ref>{{cite book| author=[http://evolution.haifa.ac.il/index.php/people/item/40-edward-n-trifonov-phd Edward N. Trifonov] |year=1990| title=Structure and Methods| series= Human Genome Initiative and DNA Recombination| volume=1| pages=69–77|chapter=Making sense of the human genome|publisher=Adenine Press, New York}}</ref> was introduced as a measure of the ‘vocabulary richness’ of a text.
When a [[nucleotide]] sequence is studied as a text written in a four-letter alphabet, the repetitiveness of that text, that is, the extensive repetition of some [[N-gram|N-grams (words)]], can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA_sequence|DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. The original algorithm described in (Trifonov 1990) has since been improved without changing the essence of the linguistic complexity approach.
 
The meaning of LC may be better understood by representing a sequence as a tree of all its subsequences. The most complex sequences have maximally balanced trees, while the degree of imbalance, or tree asymmetry, serves as a complexity measure. The number of nodes at tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words of length {{math|<var>i</var>}} in a given sequence; the number of nodes at tree level {{math|<var>i</var>}} in the most balanced tree, which corresponds to the most complex sequence of length {{math|<var>N</var>}}, is either {{math|4<sup><var>i</var></sup>}} or {{math|<var>N</var>−<var>i</var>+1}}, whichever is smaller. Complexity ({{math|<var>C</var>}}) of a sequence fragment (with a length RW) can be directly calculated as the product of vocabulary-usage measures (U<sub>i</sub>):
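The product of vocabulary-usage measures can be illustrated with a minimal Python sketch. It assumes that each U<sub>i</sub> is the ratio of the observed number of distinct words of length {{math|<var>i</var>}} to the maximal possible number, min(4<sup>i</sup>, N−i+1), as suggested by the tree description above. The function names and the optional <code>max_word</code> cut-off are hypothetical and chosen only for illustration; this is not the published implementation.

```python
def vocabulary_usage(seq: str, i: int, alphabet_size: int = 4) -> float:
    """Assumed form of U_i: observed vocabulary size / maximal vocabulary size."""
    n = len(seq)
    if i < 1 or i > n:
        raise ValueError("word length must be between 1 and len(seq)")
    # Actual vocabulary: distinct words of length i (nodes at tree level i).
    observed = len({seq[k:k + i] for k in range(n - i + 1)})
    # Most balanced tree: either alphabet_size**i or N - i + 1, whichever is smaller.
    max_possible = min(alphabet_size ** i, n - i + 1)
    return observed / max_possible


def linguistic_complexity(seq: str, max_word=None, alphabet_size: int = 4) -> float:
    """Product of U_i for i = 1 .. max_word (defaults to the full sequence length)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    w = n if max_word is None else min(max_word, n)
    c = 1.0
    for i in range(1, w + 1):
        c *= vocabulary_usage(seq, i, alphabet_size)
    return c


if __name__ == "__main__":
    print(linguistic_complexity("ACGT"))   # every word is distinct
    print(linguistic_complexity("AAAA"))   # highly repetitive
```

Under these assumptions, a sequence in which every word is distinct scores 1, and a homopolymer scores close to 0.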
Line 9:
 
 
The sequence complexity calculation can be used to search for conserved regions between compared sequences and to detect low-complexity regions, including simple sequence repeats, imperfect [[Direct_repeat|direct]] or [[Inverted_repeat|inverted repeats]], polypurine and polypyrimidine [[Triple-stranded_DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex|G-quadruplexes]]).<ref>{{cite journal| author=Andrei Gabrielian, Alexander Bolshoy|year=1999| journal=Computers & Chemistry| title=Sequence complexity and DNA curvature| volume=23| pages=263–274| doi=10.1016/S0097-8485(99)00007-8}}</ref><ref>{{cite journal| author=Orlov Y.L., Potapov V.N.|year=2004| journal=Nucleic Acids Res.| title=Complexity: an internet resource for analysis of DNA sequence complexity| volume=32| pages=W628–W633| doi=10.1093/nar/gkh466}}</ref><ref>{{cite journal| author=Svante Janson, Stefano Lonardi, Wojciech Szpankowski|year=2004| journal=Theoretical Computer Science| title=On average sequence complexity | volume=326| pages=213–227| doi=10.1016/j.tcs.2004.06.023}}</ref><ref>{{cite journal| author=Kalendar R, Lee D, Schulman AH |year=2011| journal=Genomics| title=Java web tools for PCR, ''in silico'' PCR, and oligonucleotide assembly and analysis|pmid=21569836|volume=98| issue=2| pages=137–144| doi=10.1016/j.ygeno.2011.04.009}}</ref>
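One way to look for low-complexity regions is to slide a fixed-length window along the sequence and flag windows whose complexity falls below a cut-off. The Python sketch below illustrates that idea only. The window length, step, maximal word length, and threshold are arbitrary assumptions, the function names are hypothetical, and U<sub>i</sub> is assumed to be the ratio of observed to maximal vocabulary size; it does not reproduce the method of any of the cited tools.

```python
def linguistic_complexity(seq: str, max_word: int = 3, alphabet_size: int = 4) -> float:
    """Product of assumed U_i = observed / maximal vocabulary size, i = 1 .. max_word."""
    n = len(seq)
    c = 1.0
    for i in range(1, min(max_word, n) + 1):
        observed = len({seq[k:k + i] for k in range(n - i + 1)})
        c *= observed / min(alphabet_size ** i, n - i + 1)
    return c


def low_complexity_windows(seq, window=20, step=1, max_word=3, threshold=0.5):
    """Return (start, complexity) for every window scoring below the threshold.

    All default parameter values are illustrative assumptions.
    """
    seq = seq.upper()
    hits = []
    # Windows that would run past the end of the sequence are skipped.
    for start in range(0, len(seq) - window + 1, step):
        c = linguistic_complexity(seq[start:start + window], max_word)
        if c < threshold:
            hits.append((start, c))
    return hits


if __name__ == "__main__":
    # A varied 8-mer followed by a homopolymer run; only the run is flagged.
    for start, c in low_complexity_windows("ACGTTGCA" + "AAAAAAAA", window=8, step=8):
        print(start, round(c, 4))
```

In practice the window length and threshold would need tuning for the sequences under study; a short window with a small <code>max_word</code> keeps the scan cheap at the cost of resolution.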
 
 
Line 18:
 
[[Category:Genetics]]
[[Category:Bioinformatics]]