Revision as of 13:26, 11 March 2012 by Rkalendar (224 edits), minor edit, no edit summary
Revision as of 08:48, 14 March 2012 by Stigmatella aurantiaca (extended confirmed user, 8,849 edits), edit summary: Refs go after the punctuation. Added cite doi templates.
Line 1:
The linguistic complexity (LC) measure<ref>{{cite book| author=[http://evolution.haifa.ac.il/index.php/people/item/40-edward-n-trifonov-phd Edward N. Trifonov] |year=1990| book=Structure & Methods| title=Structure and Methods| series= Human Genome Initiative and DNA Recombination| volume=1| pages=69–77|chapter=Making sense of the human genome|publisher=Adenine Press, New York}}</ref> was introduced as a measure of the ‘vocabulary richness’ of a text. When a [[nucleotide]] sequence is studied as a text written in a four-letter alphabet, its repetitiveness, that is, the extensive repetition of some [[N-gram|N-grams (words)]], can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA_sequence|DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. The original algorithm described in (Trifonov 1990) has since been improved without changing the essence of the linguistic complexity approach.<ref name=Gabrielian1999>{{cite journal| author=Andrei Gabrielian, Alexander Bolshoy|year=1999| journal=Computers & Chemistry| title=Sequence complexity and DNA curvature| volume=23| pages=263–274| doi=10.1016/S0097-8485(99)00007-8}}</ref><ref name=Orlov2004>{{cite journal| author=Orlov Yuriy Lvovich, Potapov Vladimir Nikolaevich |year=2004| journal=Nucleic Acids Research| title=Complexity: an internet resource for analysis of DNA sequence complexity| volume=32| pages=W628–W633| doi=10.1093/nar/gkh466}}</ref><ref name=Janson2004>{{cite journal| author=Svante Janson, Stefano Lonardi, Wojciech Szpankowski|year=2004| journal=Theoretical Computer Science| title=On average sequence complexity| volume=326| pages=213–227| doi=10.1016/j.tcs.2004.06.023}}</ref>
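To make the idea of vocabulary richness concrete, the following is a minimal Python sketch, not the implementation from the cited papers. It assumes that the vocabulary usage for word length {{math|<var>i</var>}} is the number of distinct words of that length observed in the sequence divided by the largest number possible, the smaller of 4<sup>i</sup> and N-i+1, and that the overall score is the product of these ratios over word lengths 1 to a chosen maximum. The names <code>vocabulary_usage</code>, <code>linguistic_complexity</code>, and <code>max_word_length</code> are illustrative assumptions.

```python
def vocabulary_usage(seq, i, alphabet_size=4):
    """Fraction of the possible words of length i that actually occur in seq.

    Assumption: the maximum vocabulary is min(alphabet_size**i, N - i + 1),
    where N is the sequence length.
    """
    n = len(seq)
    if i < 1 or i > n:
        raise ValueError("word length must be between 1 and len(seq)")
    # Distinct overlapping words (i-mers) found in the sequence.
    observed = len({seq[k:k + i] for k in range(n - i + 1)})
    possible = min(alphabet_size ** i, n - i + 1)
    return observed / possible


def linguistic_complexity(seq, max_word_length):
    """Product of the vocabulary-usage ratios for word lengths 1..max_word_length."""
    complexity = 1.0
    for i in range(1, max_word_length + 1):
        complexity *= vocabulary_usage(seq, i)
    return complexity
```

Under these assumptions a homopolymer such as <code>"AAAAAAAA"</code> scores far lower than a sequence in which every short word is distinct, such as <code>"ACGTTGCA"</code>, which matches the statement that repetitious sequences have lower complexity.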
The meaning of LC may be better understood by representing a sequence as a tree of all its subsequences. The most complex sequences have maximally balanced trees, and the degree of imbalance, or tree asymmetry, serves as a complexity measure. The number of nodes at tree level {{math|<var>i</var>}} equals the actual vocabulary size of words of length {{math|<var>i</var>}} in the given sequence. In the most balanced tree, which corresponds to the most complex sequence of length N, the number of nodes at tree level {{math|<var>i</var>}} is either 4<sup>i</sup> or N-i+1, whichever is smaller. Complexity ({{math|<var>C</var>}}) of a sequence fragment (of length RW) can be directly calculated as the product of vocabulary-usage measures (U<sub>i</sub>):

Line 9:
The sequence complexity calculation method can be used to search for conserved regions between compared sequences and to detect low-complexity regions, including simple sequence repeats, imperfect [[Direct_repeat|direct]] or [[Inverted_repeat|inverted repeats]], polypurine and polypyrimidine [[Triple-stranded_DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex|G-quadruplexes]]).<ref name=Kalendar2011>{{cite journal| author=Kalendar R, Lee D, Schulman AH |year=2011| journal=Genomics| title=Java web tools for PCR, ''in silico'' PCR, and oligonucleotide assembly and analysis|pmid=21569836|volume=98| issue=2| pages=137–144| doi=10.1016/j.ygeno.2011.04.009}}</ref>

== References ==

Linguistic sequence complexity: Difference between revisions