Revision as of 06:18, 9 March 2012 by Utcursch (added Category:Genetics; removed {{uncategorized}} using HotCat) → Revision as of 09:03, 9 March 2012 by Rkalendar (minor edit; no edit summary)
Line 1:
The linguistic complexity (LC) measure<ref>{{cite book| author=[http://evolution.haifa.ac.il/index.php/people/item/40-edward-n-trifonov-phd Edward N. Trifonov] |year=1990| book=Structure & Methods| title=Structure and Methods| series= Human Genome Initiative and DNA Recombination| volume=1| pages=69–77|chapter=Making sense of the human genome|publisher=Adenine Press, New York}}</ref> was introduced as a measure of the ‘vocabulary richness’ of a text. When a [[nucleotide]] sequence is studied as a text written in a four-letter alphabet, the repetitiveness of that text, that is, the extensive repetition of some [[N-gram|N-grams (words)]], can be calculated and used as a measure of sequence complexity. Thus, the more complex a [[DNA_sequence|DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities.

The original algorithm described by Trifonov (1990) has since been improved without changing the essence of the linguistic complexity approach.

The meaning of LC may be better understood by representing a sequence as a tree of all subsequences of that sequence. The most complex sequences have maximally balanced trees, while the degree of imbalance, or tree asymmetry, serves as a complexity measure. The number of nodes at tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words of length {{math|<var>i</var>}} in the given sequence; in the most balanced tree, which corresponds to the most complex sequence of length {{math|<var>N</var>}}, the number of nodes at tree level {{math|<var>i</var>}} is either {{math|4<sup><var>i</var></sup>}} or {{math|<var>N</var> − <var>i</var> + 1}}, whichever is smaller.
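The tree description above can be sketched in a few lines of Python. This is a minimal illustration, not the published algorithm: the function names are hypothetical, the cap on level {{math|<var>i</var>}} is taken as the smaller of {{math|4<sup><var>i</var></sup>}} and {{math|<var>N</var> − <var>i</var> + 1}}, and the vocabulary-usage measure is assumed to be the ratio of the actual to the maximal vocabulary size, with the complexity taken as the product of these ratios up to a chosen word length.

```python
from math import prod

ALPHABET_SIZE = 4  # assumption: DNA alphabet A, C, G, T


def vocabulary_size(seq: str, i: int) -> int:
    """Count the distinct words of length i present in seq (nodes at tree level i)."""
    return len({seq[k:k + i] for k in range(len(seq) - i + 1)})


def max_vocabulary_size(n: int, i: int) -> int:
    """Nodes at level i of the most balanced tree: min(4^i, N - i + 1)."""
    return min(ALPHABET_SIZE ** i, n - i + 1)


def vocabulary_usage(seq: str, i: int) -> float:
    """Assumed definition of U_i: actual vocabulary size over the maximal one."""
    return vocabulary_size(seq, i) / max_vocabulary_size(len(seq), i)


def linguistic_complexity(seq: str, max_word: int) -> float:
    """Product of U_i for word lengths 1..max_word (hypothetical helper)."""
    if not 1 <= max_word <= len(seq):
        raise ValueError("max_word must be between 1 and len(seq)")
    return prod(vocabulary_usage(seq, i) for i in range(1, max_word + 1))
```

For example, under these assumptions <code>linguistic_complexity("AAAAAAAA", 3)</code> is lower than <code>linguistic_complexity("ACGTACGT", 3)</code>, because the homopolymer uses only one word at every level of the tree.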
Complexity ({{math|<var>C</var>}}) of a sequence fragment (of length RW) can be directly calculated as the product of vocabulary-usage measures (U<sub>i</sub>):
Line 9:
The sequence complexity calculation can be used to search for conserved regions between compared sequences and to detect low-complexity regions, including simple sequence repeats, imperfect [[Direct_repeat|direct]] or [[Inverted_repeat|inverted repeats]], polypurine and polypyrimidine [[Triple-stranded_DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex|G-quadruplexes]]).<ref>{{cite journal| author=Andrei Gabrielian, Alexander Bolshoy|year=1999| journal=Computers & Chemistry| title=Sequence complexity and DNA curvature| volume=23| pages=263–274| doi=10.1016/S0097-8485(99)00007-8}}</ref><ref>{{cite journal| author=Orlov Y.L., Potapov V.N.|year=2004| journal=Nucleic Acids Res.| title=Complexity: an internet resource for analysis of DNA sequence complexity| volume=32| pages=W628–W633| doi=10.1093/nar/gkh466}}</ref><ref>{{cite journal| author=Svante Janson, Stefano Lonardi, Wojciech Szpankowski|year=2004| journal=Theoretical Computer Science| title=On average sequence complexity | volume=326| pages=213–227| doi=10.1016/j.tcs.2004.06.023}}</ref><ref>{{cite journal| author=Kalendar R, Lee D, Schulman AH |year=2011| journal=Genomics| title=Java web tools for PCR, <i>in silico</i> PCR, and oligonucleotide assembly and analysis|pmid=21569836|volume=98| issue=2| pages=137–144| doi=10.1016/j.ygeno.2011.04.009}}</ref>
Line 18:
[[Category:Genetics]] [[Category:Bioinformatics]]

Linguistic sequence complexity: Difference between revisions