Revision as of 11:17, 14 March 2012 by Stigmatella aurantiaca (edit summary: Opening made bold, removed repetitiveness.), compared with revision as of 03:49, 15 March 2012 by Stigmatella aurantiaca (edit summary: Added wikilinks, streamlined a few sentences.)
{{Original research|date=March 2012}}
'''Linguistic sequence complexity''' (LC) is a measure of the 'vocabulary richness' of a text.<ref name=Trifonov1990>{{cite book| author=[http://evolution.haifa.ac.il/index.php/people/item/40-edward-n-trifonov-phd Edward N. Trifonov] |year=1990| book=Structure & Methods| title=Structure and Methods| series= Human Genome Initiative and DNA Recombination| volume=1| pages=69–77|chapter=Making sense of the human genome|publisher=Adenine Press, New York}}</ref> When a [[nucleotide]] sequence is written as a text using a four-letter alphabet, the repetitiveness of the text, that is, the repetition of its [[N-gram|N-grams (words)]], can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA_sequence|DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. The original algorithm described in (Trifonov 1990)<ref name=Trifonov1990/> has since been improved without changing the essence of the linguistic complexity approach.{{Or|date=March 2012}}<ref name=Gabrielian1999>{{cite doi|10.1016/S0097-8485(99)00007-8|noedit}}</ref><ref name=Orlov2004>{{cite doi|10.1093/nar/gkh466|noedit}}</ref><ref name=Janson2004>{{cite doi|10.1016/j.tcs.2004.06.023|noedit}}</ref>

The meaning of LC may be better understood by representing a sequence as a [[Tree (data structure)|tree]] of all of its subsequences. The most complex sequences have maximally balanced trees, so the degree of imbalance, or tree asymmetry, serves as a complexity measure.
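As a rough illustration of what "vocabulary richness" means here, the distinct words of each length in a sequence can simply be counted. The sketch below is not taken from the cited sources; the function name <code>vocabulary_sizes</code> and the two short example sequences are assumptions chosen for illustration.

```python
def vocabulary_sizes(sequence, max_word_length):
    """Return {k: number of distinct words of length k} for k = 1..max_word_length.

    Hypothetical helper for illustration; a "word" is any contiguous
    substring (k-mer) of the sequence.
    """
    sizes = {}
    for k in range(1, max_word_length + 1):
        # Slide a window of width k along the sequence and collect unique words.
        words = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        sizes[k] = len(words)
    return sizes


# A varied sequence uses many different words at every length ...
rich = vocabulary_sizes("ACGGGAAGCTGATTCCA", 4)
# ... whereas a simple repeat keeps reusing the same two words.
repetitive = vocabulary_sizes("ACACACACACACACACA", 4)

print(rich)
print(repetitive)
```

The repetitive sequence never has more than two distinct words at any length, which is the sense in which its vocabulary is poor.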
The number of nodes at the tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words with the length {{math|<var>i</var>}} in a given sequence; the number of nodes in the most balanced tree, which corresponds to the most complex sequence of length N, at the tree level {{math|<var>i</var>}} is either 4<sup>i</sup> or N-i+1, whichever is smaller.

Complexity ({{math|<var>C</var>}}) of a sequence fragment (with a length RW) can be directly calculated as the product of vocabulary-usage measures (U<sub>i</sub>):{{Citation needed|date=March 2012}}

<math>C = U_1 U_2 \cdots U_i \cdots U_W</math>

Vocabulary usage for [[oligomers]] of a given size {{math|<var>i</var>}} can be defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, for the sequence ACGGGAAGCTGATTCCA, U<sub>2</sub> = 14/16, as it contains 14 of 16 possible different dinucleotides; U<sub>3</sub> for the same sequence = 15/15, and U<sub>4</sub> = 14/14. For the sequence ACACACACACACACACA, U<sub>1</sub> = 1/2; U<sub>2</sub> = 2/16 = 0.125, as it has a simple vocabulary of only two dinucleotides; U<sub>3</sub> for this sequence = 2/15.

Only k-tuples with k from two to W are considered, where W depends on RW. For RW values less than 18, W is equal to 3; for RW less than 67, W is equal to 4; for RW<260, W=5; for RW<1029, W=6, and so on.{{Clarify|post-text=W looks like a logarithmic measure, but the numbers don't check out very well on a calculator.|date=March 2012}} The value of {{math|<var>C</var>}} provides a measure of sequence complexity in the range 0<C<1 for various DNA sequence fragments of a given length.{{Citation needed|date=March 2012}}

This formula differs from the previous LC measure in two respects: in the way vocabulary usage U<sub>i</sub> is calculated, and in that {{math|<var>i</var>}} does not range from 2 to N-1 but only up to W.
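The calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation. All function names are hypothetical. The rule used to choose W is only inferred from the listed thresholds, which coincide with 4<sup>k</sup>+k (18 = 4<sup>2</sup>+2, 67 = 4<sup>3</sup>+3, 260 = 4<sup>4</sup>+4, 1029 = 4<sup>5</sup>+5); how the series continues beyond them is an assumption. Because the product formula starts at U<sub>1</sub> while the text considers k-tuples from two upward, the starting word length is left as a parameter.

```python
from fractions import Fraction

ALPHABET_SIZE = 4  # A, C, G, T


def vocabulary_usage(sequence, k):
    """U_k: distinct k-mers observed / maximum possible for this length."""
    n = len(sequence)
    observed = len({sequence[i:i + k] for i in range(n - k + 1)})
    # The most complex sequence has min(4**k, N - k + 1) distinct k-mers.
    max_possible = min(ALPHABET_SIZE ** k, n - k + 1)
    return Fraction(observed, max_possible)


def max_word_length(window_length):
    """W for a window of length RW.

    Assumption: W is the smallest value >= 3 such that
    RW < 4**(W - 1) + (W - 1), which reproduces the listed
    thresholds 18, 67, 260 and 1029.
    """
    w = 3
    while window_length >= ALPHABET_SIZE ** (w - 1) + (w - 1):
        w += 1
    return w


def complexity(sequence, first_k=1):
    """C as the product of U_k for k = first_k..W (first_k is an assumption)."""
    w = max_word_length(len(sequence))
    if len(sequence) < w:
        raise ValueError("sequence is shorter than the longest word length W")
    c = Fraction(1)
    for k in range(first_k, w + 1):
        c *= vocabulary_usage(sequence, k)
    return c


print(complexity("ACGGGAAGCTGATTCCA"))  # 7/8
print(complexity("ACACACACACACACACA"))  # 1/120
```

Exact fractions are used so that the results can be compared directly with the worked values of U<sub>2</sub> and U<sub>3</sub>; passing <code>first_k=2</code> drops U<sub>1</sub> from the product.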
This new limitation on the range of {{math|<var>i</var>}} makes the algorithm substantially more effective without loss of power.{{Or|date=March 2012}}

This complexity calculation can be used in sequence analysis to search for conserved regions between compared sequences and to detect low-complexity regions, including simple sequence repeats, imperfect [[Direct_repeat|direct]] or [[Inverted_repeat|inverted repeats]], polypurine and polypyrimidine [[Triple-stranded_DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex|G-quadruplexes]]).<ref name=Kalendar2011>{{cite doi|10.1016/j.ygeno.2011.04.009|noedit}}</ref>

== References ==

Linguistic sequence complexity: Difference between revisions