'''Linguistic sequence complexity''' (LC) is a measure of the 'vocabulary richness' of a text.<ref name=Trifonov1990>{{cite book| author=[[Edward N. Trifonov]] |year=1990| book=Structure & Methods| title=Structure and Methods| series= Human Genome Initiative and DNA Recombination| volume=1| pages=69–77|chapter=Making sense of the human genome|publisher=Adenine Press, New York}}</ref> When a [[nucleotide]] sequence is written as text using a four-letter alphabet, the repetitiveness of the text, that is, the repetition of its [[N-gram|N-grams (words)]], can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. Subsequent work improved the original algorithm described in ([[Edward Trifonov|Trifonov]] 1990)<ref name=Trifonov1990/> without changing the essence of the linguistic complexity approach.<ref name=Gabrielian1999>{{cite doi|10.1016/S0097-8485(99)00007-8|noedit}}</ref><ref name=Orlov2004>{{cite doi|10.1093/nar/gkh466|noedit}}</ref><ref name=Janson2004>{{cite doi|10.1016/j.tcs.2004.06.023|noedit}}</ref>

The meaning of LC may be better understood by representing a sequence as a [[Tree (data structure)|tree]] of all subsequences of the given sequence. The most complex sequences have maximally balanced trees, while the degree of imbalance, or tree asymmetry, serves as a complexity measure. The number of nodes at tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words of length {{math|<var>i</var>}} in a given sequence. In the most balanced tree, which corresponds to the most complex sequence of length N, the number of nodes at tree level {{math|<var>i</var>}} is either 4<sup>i</sup> or N-i+1, whichever is smaller.
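
As an illustrative sketch (not taken from the cited works), the per-level vocabulary sizes described above can be counted directly in Python. The function name <code>vocabulary_sizes</code>, its parameters and the example sequence are hypothetical. A four-letter alphabet is assumed, and the maximal vocabulary size at level i is taken as the smaller of 4<sup>i</sup> and N-i+1.

```python
def vocabulary_sizes(seq, max_len, alphabet_size=4):
    """Return a list of (i, actual, maximal) for word lengths i = 1..max_len.

    actual  -- number of distinct words of length i found in seq
               (the node count at tree level i)
    maximal -- node count at level i of the most balanced tree,
               assumed here to be min(alphabet_size**i, N - i + 1)
    """
    n = len(seq)
    rows = []
    for i in range(1, min(max_len, n) + 1):
        # Slide a window of width i over the sequence and keep distinct words.
        words = {seq[k:k + i] for k in range(n - i + 1)}
        maximal = min(alphabet_size ** i, n - i + 1)
        rows.append((i, len(words), maximal))
    return rows


# Hypothetical example sequence, chosen only for illustration.
for i, actual, maximal in vocabulary_sizes("ACGTACGTAA", 3):
    print(i, actual, maximal)
```

For this short, partly repetitive example the actual vocabulary falls below the maximum at levels 2 and 3, which is the tree imbalance the text refers to.
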
The complexity ({{math|<var>C</var>}}) of a sequence fragment of length RW can be calculated directly as the product of vocabulary-usage measures (U<sub>i</sub>):<ref name=Gabrielian1999 />
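
A minimal Python sketch of such a product is given here under stated assumptions: each U<sub>i</sub> is taken to be the ratio of the actual vocabulary size of words of length i to the maximal possible size, min(4<sup>i</sup>, N-i+1), in line with the tree description. The function name <code>linguistic_complexity</code> and the upper word length <code>max_word_len</code> are hypothetical and are not drawn from the cited works.

```python
from math import prod


def linguistic_complexity(seq, max_word_len, alphabet_size=4):
    """Product of vocabulary-usage measures U_1 * U_2 * ... * U_W.

    Assumption: U_i = (distinct words of length i in seq)
                      / min(alphabet_size**i, N - i + 1),
    where N = len(seq) and W = max_word_len (capped at N).
    """
    n = len(seq)
    usages = []
    for i in range(1, min(max_word_len, n) + 1):
        actual = len({seq[k:k + i] for k in range(n - i + 1)})
        maximal = min(alphabet_size ** i, n - i + 1)
        usages.append(actual / maximal)
    return prod(usages)


# Hypothetical inputs: a partly repetitive fragment and a homopolymer.
print(linguistic_complexity("ACGTACGTAA", 3))
print(linguistic_complexity("AAAAAAAA", 3))
```

Under these assumptions the value lies between 0 and 1, and a homopolymer scores far lower than a sequence that uses most of the words available to it.
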

Linguistic sequence complexity: Difference between revisions