'''Linguistic sequence complexity''' (LC) is a measure of the 'vocabulary richness' of a text.<ref name=Trifonov1990>{{cite book| author=[
When a [[nucleotide]] sequence is written as text over a four-letter alphabet, the repetitiveness of the text, that is, the repetition of its [[N-gram|N-grams (words)]], can be calculated and used as a measure of sequence complexity. Thus, the more complex a [[DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. Subsequent work improved the original algorithm described by [[Edward Trifonov|Trifonov]] (1990)<ref name=Trifonov1990/> without changing the essence of the linguistic complexity approach.<ref name=Gabrielian1999>{{cite doi|10.1016/S0097-8485(99)00007-8|noedit}}</ref><ref name=Orlov2004>{{cite doi|10.1093/nar/gkh466|noedit}}</ref><ref name=Janson2004>{{cite doi|10.1016/j.tcs.2004.06.023|noedit}}</ref>
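As an illustration of the vocabulary idea, the following minimal sketch counts the distinct words (N-grams) of each length in a sequence. It is not the published algorithm of any of the cited works; the function name <code>vocabulary_sizes</code>, its parameters, and the two example sequences are assumptions made here for demonstration only.

```python
def vocabulary_sizes(sequence, max_word_length):
    """Count the distinct words (N-grams) of each length in a sequence.

    Illustrative helper (hypothetical name, not from the cited papers).
    Returns a dict mapping word length k to the number of distinct
    overlapping words of length k, for k = 1 .. max_word_length.
    """
    sequence = sequence.upper()
    sizes = {}
    for k in range(1, max_word_length + 1):
        # Slide a window of width k over the sequence and collect each
        # word once; the range is empty when k exceeds the sequence length.
        words = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        sizes[k] = len(words)
    return sizes


if __name__ == "__main__":
    # Made-up example sequences, chosen only to contrast the two cases.
    repetitive = "ACACACACACAC"
    varied = "ACGTTGCAAGCT"
    print(vocabulary_sizes(repetitive, 3))  # {1: 2, 2: 2, 3: 2}
    print(vocabulary_sizes(varied, 3))      # {1: 4, 2: 10, 3: 10}
```

The repetitive sequence never uses more than two distinct words at any length, while the varied sequence of the same length uses all four letters and ten distinct words of lengths two and three, which is the sense in which its vocabulary is richer.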
The meaning of LC may be better understood by representing a sequence as a [[Tree (data structure)|tree]] of all subsequences of the given sequence. The most complex sequences have maximally balanced trees, while the degree of imbalance, or tree asymmetry, serves as a complexity measure. The number of nodes at tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words of length {{math|<var>i</var>}} in a given sequence; the number of nodes at tree level {{math|<var>i</var>}} in the most balanced tree, which corresponds to the most complex sequence of length N, is either 4<sup>i</sup> or N-i+1, whichever is smaller. Complexity ({{math|<var>C</var>}}) of a sequence fragment (with a length RW) can be directly calculated as the product of vocabulary-usage measures (U<sub>i</sub>):<ref name=Gabrielian1999 />