The linguistic complexity (LC) measure <ref>{{cite book| author=E.N Trifonov |year=1990| book=Structure & Methods| title=Structure and Methods| series= Human Genome Initiative and DNA Recombination| volume=1| pages=69–77|chapter=Making sense of the human genome|publisher=Adenine Press, New York}}</ref> was introduced as a measure of the ‘vocabulary richness’ of a text.
When a [[nucleotide]] sequence is studied as a text written in the four-letter alphabet, the repetitiveness of such a text, that is, the extensive repetition of some [[N-gram|N-grams (words)]], can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[
The meaning of LC may be better understood by representing a sequence as a tree of all its subsequences. The most complex sequences have maximally balanced trees, while the degree of imbalance, or tree asymmetry, serves as a complexity measure. The number of nodes at tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words of length {{math|<var>i</var>}} in the given sequence; the number of nodes at tree level {{math|<var>i</var>}} in the most balanced tree, which corresponds to the most complex sequence of length N, is either 4<sup>i</sup> or N-i+1, whichever is smaller. Complexity ({{math|<var>C</var>}}) of a sequence fragment (of length RW) can be directly calculated as the product of vocabulary-usage measures (U<sub>i</sub>):
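The calculation described above can be illustrated with a minimal Python sketch. It assumes that each U<sub>i</sub> is the ratio of the observed vocabulary size to the maximum possible one (the smaller of 4<sup>i</sup> and N-i+1), and that the product runs over word lengths from 1 up to a chosen limit. The function names and the <code>max_word_len</code> parameter are hypothetical and are not taken from the cited source.

```python
def vocabulary_usage(seq, i, alphabet_size=4):
    """U_i: observed number of distinct words of length i, divided by the
    maximum possible number, min(alphabet_size**i, N - i + 1)."""
    n = len(seq)
    if i < 1 or i > n:
        raise ValueError("word length must be between 1 and len(seq)")
    # Distinct words of length i = nodes at tree level i.
    observed = len({seq[k:k + i] for k in range(n - i + 1)})
    max_possible = min(alphabet_size ** i, n - i + 1)
    return observed / max_possible


def linguistic_complexity(seq, max_word_len):
    """C as the product of U_i for i = 1..max_word_len.

    max_word_len is an assumed upper limit on word length; the text
    above does not specify one.
    """
    c = 1.0
    for i in range(1, max_word_len + 1):
        c *= vocabulary_usage(seq, i)
    return c


if __name__ == "__main__":
    print(linguistic_complexity("ACGT", 2))
    print(linguistic_complexity("AAAA", 2))
```

Under these assumptions, a fragment that uses every possible word at each level scores 1, while a homopolymer such as <code>AAAA</code> scores much lower because only one word is observed at each level.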