Content deleted Content added
m Disambiguating links to Complexity measure (link changed to Computer linguistics) using DisamAssist. |
|||
Line 1:
'''Linguistic sequence complexity''' (LC) is a measure of the 'vocabulary richness' of a genetic text in [[gene sequence]]s.<ref name=Trifonov1990>{{cite book| author=[[Edward N. Trifonov]] |year=1990| title=Structure and Methods, Vol. 1 | series=Human Genome Initiative and DNA Recombination; Proceedings of the Sixth Conversation in the Discipline Biomolecular Stereodynamics |pages=69–77 |chapter=Making sense of the human genome|publisher=Adenine Press |___location=Albany, New York}}</ref>
When a [[nucleotide]] sequence is written as text using a four-letter alphabet, the repetitiveness of the text, that is, the repetition of its [[N-gram]]s (words), can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. Subsequent work improved the original algorithm described in [[Edward Trifonov|Trifonov]] (1990),<ref name=Trifonov1990 /> without changing the essence of the linguistic complexity approach.<ref name=Gabrielian1999>{{Cite journal | last1 = Gabrielian | first1 = A. | title = Sequence complexity and DNA curvature | doi = 10.1016/S0097-8485(99)00007-8 | journal = Computers & Chemistry | volume = 23 | issue = 3–4 | pages =
The meaning of LC may be better understood by regarding the presentation of a sequence as a [[Tree (data structure)|tree]] of all subsequences of the given sequence. The most complex sequences have maximally balanced trees, while the measure of imbalance or tree asymmetry serves as a [[Computer linguistics|complexity measure]]. The number of nodes at the tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words with the length {{math|<var>i</var>}} in a given sequence; the number of nodes in the most balanced tree, which corresponds to the most complex sequence of length N, at the tree level {{math|<var>i</var>}} is either 4<sup>i</sup> or N-j+1, whichever is smaller. Complexity ({{math|<var>C</var>}}) of a sequence fragment (with a length RW) can be directly calculated as the product of vocabulary-usage measures (U<sub>i</sub>):<ref name=Gabrielian1999 />
|