Revision as of 17:05, 25 March 2017 edit Emir of Wikipedia (talk \| contribs) Extended confirmed users, Page movers, New page reviewers, Rollbackers 53,626 edits Help needed: Complexity measure Tag: Dispenser [2.4] ← Previous edit		Revision as of 17:18, 11 July 2017 edit undo Onel5969 (talk \| contribs) Autopatrolled, Extended confirmed users, Page movers, New page reviewers, Pending changes reviewers, Rollbackers 992,441 edits m Disambiguating links to Complexity measure (link changed to Computer linguistics) using DisamAssist. Next edit →
Line 2: When a [[nucleotide]] sequence is written as text using a four-letter alphabet, the repetitiveness of the text, that is, the repetition of its [[N-gram]]s (words), can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. Subsequent work improved the original algorithm described in [[Edward Trifonov\|Trifonov]] (1990),<ref name=Trifonov1990 /> without changing the essence of the linguistic complexity approach.<ref name=Gabrielian1999>{{Cite journal \| last1 = Gabrielian \| first1 = A. \| title = Sequence complexity and DNA curvature \| doi = 10.1016/S0097-8485(99)00007-8 \| journal = Computers & Chemistry \| volume = 23 \| issue = 3–4 \| pages = 263–201 \| year = 1999 \| pmid = \| pmc = }}</ref><ref name=Orlov2004>{{Cite journal \| last1 = Orlov \| first1 = Y. L. \| last2 = Potapov \| first2 = V. N. \| doi = 10.1093/nar/gkh466 \| title = Complexity: An internet resource for analysis of DNA sequence complexity \| journal = Nucleic Acids Research \| volume = 32 \| issue = Web Server issue \| pages = W628–W633 \| year = 2004 \| pmid = 15215465\| pmc =441604 }}</ref><ref name=Janson2004>{{Cite journal \| last1 = Janson \| first1 = S. \| last2 = Lonardi \| first2 = S. \| last3 = Szpankowski \| first3 = W. \| author3-link = Wojciech Szpankowski \| title = On average sequence complexity \| doi = 10.1016/j.tcs.2004.06.023 \| journal = Theoretical Computer Science \| volume = 326 \| pages = 213–227 \| year = 2004 \| pmid = \| pmc = }}</ref> The meaning of LC may be better understood by regarding the presentation of a sequence as a [[Tree (data structure)\|tree]] of all subsequences of the given sequence. The most complex sequences have maximally balanced trees, while the measure of imbalance or tree asymmetry serves as a [[Computer linguistics\|complexity measure]]~~{{dn\|date=March 2017}}~~. The number of nodes at the tree level {{math\|<var>i</var>}} is equal to the actual vocabulary size of words with the length {{math\|<var>i</var>}} in a given sequence; the number of nodes in the most balanced tree, which corresponds to the most complex sequence of length N, at the tree level {{math\|<var>i</var>}} is either 4<sup>i</sup> or N-j+1, whichever is smaller. Complexity ({{math\|<var>C</var>}}) of a sequence fragment (with a length RW) can be directly calculated as the product of vocabulary-usage measures (U<sub>i</sub>):<ref name=Gabrielian1999 /> {{nb5}} <math>C = U_1 U_2...U_i....U_w </math>

Linguistic sequence complexity: Difference between revisions