The linguistic complexity (LC) measure <ref>{{cite book| author=[http://evolution.haifa.ac.il/index.php/people/item/40-edward-n-trifonov-phd Edward N. Trifonov] |year=1990| book=Structure & Methods| title=Structure and Methods| series= Human Genome Initiative and DNA Recombination| volume=1| pages=69–77|chapter=Making sense of the human genome|publisher=Adenine Press, New York}}</ref> was introduced as a measure of the ‘vocabulary richness’ of a text.
When a [[nucleotide]] sequence is treated as a text written in a four-letter alphabet, its repetitiveness, that is, the extensive repetition of some [[N-gram|N-grams (words)]], can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA_sequence|DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitive sequences have relatively lower complexity. The original algorithm described by Trifonov (1990) was later improved without changing the essence of the linguistic complexity approach.<ref>{{cite journal| author=Andrei Gabrielian, Alexander Bolshoy|year=1999| journal=Computers & Chemistry| title=Sequence complexity and DNA curvature| volume=23| pages=263–274| doi=10.1016/S0097-8485(99)00007-8}}</ref><ref>{{cite journal| author=Orlov Yuriy Lvovich, Potapov Vladimir Nikilaevich |year=2004| journal=Nucleic Acids Research| title=Complexity: an internet resource for analysis of DNA sequence complexity| volume=32| pages=W628–W633| doi=10.1093/nar/gkh466}}</ref><ref>{{cite journal| author=Svante Janson, Stefano Lonardi, Wojciech Szpankowski|year=2004| journal=Theoretical Computer Science| title=On average sequence complexity | volume=326| pages=213–227| doi=10.1016/j.tcs.2004.06.023}}</ref>
The meaning of LC may be better understood by representing a sequence as a tree of all its subsequences. The most complex sequences have maximally balanced trees, and the degree of tree imbalance (asymmetry) serves as a complexity measure. The number of nodes at tree level {{math|<var>i</var>}} equals the actual vocabulary size of words of length {{math|<var>i</var>}} in the given sequence; in the most balanced tree, which corresponds to the most complex sequence of length {{math|<var>N</var>}}, the number of nodes at level {{math|<var>i</var>}} is either {{math|4<sup><var>i</var></sup>}} or {{math|<var>N</var> − <var>i</var> + 1}}, whichever is smaller. The complexity ({{math|<var>C</var>}}) of a sequence fragment (with a length RW) can be calculated directly as the product of the vocabulary-usage measures ({{math|<var>U</var><sub><var>i</var></sub>}}).
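
The product of vocabulary-usage measures can be illustrated with a minimal Python sketch. It assumes that each usage measure is the ratio of the observed number of distinct words of length ''i'' to the maximum possible number, min(4<sup>''i''</sup>, ''N'' − ''i'' + 1), which is consistent with the tree description but is not spelled out in this passage. The function names and the <code>max_word_len</code> parameter are hypothetical and chosen only for illustration.

```python
def vocabulary_usage(seq: str, i: int, alphabet_size: int = 4) -> float:
    """Assumed definition: distinct words of length i / maximum possible."""
    n = len(seq)
    if not 1 <= i <= n:
        raise ValueError("word length must be between 1 and len(seq)")
    # Actual vocabulary: the set of distinct substrings of length i.
    observed = len({seq[p:p + i] for p in range(n - i + 1)})
    # Vocabulary of the most complex sequence: 4**i or N - i + 1, whichever is smaller.
    max_possible = min(alphabet_size ** i, n - i + 1)
    return observed / max_possible


def linguistic_complexity(seq: str, max_word_len: int) -> float:
    """Product of the usage measures for word lengths 1..max_word_len."""
    c = 1.0
    for i in range(1, min(max_word_len, len(seq)) + 1):
        c *= vocabulary_usage(seq, i)
    return c
```

Under these assumptions, <code>linguistic_complexity("ACGT", 3)</code> gives 1.0, because every possible word position holds a distinct word, while <code>linguistic_complexity("AAAA", 3)</code> gives 1/4 × 1/3 × 1/2 = 1/24.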
Sequence complexity calculation can be used to search for conserved regions between compared sequences and to detect low-complexity regions, including simple sequence repeats, imperfect [[Direct_repeat|direct]] or [[Inverted_repeat|inverted repeats]], polypurine and polypyrimidine [[Triple-stranded_DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex|G-quadruplexes]]).
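
A sliding-window scan is one way such low-complexity regions might be located in practice. The following Python sketch is illustrative only: the window length, step, threshold, and maximum word length are arbitrary assumptions, the function names are hypothetical, and the usage measure is assumed to be the ratio of observed to maximum possible vocabulary size.

```python
def window_complexity(window: str, max_word_len: int = 3) -> float:
    """Product of assumed usage ratios for word lengths 1..max_word_len."""
    n = len(window)
    c = 1.0
    for i in range(1, min(max_word_len, n) + 1):
        observed = len({window[p:p + i] for p in range(n - i + 1)})
        c *= observed / min(4 ** i, n - i + 1)
    return c


def low_complexity_windows(seq: str, window: int = 20, step: int = 1,
                           threshold: float = 0.2, max_word_len: int = 3):
    """Return (start, complexity) for each window scoring below threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        c = window_complexity(seq[start:start + window], max_word_len)
        if c < threshold:  # threshold is an arbitrary illustrative cut-off
            hits.append((start, c))
    return hits
```

For example, every window of a homopolymer run such as <code>"A" * 30</code> falls below the threshold, whereas a window in which all words are distinct scores 1.0 and is not reported.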