The linguistic complexity (LC) measure<ref>{{cite book| author=[http://evolution.haifa.ac.il/index.php/people/item/40-edward-n-trifonov-phd Edward N. Trifonov] |year=1990| title=Structure and Methods| series= Human Genome Initiative and DNA Recombination| volume=1| pages=69–77|chapter=Making sense of the human genome|publisher=Adenine Press, New York}}</ref> was introduced as a measure of the ‘vocabulary
When a [[nucleotide]] sequence is studied as a text written in a four-letter alphabet, the repetitiveness of that text, that is, the extensive repetition of some [[N-gram|N-grams (words)]], can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. The original algorithm described by Trifonov (1990) was later improved without changing the essence of the linguistic complexity approach.
The meaning of LC may be better understood by representing a sequence as a tree of all its subsequences. The most complex sequences have maximally balanced trees, and the degree of imbalance, or tree asymmetry, serves as a complexity measure. The number of nodes at tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words of length {{math|<var>i</var>}} in the given sequence; the number of nodes at tree level {{math|<var>i</var>}} in the most balanced tree, which corresponds to the most complex sequence of length N, is either 4<sup>i</sup> or N−i+1, whichever is smaller. The complexity ({{math|<var>C</var>}}) of a sequence fragment (of length RW) can be calculated directly as the product of vocabulary-usage measures (U<sub>i</sub>):
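The product described above can be illustrated with a minimal Python sketch. It assumes that each U<sub>i</sub> is the ratio of the observed number of distinct words of length {{math|<var>i</var>}} to the maximum possible number, min(4<sup>i</sup>, N−i+1), and that the product runs over word lengths 1 to a chosen maximum. The names <code>linguistic_complexity</code>, <code>max_word_len</code> and <code>alphabet_size</code> are hypothetical and not taken from the cited works.

```python
def linguistic_complexity(seq, max_word_len=None, alphabet_size=4):
    """Product of vocabulary-usage ratios U_i for word lengths 1..max_word_len.

    Assumption: U_i = (distinct words of length i observed in seq)
                      / min(alphabet_size**i, N - i + 1),
    where N is the length of seq.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    if max_word_len is None:
        max_word_len = n  # use every word length up to the full sequence

    complexity = 1.0
    for i in range(1, min(max_word_len, n) + 1):
        # Actual vocabulary: distinct substrings of length i
        observed = len({seq[k:k + i] for k in range(n - i + 1)})
        # Largest possible vocabulary at this word length
        possible = min(alphabet_size ** i, n - i + 1)
        complexity *= observed / possible
    return complexity
```

Under these assumptions, a sequence with no repeated words, such as <code>"ACGT"</code>, reaches the maximum value of 1, while a homopolymer such as <code>"AAAA"</code> scores far lower because only one word is observed at each length.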
The sequence complexity calculation can be used to search for conserved regions between compared sequences and to detect low-complexity regions, including simple sequence repeats, imperfect [[Direct repeat|direct]] or [[Inverted repeat|inverted repeats]], polypurine and polypyrimidine [[Triple-stranded DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex|G-quadruplexes]]).<ref>{{cite journal| author=Andrei Gabrielian, Alexander Bolshoy|year=1999| journal=Computers & Chemistry| title=Sequence complexity and DNA curvature| volume=23| pages=263–274| doi=10.1016/S0097-8485(99)00007-8}}</ref><ref>{{cite journal| author=Orlov Y.L.
[[Category:Genetics]]
[[Category:Bioinformatics]]