'''Linguistic sequence complexity''' (LC) is a measure of the 'vocabulary richness' of a genetic text in [[gene sequence]]s.<ref name=Trifonov1990>{{cite book| author=
When a [[nucleotide]] sequence is written as text using a four-letter alphabet, the repetitiveness of the text, that is, the repetition of its [[N-gram]]s (words), can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. Subsequent work improved the original algorithm described in [[Edward Trifonov|Trifonov]] (1990),<ref name=Trifonov1990 /> without changing the essence of the linguistic complexity approach.<ref name=Gabrielian1999>{{Cite journal | last1 = Gabrielian | first1 = A. | title = Sequence complexity and DNA curvature | doi = 10.1016/S0097-8485(99)00007-8 | journal = Computers & Chemistry | volume = 23 | issue = 3–4 | pages =
The meaning of LC may be better understood by regarding the presentation of a sequence as a [[Tree (data structure)|tree]] of all subsequences of the given sequence. The most complex sequences have maximally balanced trees, while the measure of imbalance or tree asymmetry serves as a [[Computer linguistics|complexity measure]]. The number of nodes at the tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words with the length {{math|<var>i</var>}} in a given sequence; the number of nodes in the most balanced tree, which corresponds to the most complex sequence of length N, at the tree level {{math|<var>i</var>}} is either 4<sup>i</sup> or N-
{{nb5}} <math>C = U_1 U_2 \cdots U_i \cdots U_W</math>
Vocabulary usage for [[oligomers]] of a given size {{math|<var>i</var>}} can be defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, for the sequence ACGGGAAGCTGATTCCA, U<sub>2</sub> = 14/16, as it contains 14 of the 16 possible different dinucleotides; U<sub>3</sub> = 15/15 and U<sub>4</sub> = 14/14. For the sequence ACACACACACACACACA, U<sub>1</sub> = 1/2; U<sub>2</sub> = 2/16 = 0.125, as it has a simple vocabulary of only two dinucleotides; and U<sub>3</sub> = 2/15. Only k-tuples with k from two to W are considered, where W depends on RW. For RW values less than 18, W is equal to 3; for RW less than 67, W is equal to 4; for RW < 260, W = 5; for RW < 1029, W = 6; and so on. The value of {{math|<var>C</var>}} provides a measure of sequence complexity in the range 0<C<1 for various DNA sequence fragments of a given length.<ref name=Gabrielian1999 />
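The worked values above can be checked with a short script. The following is a minimal sketch, not the published implementation: the function name <code>vocabulary_usage</code> is hypothetical, and the maximal vocabulary size is assumed to be the smaller of 4<sup>k</sup> and the number of k-mer positions in the sequence, which is the reading the examples 14/16, 15/15 and 14/14 suggest.

```python
from fractions import Fraction


def vocabulary_usage(seq: str, k: int) -> Fraction:
    """Ratio of distinct k-mers observed in seq to the maximum possible.

    Assumption: the maximum is min(4**k, number of k-mer positions),
    inferred from the worked examples rather than taken from the paper.
    """
    n_positions = len(seq) - k + 1  # number of k-mer start positions
    if n_positions <= 0:
        raise ValueError("k must not exceed the sequence length")
    observed = {seq[i:i + k] for i in range(n_positions)}
    # A sequence cannot contain more distinct k-mers than it has positions,
    # nor more than the 4**k words a four-letter alphabet allows.
    max_vocabulary = min(4 ** k, n_positions)
    return Fraction(len(observed), max_vocabulary)


if __name__ == "__main__":
    for name, seq in (("mixed", "ACGGGAAGCTGATTCCA"),
                      ("repeat", "ACACACACACACACACA")):
        for k in (1, 2, 3, 4):
            print(name, k, vocabulary_usage(seq, k))
```

The ratios are returned as exact fractions so that they can be compared directly with the values quoted in the text; note that <code>Fraction</code> prints them in lowest terms (14/16 appears as 7/8).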
This formula is different from the original LC measure<ref name=Trifonov1990 /> in two respects: in the way vocabulary usage U<sub>i</sub> is calculated, and because {{math|<var>i</var>}} is not in the range of 2 to N-1 but only up to W. This limitation on the range of U<sub>i</sub> makes the algorithm substantially more efficient without loss of power.<ref name=Gabrielian1999 />
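The restricted product can be sketched in the same way. The code below is illustrative only and rests on three assumptions that the text does not state outright: that RW is the length of the analyzed window (here, the whole input sequence), that the product runs over k from 2 to W, and that the listed cut-offs 18, 67, 260 and 1029 continue the pattern 4<sup>W−1</sup> + (W − 1) beyond W = 6. The names <code>max_word_length</code>, <code>usage</code> and <code>complexity</code> are hypothetical.

```python
from fractions import Fraction
from math import prod


def max_word_length(rw: int) -> int:
    """Largest word length W for a window of size rw."""
    w = 3
    # The cut-offs 18, 67, 260, 1029 equal 4**(w-1) + (w-1) for w = 3..6;
    # extending that pattern to larger windows is an assumption.
    while rw >= 4 ** (w - 1) + (w - 1):
        w += 1
    return w


def usage(seq: str, k: int) -> Fraction:
    """Distinct k-mers in seq divided by min(4**k, number of k-mer positions)."""
    n_positions = len(seq) - k + 1
    distinct = {seq[i:i + k] for i in range(n_positions)}
    return Fraction(len(distinct), min(4 ** k, n_positions))


def complexity(seq: str) -> Fraction:
    """Product of U_k for k = 2..W, treating the whole sequence as one window."""
    if len(seq) < 3:
        raise ValueError("sequence must contain at least 3 letters")
    w = max_word_length(len(seq))
    return prod((usage(seq, k) for k in range(2, w + 1)), start=Fraction(1))


if __name__ == "__main__":
    print(complexity("ACGGGAAGCTGATTCCA"))
    print(complexity("ACACACACACACACACA"))
```

Under these assumptions both 17-letter example sequences use W = 3, so their complexities reduce to U<sub>2</sub>·U<sub>3</sub>: 14/16 for the first sequence and (2/16)·(2/15) for the second.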
In <ref name=TAKLB01>{{Cite journal | doi = 10.1093/bioinformatics/18.5.679| title = Sequence complexity profiles of prokaryotic genomic sequences: A fast algorithm for calculating linguistic complexity| journal = Bioinformatics| volume = 18| issue = 5| pages =
This sequence analysis complexity calculation can be used to search for conserved regions between compared sequences for the detection of low-complexity regions including simple sequence repeats, imperfect [[Direct repeat|direct]] or [[inverted repeat]]s, polypurine and polypyrimidine [[Triple-stranded DNA|triple-stranded DNA structures]], and four-stranded structures (such as [[G-quadruplex]]es).<ref name=Kalendar2011>{{
== References ==
{{
[[Category:Nucleic acids]]