The linguistic complexity (LC) measure<ref name=Trifonov1990>{{cite book| author=[http://evolution.haifa.ac.il/index.php/people/item/40-edward-n-trifonov-phd Edward N. Trifonov] |year=1990| book=Structure & Methods| title=Structure and Methods| series= Human Genome Initiative and DNA Recombination| volume=1| pages=69–77|chapter=Making sense of the human genome|publisher=Adenine Press, New York}}</ref> was introduced as a measure of the 'vocabulary richness' of a text.
When a [[nucleotide]] sequence is studied as a text written in the four-letter alphabet, the repetitiveness of such a text, that is, the extensive repetition of some [[N-gram|N-grams (words)]], can be calculated and serves as a measure of sequence complexity. Thus, the more complex a [[DNA_sequence|DNA sequence]], the richer its [[oligonucleotide]] vocabulary, whereas repetitious sequences have relatively lower complexities. The original algorithm described in Trifonov (1990)<ref name=Trifonov1990/> was later improved without changing the essence of the linguistic complexity approach.{{Or}}<ref name=Gabrielian1999>{{cite doi|10.1016/S0097-8485(99)00007-8|noedit}}</ref><ref name=Orlov2004>{{cite doi|10.1093/nar/gkh466|noedit}}</ref><ref name=Janson2004>{{cite doi|10.1016/j.tcs.2004.06.023|noedit}}</ref>
The meaning of LC may be better understood by representing a sequence as a [[Tree (data structure)|tree]] of all subsequences of the given sequence. The most complex sequences have maximally balanced trees, while the measure of imbalance or tree asymmetry serves as a [[Computer linguistics|complexity measure]]. The number of nodes at the tree level {{math|<var>i</var>}} is equal to the actual vocabulary size of words of length {{math|<var>i</var>}} in a given sequence; the number of nodes at the tree level {{math|<var>i</var>}} in the most balanced tree, which corresponds to the most complex sequence of length N, is either 4<sup>i</sup> or N − i + 1, whichever is smaller. The complexity {{math|<var>C</var>}} of a sequence can then be measured as the product of the vocabulary-usage measures U<sub>i</sub>:
{{nb5}} <math>C = U_1 U_2 \cdots U_i \cdots U_w</math>
Vocabulary usage for [[oligomers]] of a given size {{math|<var>i</var>}} can be defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, U<sub>2</sub> for the sequence ACGGGAAGCTGATTCCA is 14/16, as it contains 14 of the 16 possible different dinucleotides; U<sub>3</sub> for the same sequence is 15/15, and U<sub>4</sub> is 14/14. For the sequence ACACACACACACACACA, U<sub>1</sub> = 1/2; U<sub>2</sub> = 2/16 = 0.125, as it has a simple vocabulary of only two dinucleotides; and U<sub>3</sub> = 2/15. Only k-tuples with k from two to W are considered, where W depends on RW. For RW values less than 18, W is equal to 3; for RW less than 67, W is equal to 4; for RW < 260, W = 5; for RW < 1029, W = 6, and so on. The value of {{math|<var>C</var>}} provides a measure of sequence complexity in the
A different modification was used by Troyanskaya et al.,<ref name=TAKLB01>{{Cite journal | doi = 10.1093/bioinformatics/18.5.679| title = Sequence complexity profiles of prokaryotic genomic sequences: A fast algorithm for calculating linguistic complexity| journal = Bioinformatics| volume = 18| issue = 5| pages = 679–88| year = 2002| last1 = Troyanskaya | first1 = O. G.| last2 = Arbell | first2 = O.| last3 = Koren | first3 = Y.| last4 = Landau | first4 = G. M.| last5 = Bolshoy | first5 = A. | pmid=12050064| doi-access = free}}</ref> wherein linguistic complexity (LC) is defined as the ratio of the number of substrings of any length present in the string to the maximum possible number of substrings. For a sequence of length N, the maximum vocabulary over word sizes 1 to m can be calculated according to the simple formula <math>\sum_{k=1}^{m} \min(4^k, N-k+1)</math>.<ref name=TAKLB01 />
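The two measures can be illustrated with a minimal Python sketch. It is not the published implementation: the function names are hypothetical, and the maximal vocabulary size for words of length i in a sequence of length N is assumed to be the smaller of 4<sup>i</sup> and N − i + 1, which is consistent with the worked examples (such as U<sub>3</sub> = 15/15 for a sequence of length 17). The choice of the largest word length is left to the caller.

```python
from fractions import Fraction

ALPHABET_SIZE = 4  # nucleotides A, C, G, T


def vocabulary_size(seq: str, i: int) -> int:
    """Number of distinct words (i-mers) of length i that occur in seq."""
    if not 1 <= i <= len(seq):
        raise ValueError("word length must be between 1 and len(seq)")
    return len({seq[k:k + i] for k in range(len(seq) - i + 1)})


def max_vocabulary_size(n: int, i: int) -> int:
    """Assumed maximum: the smaller of 4**i and the n - i + 1 word positions."""
    return min(ALPHABET_SIZE ** i, n - i + 1)


def vocabulary_usage(seq: str, i: int) -> Fraction:
    """U_i: actual vocabulary size divided by the maximal possible size."""
    return Fraction(vocabulary_size(seq, i), max_vocabulary_size(len(seq), i))


def complexity_product(seq: str, w: int, start: int = 1) -> Fraction:
    """C: product of U_i for word lengths start..w (inclusive)."""
    c = Fraction(1)
    for i in range(start, w + 1):
        c *= vocabulary_usage(seq, i)
    return c


def substring_ratio(seq: str, m: int) -> Fraction:
    """LC as a ratio: observed distinct words of length 1..m over the maximum."""
    observed = sum(vocabulary_size(seq, i) for i in range(1, m + 1))
    maximum = sum(max_vocabulary_size(len(seq), i) for i in range(1, m + 1))
    return Fraction(observed, maximum)


rich = "ACGGGAAGCTGATTCCA"
poor = "ACACACACACACACACA"

print(vocabulary_usage(rich, 2))    # 7/8, i.e. 14/16
print(vocabulary_usage(poor, 3))    # 2/15
print(complexity_product(poor, 3))  # 1/120
```

Exact fractions are used so that the results can be compared directly with the ratios quoted in the text; converting with <code>float()</code> gives the decimal values.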
== References ==
{{Reflist}}
[[Category:Nucleic acids]]