Revision as of 09:48, 1 April 2020 by 2a02:c7f:c42a:1100:8435:572d:1c4b:44c1. Edit summary: They had the number of two letter words wrong in paragraph 3
Revision as of 09:51, 1 April 2020 by 2a02:c7f:c42a:1100:8435:572d:1c4b:44c1. Edit summary: Sorry incorrect edit, reverting.
Line 6:
{{nb5}} <math>C = U_1 U_2 \cdots U_i \cdots U_W</math>

Vocabulary usage for [[oligomers]] of a given size {{math|<var>i</var>}} can be defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, U<sub>2</sub> for the sequence ACGGGAAGCTGATTCCA = 14/16, as it contains 14 of 16 possible different dinucleotides; U<sub>3</sub> for the same sequence = 15/15, and U<sub>4</sub> = 14/14. For the sequence ACACACACACACACACA, U<sub>1</sub> = 1/2; U<sub>2</sub> = 2/16 = 0.125, as it has a simple vocabulary of only two dinucleotides; U<sub>3</sub> for this sequence = 2/15. k-tuples with k from two to W are considered, where W depends on RW. For RW values less than 18, W is equal to 3; for RW less than 67, W is equal to 4; for RW < 260, W = 5; for RW < 1029, W = 6, and so on. The value of {{math|<var>C</var>}} provides a measure of sequence complexity in the range 0 < C < 1 for various DNA sequence fragments of a given length.<ref name=Gabrielian1999 />

This formula differs from the original LC measure<ref name=Trifonov1990 /> in two respects: in the way vocabulary usage U<sub>i</sub> is calculated, and in that {{math|<var>i</var>}} does not range from 2 to N-1 but only up to W. This limitation on the range of U<sub>i</sub> makes the algorithm substantially more efficient without loss of power.<ref name=Gabrielian1999 />

Another modified version was used by Troyanskaya et al.,<ref name=TAKLB01>{{Cite journal | doi = 10.1093/bioinformatics/18.5.679 | title = Sequence complexity profiles of prokaryotic genomic sequences: A fast algorithm for calculating linguistic complexity | journal = Bioinformatics | volume = 18 | issue = 5 | pages = 679–88 | year = 2002 | last1 = Troyanskaya | first1 = O. G. | last2 = Arbell | first2 = O. | last3 = Koren | first3 = Y. | last4 = Landau | first4 = G. M. | last5 = Bolshoy | first5 = A. | pmid = 12050064}}</ref> wherein linguistic complexity (LC) is defined as the ratio of the number of substrings of any length present in the string to the maximum possible number of substrings. Maximum vocabulary over word sizes 1 to m can be calculated according to the simple formula <math>\sum_{k=1}^{m} \min(4^k, N-k+1)</math>, where N is the sequence length.<ref name=TAKLB01 />
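
The vocabulary-usage arithmetic described above can be checked with a short Python sketch. It is an illustration and not the implementation from the cited papers. The function names (`vocabulary_usage`, `max_word_size`, `complexity`) are hypothetical, the alphabet is assumed to be the four DNA letters, and the closed-form rule for W (the smallest W ≥ 3 with RW < 4<sup>W-1</sup> + W - 1) is an assumption inferred from the listed thresholds 18, 67, 260 and 1029. The product follows the formula for C and starts at U<sub>1</sub> by default; pass `k_min=2` to follow the "k from two to W" wording instead.

```python
import math


def vocabulary_usage(seq: str, k: int, alphabet_size: int = 4) -> float:
    """U_k: distinct k-mers observed, divided by the maximum possible.

    The maximum is limited both by the number of possible words
    (alphabet_size ** k) and by the number of windows in the sequence.
    """
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        raise ValueError("k must not exceed the sequence length")
    observed = len({seq[i:i + k] for i in range(n_windows)})
    max_possible = min(alphabet_size ** k, n_windows)
    return observed / max_possible


def max_word_size(rw: int, alphabet_size: int = 4) -> int:
    """W for a window of length rw.

    Assumption: reproduces the thresholds quoted in the text
    (RW < 18 -> 3, < 67 -> 4, < 260 -> 5, < 1029 -> 6) by extending
    the pattern alphabet_size ** (W - 1) + (W - 1).
    """
    w = 3
    while rw >= alphabet_size ** (w - 1) + (w - 1):
        w += 1
    return w


def complexity(seq: str, k_min: int = 1) -> float:
    """C as the product of U_k for k from k_min to W (hypothetical helper)."""
    w = max_word_size(len(seq))
    return math.prod(vocabulary_usage(seq, k) for k in range(k_min, w + 1))


if __name__ == "__main__":
    print(vocabulary_usage("ACGGGAAGCTGATTCCA", 2))  # 0.875, i.e. 14/16
    print(vocabulary_usage("ACACACACACACACACA", 2))  # 0.125, i.e. 2/16
    print(complexity("ACGGGAAGCTGATTCCA"))           # 0.875
```

Both example sequences have 17 letters, so W = 3 under the assumed rule, and the first sequence gives C = 1 × 14/16 × 15/15 = 0.875.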

Linguistic sequence complexity: Difference between revisions