Revision as of 10:12, 18 March 2012 by Stigmatella aurantiaca. Edit summary: As currently reworded, the original research tag can probably be removed.
Revision as of 20:12, 21 March 2012 by Rkalendar. Minor edit, no edit summary.
<math>C = U_1 U_2 \cdots U_i \cdots U_w</math>

Vocabulary usage for [[oligomers]] of a given size {{math|<var>i</var>}} can be defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, U<sub>2</sub> for the sequence ACGGGAAGCTGATTCCA is 14/16, as it contains 14 of the 16 possible different dinucleotides; U<sub>3</sub> for the same sequence is 15/15, and U<sub>4</sub> is 14/14. For the sequence ACACACACACACACACA, U<sub>1</sub> = 1/2; U<sub>2</sub> = 2/16 = 0.125, as it has a simple vocabulary of only two dinucleotides; and U<sub>3</sub> = 2/15.

Only k-tuples with k from two to W are considered, where W depends on RW. For RW less than 18, W = 3; for RW less than 67, W = 4; for RW less than 260, W = 5; for RW less than 1029, W = 6; and so on. The value of {{math|<var>C</var>}} provides a measure of sequence complexity in the range 0 < C < 1 for various DNA sequence fragments of a given length.<ref name=Gabrielian1999/>

This formula differs from the original LC measure<ref name=Trifonov1990/> in two respects: in the way vocabulary usage U<sub>i</sub> is calculated, and in that {{math|<var>i</var>}} does not range from 2 to N-1 but only up to W. This limitation on the range of U<sub>i</sub> makes the algorithm substantially more efficient without loss of power.<ref name=Gabrielian1999/>
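
The worked values above can be reproduced with a short sketch. It assumes a four-letter DNA alphabet and takes the maximal possible vocabulary for size i to be the smaller of 4<sup>i</sup> and the number of overlapping windows of that size in the sequence, which is the reading that matches the 14/16, 15/15 and 14/14 figures. The names `vocabulary_usage` and `complexity`, and the choice to pass the upper limit `w` and the starting size explicitly, are illustrative assumptions and are not taken from the cited sources.

```python
from fractions import Fraction


def vocabulary_usage(seq, k, alphabet_size=4):
    """Return U_k: distinct k-mers observed / maximal possible vocabulary.

    Assumption: the maximal vocabulary is min(alphabet_size**k, number of
    overlapping windows of length k), matching the examples in the text.
    """
    n_windows = len(seq) - k + 1
    if k < 1 or n_windows <= 0:
        raise ValueError("k must be between 1 and the sequence length")
    # Collect every distinct overlapping k-mer in the sequence.
    observed = {seq[i:i + k] for i in range(n_windows)}
    max_vocab = min(alphabet_size ** k, n_windows)
    return Fraction(len(observed), max_vocab)


def complexity(seq, w, start=1):
    """Return the product U_start * ... * U_w as an exact fraction.

    `start` is a hypothetical parameter: the formula begins at U_1, while
    the text says k-tuples from two to W are considered.
    """
    c = Fraction(1)
    for k in range(start, w + 1):
        c *= vocabulary_usage(seq, k)
    return c


if __name__ == "__main__":
    for s in ("ACGGGAAGCTGATTCCA", "ACACACACACACACACA"):
        usages = [vocabulary_usage(s, k) for k in (1, 2, 3)]
        print(s, usages, complexity(s, 3))
```

`Fraction` keeps the ratios exact, so 14/16 is reported in reduced form as 7/8. The rule that derives W from RW is not implemented here, because the text gives only the first few thresholds.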

Linguistic sequence complexity: Difference between revisions