{{WikiProject Mathematics| importance = mid }}
== Online algorithm in testing yields horrible results when mean=~0 ==
Most all the tests I've seen of these algorithms add some unrealistic constant (i.e. 10^6 or larger) to the dataset to demonstrate that the suggested algorithm on this page is indeed better. I naively used this algorithm in my own work, to horrible effect. My dataset consists of a large number of discrete values, perhaps with the values -1, 0, or 1, and with an average usually between -1 and 1. I wrote the following simple test program to demonstrate the difference in results between what I'll call METHOD1 (using a running sum and sum of squares of the dataset) and METHOD2 (using a running computation of the average and the variance, which the current wiki strongly recommends).
<syntaxhighlight lang="cpp">
#include <iostream>
#include <iomanip>
// ...
return 0;
}
</syntaxhighlight>
And here is sample output:
<syntaxhighlight lang="text">
numtrials float avg 2 float avg 1 double avg 2 double avg 1 float std 2 float std 1 double std 2 double std 1
1000000 0.948275 0.950115 0.950115 0.950115 0.218147 0.217707 0.217707 0.217707
...
38000000 0.816234 0.441506 0.950036 0.950036 0.195328 0.496567 0.217871 0.217871
39000000 0.816234 0.430185 0.950032 0.950032 0.194878 0.495102 0.217878 0.217878
</syntaxhighlight>
Note how the computed average using floats and method 2 fails to six-digit accuracy before even 1 million trials, while method 1 using floats reproduces the double results all the way out to 17 million trials. <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/173.219.85.45|173.219.85.45]] ([[User talk:173.219.85.45|talk]]) 19:41, 19 October 2012 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->
:Method 1's sum += sample is exact (with sample values 1 and 0) until it has no effect, so method 1 fails catastrophically after about 18 million. Method 2's average += delta / numtrials is generally imprecise, and you can't expect single precision to be great here for numtrials in 6 or 7 digits, but it's not too far off on average, so even with single floats and these large numbers of samples the estimated variances are not too unreasonable. [[User:Dicklyon|Dicklyon]] ([[User talk:Dicklyon|talk]]) 21:18, 22 May 2018 (UTC)
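A minimal illustration of the cutoff described above (a Python/NumPy sketch, with <code>np.float32</code> standing in for the C++ <code>float</code>): integers are exact in IEEE-754 single precision only up to 2<sup>24</sup> = 16,777,216, so a running <code>sum += 1.0</code> of 0/1 samples stalls around that point, which matches the observed failure near 17–18 million trials.

<syntaxhighlight lang="python">
import numpy as np

# Single-precision floats represent consecutive integers exactly only
# up to 2**24; past that, adding 1.0 is rounded away entirely, so a
# float running sum of 0/1 samples can never grow beyond ~16.78 million.
big = np.float32(2.0**24)
stuck = big + np.float32(1.0)
print(stuck == big)  # the increment is silently lost
</syntaxhighlight>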
==Untitled==
The Welford algorithm *is* an on-line algorithm, because it needs to see only one data item at a time! It does *not* need to store the complete set of data, i.e., it does not need to look at any data item more than once. <small><span class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Snoopy67|Snoopy67]] ([[User talk:Snoopy67|talk]] • [[Special:Contributions/Snoopy67|contribs]]) 21:30, 17 July 2011 (UTC)</span></small><!-- Template:Unsigned --> <!--Autosigned by SineBot-->
:The Welford algorithm as written calculates the sample variance in the finalize routine, and nowhere in the description does it specifically call that out. <!-- Template:Unsigned --><small class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:192.91.171.34|192.91.171.34]] ([[User talk:192.91.171.34#top|talk]] • [[Special:Contributions/192.91.171.34|contribs]]) </small>
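For concreteness, a minimal Python sketch of the algorithm under discussion: one pass, one sample at a time, O(1) memory, with the sample variance computed only in the finalize step (function and variable names are illustrative).

<syntaxhighlight lang="python">
# Sketch of Welford's online algorithm: each sample is seen exactly once
# and the data set is never stored.
def welford(samples):
    n, mean, M2 = 0, 0.0, 0.0
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)   # note: uses the *updated* mean
    # finalize: sample variance, undefined for fewer than two samples
    return mean, (M2 / (n - 1)) if n > 1 else float("nan")

print(welford([2.0, 4.0, 6.0]))  # (4.0, 4.0)
</syntaxhighlight>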
===Parallel algorithm===
Is there an error in <math>M_{2,AB}</math>?
Chan et al. in their paper <ref name=":0">{{Citation
| last1 = Chan | first1 = Tony F. | author1-link = Tony F. Chan
| last2 = Golub | first2 = Gene H. | author2-link = Gene H. Golub
| last3 = LeVeque | first3 = Randall J.
| contribution = Updating Formulae and a Pairwise Algorithm for Computing Sample Variances.
| title = Technical Report STAN-CS-79-773
| publisher = Department of Computer Science, Stanford University
| year = 1979
| contribution-url =http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf }}.</ref> has another relation for combining arbitrary sets <math>A</math> and <math>B</math>:
<math>M_{2,AB} = M_{2,A} + M_{2,B} + \frac{n_A}{n_B n_{AB}} \cdot \left( \frac{n_B}{n_A} \bar x_A - \bar x_B \right)^2 </math>, which yields <math>M_{2,AB} = M_{2,A} + M_{2,B} + \frac{1}{n_{AB}} \cdot \left( \bar x_A - \bar x_B \right)^2 </math> if <math>n_A</math> and <math>n_B</math> are equal.
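For comparison, the usual combination formula <math>M_{2,AB} = M_{2,A} + M_{2,B} + \delta^2 \frac{n_A n_B}{n_{AB}}</math> with <math>\delta = \bar x_B - \bar x_A</math> can be checked numerically against a direct two-pass computation (a Python sketch; helper names are illustrative):

<syntaxhighlight lang="python">
# Direct two-pass statistics for one partition: (n, mean, M2).
def stats(xs):
    n = len(xs)
    mean = sum(xs) / n
    return n, mean, sum((x - mean) ** 2 for x in xs)

# Combine two partitions' (n, mean, M2) triples pairwise.
def combine(a, b):
    (na, ma, M2a), (nb, mb, M2b) = a, b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    return n, mean, M2a + M2b + delta * delta * na * nb / n

A, B = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0, 40.0]
merged, direct = combine(stats(A), stats(B)), stats(A + B)
print(all(abs(x - y) < 1e-9 for x, y in zip(merged, direct)))
</syntaxhighlight>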
=== Algorithm IV ===
: Btw, I don't understand your comment at the beginning since the problematic division is outside the loop. [[User:McKay|McKay]] ([[User talk:McKay|talk]]) 01:02, 10 April 2009 (UTC)
::: It should not say "else variance = 0". That implies the sample variance is 0 when ''n'' = 1. That is incorrect. The sample variance is undefined in that case. [[User:Michael Hardy|Michael Hardy]] ([[User talk:Michael Hardy|talk]]) 01:18, 10 April 2009 (UTC)
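A finalize step reflecting Michael Hardy's point would report the sample variance as undefined (e.g. NaN) rather than 0 when <math>n = 1</math> (a minimal sketch; the function name is illustrative):

<syntaxhighlight lang="python">
import math

# Report the sample variance only when it is defined, i.e. n > 1;
# for a single sample it is undefined, not zero.
def finalize(M2, n):
    return M2 / (n - 1) if n > 1 else float("nan")

print(math.isnan(finalize(0.0, 1)))  # True: undefined for one sample
print(finalize(8.0, 3))              # 4.0
</syntaxhighlight>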
{{reflist-talk}}
== Easiest online algorithm yet. ==
== Pseudocode is valid Python, but gets tripped up by integer division ==
Would anyone object to a note pointing out that the pseudocode can be run verbatim in a Python interpreter, but that in Python < 3.0 you'll need a <code>from __future__ import division</code> to avoid integer division?
::I think it would be inappropriate. The whole idea of pseudocode is that it's not language specific. -- [[User:RoySmith|RoySmith]] [[User Talk:RoySmith|(talk)]] 18:52, 25 December 2010 (UTC)
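The pitfall being discussed, concretely (under Python 3 or with the <code>__future__</code> import, <code>/</code> is true division; Python 2's old behavior survives as <code>//</code>):

<syntaxhighlight lang="python">
# In Python 2, 1 / 2 == 0, silently truncating every division in the
# pseudocode; true division gives 0.5, and floor division (//) keeps
# the old integer behavior explicitly.
n = 2
print(1 / n)    # 0.5 under true division
print(1 // n)   # 0, the floor-division result
</syntaxhighlight>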
== Weighted Incremental Algorithm Alternative Incorrect ==
<syntaxhighlight lang="python">
def weighted_incremental_variance(dataWeightPairs):
...
M2 = M2 + sumweight * delta * R # Alternatively, "M2 = M2 + weight * delta * (x−mean)"
...
</syntaxhighlight>
I did the math to see if the stated update formula is indeed correct with the alternative mentioned in the comment; I have found that it is not. As such, I propose the alternative comment be removed.
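For reference, here is a runnable sketch of the primary update in question (West's formulation, with <code>M2 += sumweight * delta * R</code>), checked against a direct weighted computation; names are illustrative.

<syntaxhighlight lang="python">
# Weighted incremental mean/variance (West's formulation).
def weighted_incremental(pairs):
    sumweight, mean, M2 = 0.0, 0.0, 0.0
    for x, w in pairs:
        temp = w + sumweight
        delta = x - mean
        R = delta * w / temp
        mean += R
        M2 += sumweight * delta * R
        sumweight = temp
    return mean, M2 / sumweight  # population-style weighted variance

pairs = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0)]
mean, var = weighted_incremental(pairs)

# Direct two-pass check.
wsum = sum(w for _, w in pairs)
direct_mean = sum(w * x for x, w in pairs) / wsum
direct_var = sum(w * (x - direct_mean) ** 2 for x, w in pairs) / wsum
print(abs(mean - direct_mean) < 1e-12 and abs(var - direct_var) < 1e-12)
</syntaxhighlight>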
Revision of {{oldid|Algorithms for calculating variance|270157362|12 February 2009}} further "corrected" titles by eliminating roman numerals. Today's revision of {{oldid|Algorithms for calculating variance|707525590}} replaced roman numerical references from the text according to the above map. [[User:Ale2006|ale]] ([[User talk:Ale2006|talk]]) 10:32, 29 February 2016 (UTC)
== Issues with the section called "Welford's online algorithm" ==
1. I don't see the point of these equations:
<math>s_n^2 = \frac{n-2}{n-1}\,s_{n-1}^2 + \frac{(x_n - \bar x_{n-1})^2}{n} = s_{n-1}^2 + \frac{(x_n - \bar x_{n-1})^2}{n} - \frac{s_{n-1}^2}{n-1}, \quad n > 1</math>
<math>\sigma_n^2 = \frac{(n-1)\,\sigma_{n-1}^2 + (x_n - \bar x_{n-1})(x_n - \bar x_n)}{n} = \sigma_{n-1}^2 + \frac{(x_n - \bar x_{n-1})(x_n - \bar x_n) - \sigma_{n-1}^2}{n}.</math>
They are just scaled versions of the equation that follows. I recommend removal.
2. The comment "These formulas suffer from numerical instability, as they repeatedly subtract a small number from a big number which scales with n." is confusing. It looks to me like some of the formulas repeatedly add a small number to a large number. In this respect they are no worse than the equation that follows.
3. The advantage of the <math>M_{2,n}</math> recurrence is that it is simpler and more computationally efficient because there are fewer multiplications and no divisions.
4. The last two equations should be
<math>s_n^2 = \frac{M_{2,n}}{n-1}, \qquad \sigma_n^2 = \frac{M_{2,n}}{n}</math>
[[User:ProfRB|ProfRB]] ([[User talk:ProfRB|talk]]) 18:51, 22 March 2019 (UTC)
:2. I think the comment "These formulas suffer from numerical instability, as they repeatedly subtract a small number from a big number which scales with n" is actually wrong. Maybe there is an '''accuracy''' issue, but I don't see why there should be an instability here. A reference would be most welcome. [[User:Natchouf|Natchouf]] ([[User talk:Natchouf|talk]]) 14:08, 20 March 2023 (UTC)
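On points 1 and 2: a quick numerical check (Python sketch, names illustrative) that the <math>\sigma_n^2</math> recurrence tracks <math>M_{2,n}/n</math> exactly, apart from roundoff, so the two formulations agree step by step:

<syntaxhighlight lang="python">
# Run the sigma^2 recurrence and Welford's M2 recurrence side by side;
# in exact arithmetic sigma2 == M2 / n at every step.
xs = [1.0, 4.0, 2.0, 8.0, 5.0]
mean, sigma2, M2 = xs[0], 0.0, 0.0
for n, x in enumerate(xs[1:], start=2):
    old_mean = mean
    mean = old_mean + (x - old_mean) / n
    sigma2 += ((x - old_mean) * (x - mean) - sigma2) / n
    M2 += (x - old_mean) * (x - mean)
print(abs(sigma2 - M2 / len(xs)) < 1e-12)
</syntaxhighlight>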
== Typo in "Computing shifted data" section ==
The second shown formula in this section does not compute the population variance <math>\sigma^2</math>, it rather computes the sample variance <math>s^2</math>. This can easily be seen when comparing the (wrong) formula divisor <math>n-1</math> with the respective divisor <math>n</math> as used in the section "Naive algorithm" just above (where capital <math>N</math> is used instead of <math>n</math>).
The comments in the corresponding code snippet below make the situation a bit clearer. Maybe one could write the code explicitly as
<syntaxhighlight lang="text">
...
// for sample variance use
variance = (Ex2 - Ex**2 / n) / (n - 1)
// for population variance use
// variance = (Ex2 - Ex**2 / n) / n
...
</syntaxhighlight>
[[Special:Contributions/141.249.133.134|141.249.133.134]] ([[User talk:141.249.133.134|talk]]) 06:17, 10 April 2024 (UTC)
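A runnable sketch of the shifted-data computation with both divisors made explicit (the shift <code>K</code> is taken as the first data point; function and flag names are illustrative):

<syntaxhighlight lang="python">
# Shifted-data (computed-shift) variance: shifting by K near the mean
# reduces catastrophic cancellation in Ex2 - Ex**2 / n.
def shifted_variance(data, sample=True):
    K = data[0]                       # any value inside the data range works
    n = len(data)
    Ex = sum(x - K for x in data)
    Ex2 = sum((x - K) ** 2 for x in data)
    num = Ex2 - Ex ** 2 / n
    return num / (n - 1) if sample else num / n   # sample vs population

data = [4.0, 7.0, 13.0, 16.0]
print(shifted_variance(data, sample=True))    # 30.0 (divisor n - 1)
print(shifted_variance(data, sample=False))   # 22.5 (divisor n)
</syntaxhighlight>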