Talk:Algorithms for calculating variance: Difference between revisions

Content deleted Content added
Fr.dror (talk | contribs)
mNo edit summary
m 2 revisions imported: import old edits from "Algorithms for calculating variance/Talk" in the August 2001 database dump
 
(7 intermediate revisions by 7 users not shown)
Line 1:
{{WikiProject Statisticsbanner shell| class = start Start| importance = mid }}
{{WikiProject Statistics| importance = mid }}
{{maths rating |frequentlyviewed = yes | class = start | importance = mid | field = probability and statistics}}
{{WikiProject Mathematics| importance = mid }}
 
}}
 
== Online algorithm in testing yields horrible results when mean=~0 ==
Line 7 ⟶ 8:
Most all the tests I've seen of these algorithms add some unrealistic constant (i.e. 10^6 or larger) to the dataset to demonstrate that the suggested algorithm on this page is indeed better. I naively used this algorithm in my own work, to horrible effect. My dataset consists of a large number of discrete values, perhaps with the values -1, 0, or 1, and with an average usually between -1 and 1. I wrote the following simple test program to demonstrate the difference in results between what I'll call METHOD1 (using a running sum and sum of squares of the dataset) and METHOD2 (using a running computation of the average and the variance, which the current wiki strongly recommends).
 
<sourcesyntaxhighlight lang="cpp">
#include <iostream>
#include <iomanip>
Line 78 ⟶ 79:
return 0;
}
</syntaxhighlight>
</source>
 
And here sample output:
 
<sourcesyntaxhighlight lang="text">
numtrials float avg 2 float avg 1 double avg 2 double avg 1 float std 2 float std 1 double std 2 double std 1
1000000 0.948275 0.950115 0.950115 0.950115 0.218147 0.217707 0.217707 0.217707
Line 123 ⟶ 124:
38000000 0.816234 0.441506 0.950036 0.950036 0.195328 0.496567 0.217871 0.217871
39000000 0.816234 0.430185 0.950032 0.950032 0.194878 0.495102 0.217878 0.217878
</syntaxhighlight>
</source>
 
Note how the computed average using float's and method 2 fails to six digits accuracy before even 1 million trials, while method 1 using floats reproduces the double results all the way out to 17 million trials. <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/173.219.85.45|173.219.85.45]] ([[User talk:173.219.85.45|talk]]) 19:41, 19 October 2012 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->
Line 244 ⟶ 245:
 
:The Welford algorithm as written calculates the sample variance in the finalize routine, and nowhere in the description does it specifically call that out. <!-- Template:Unsigned --><small class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:192.91.171.34|192.91.171.34]] ([[User talk:192.91.171.34#top|talk]] • [[Special:Contributions/192.91.171.34|contribs]]) </small>
 
===Parallel algorithm===
Is there an error in <math>M_{2,AB}</math>?
Chan et al. in its paper <ref name=":0">{{Citation
| last1 = Chan | first1 = Tony F. | author1-link = Tony F. Chan
| last2 = Golub | first2 = Gene H. | author2-link = Gene H. Golub
| last3 = LeVeque | first3 = Randall J.
| contribution = Updating Formulae and a Pairwise Algorithm for Computing Sample Variances.
| title = Technical Report STAN-CS-79-773
| publisher = Department of Computer Science, Stanford University
| year = 1979
| contribution-url =http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf }}.</ref> has another relation for combining arbitrary sets <math>A</math> and <math>B</math>:
<math>M_{2,AB} = M_{2,A} + M_{2,B} + \frac{n_A}{n_B n_{AB}} \cdot \left( \frac{n_B}{n_A} \bar x_A - \bar x_B \right)^2 </math>, which yelds to <math>M_{2,AB} = M_{2,A} + M_{2,B} + \frac{1}{n_{AB}} \cdot \left( \bar x_A - \bar x_B \right)^2 </math> if <math>n_A</math> and <math>n_B</math> are equal.
 
=== Algorithm IV ===
Line 343 ⟶ 357:
: Btw, I don't understand your comment at the beginning since the problematic division is outside the loop. [[User:McKay|McKay]] ([[User talk:McKay|talk]]) 01:02, 10 April 2009 (UTC)
::: It should not say "else variance = 0". That implies the sample variance is 0 when ''n'' = 1. That is incorrect. The sample variance is undefined in that case. [[User:Michael Hardy|Michael Hardy]] ([[User talk:Michael Hardy|talk]]) 01:18, 10 April 2009 (UTC)
 
{{reflist-talk}}
 
== Easiest online algorithm yet. ==
Line 393 ⟶ 409:
 
== Pseudocode is valid Python, but gets tripped up by integer division ==
Would anyone object to a note pointing out that the pseudocode can be run verbatim in a Python intepreter, but that in Python < 3.0 you'll need a <sourcesyntaxhighlight lang="python">from __future__ import division</sourcesyntaxhighlight> statement. (Can you tell I just got caught out by this? :D) [[User:Alexsdutton|Alexsdutton]] ([[User talk:Alexsdutton|talk]]) 15:05, 23 December 2010 (UTC)
::I think it would be inappropriate. The whole idea of pseudocode is that it's not language specific. -- [[User:RoySmith|RoySmith]] [[User Talk:RoySmith|(talk)]] 18:52, 25 December 2010 (UTC)
 
Line 492 ⟶ 508:
== Weighted Incremental Algorithm Alternative Incorrect ==
 
<sourcesyntaxhighlight lang="python">
def weighted_incremental_variance(dataWeightPairs):
...
M2 = M2 + sumweight * delta * R # Alternatively, "M2 = M2 + weight * delta * (x−mean)"
...
</syntaxhighlight>
</source>
 
I did the math to see if the stated update formula is indeed correct with the alternative mentioned in the comment; I have found that it is not. As such, I propose the alternative comment be removed.
Line 592 ⟶ 608:
{\displaystyle {\begin{aligned}s_{n}^{2}&={\frac {M_{2,n}}{n-1}}\\[4pt]\sigma _{n}^{2}&={\frac {M_{2,n}}{n}}\end{aligned}}}
[[User:ProfRB|ProfRB]] ([[User talk:ProfRB|talk]]) 18:51, 22 March 2019 (UTC)
 
:2. I think the comment "These formulas suffer from numerical instability, as they repeatedly subtract a small number from a big number which scales with n" is actually wrong. Maybe there is an '''accuracy''' issue, but I don't see why there should be an instability here. A reference would be most welcome. [[User:Natchouf|Natchouf]] ([[User talk:Natchouf|talk]]) 14:08, 20 March 2023 (UTC)
 
== Typo in "Computing shifted data" section ==
 
The second shown formula in this section does not compute the population variance <math>\sigma^2</math>, it rather computes the sample variance <math>s^2</math>. This can easily be seen when comparing the (wrong) formula divisor <math>n-1</math> with the respective divisor <math>n</math> as used in the section "Naive algorithm" just above (where capital <math>N</math> is used instead of <math>n</math>).
 
The comments in the corresponding code snippet below makes the situation a bit clearer. Maybe one could write the code explicitly as
 
...
// for sample variance use
variance = (Ex2 - Ex**2 / n) / (n - 1)
// for population variance use
// variance = (Ex2 - Ex**2 / n) / n
...
 
[[Special:Contributions/141.249.133.134|141.249.133.134]] ([[User talk:141.249.133.134|talk]]) 06:17, 10 April 2024 (UTC)