{{WikiProject Mathematics| importance = mid }}
== Online algorithm in testing yields horrible results when mean=~0 ==
Most all the tests I've seen of these algorithms add some unrealistic constant (i.e. 10^6 or larger) to the dataset to demonstrate that the suggested algorithm on this page is indeed better. I naively used this algorithm in my own work, to horrible effect. My dataset consists of a large number of discrete values, perhaps with the values -1, 0, or 1, and with an average usually between -1 and 1. I wrote the following simple test program to demonstrate the difference in results between what I'll call METHOD1 (using a running sum and sum of squares of the dataset) and METHOD2 (using a running computation of the average and the variance, which the current wiki strongly recommends).
<syntaxhighlight lang="cpp">
#include <iostream>
#include <iomanip>
// ...
return 0;
}
</syntaxhighlight>
And here is sample output:
<syntaxhighlight lang="text">
numtrials float avg 2 float avg 1 double avg 2 double avg 1 float std 2 float std 1 double std 2 double std 1
1000000 0.948275 0.950115 0.950115 0.950115 0.218147 0.217707 0.217707 0.217707
...
38000000 0.816234 0.441506 0.950036 0.950036 0.195328 0.496567 0.217871 0.217871
39000000 0.816234 0.430185 0.950032 0.950032 0.194878 0.495102 0.217878 0.217878
</syntaxhighlight>
Note how the computed average using floats and method 2 fails to six-digit accuracy before even 1 million trials, while method 1 using floats reproduces the double results all the way out to 17 million trials. <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/173.219.85.45|173.219.85.45]] ([[User talk:173.219.85.45|talk]]) 19:41, 19 October 2012 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->
:Method 1's sum += sample is exact (with sample values 1 and 0) until it has no effect, so method 1 fails catastrophically after about 18 million. Method 2's average += delta / numtrials is generally imprecise, and you can't expect single precision to be great here for numtrials in 6 or 7 digits, but it's not too far off on average, so even with single floats and these large numbers of samples the estimated variances are not too unreasonable. [[User:Dicklyon|Dicklyon]] ([[User talk:Dicklyon|talk]]) 21:18, 22 May 2018 (UTC)
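A minimal illustration of the cutoff described above (a Python/NumPy sketch, with <code>np.float32</code> standing in for the C++ <code>float</code>): integers are exact in IEEE-754 single precision only up to 2<sup>24</sup> = 16,777,216, so a running <code>sum += 1.0</code> of 0/1 samples stalls around that point, which matches the observed failure near 17–18 million trials.

<syntaxhighlight lang="python">
import numpy as np

# Single-precision floats represent consecutive integers exactly only
# up to 2**24; past that, adding 1.0 is rounded away entirely, so a
# float running sum of 0/1 samples can never grow beyond ~16.78 million.
big = np.float32(2.0**24)
stuck = big + np.float32(1.0)
print(stuck == big)  # the increment is silently lost
</syntaxhighlight>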
==Untitled==
The Welford algorithm *is* an on-line algorithm, because it needs to see only one data item at a time! It does *not* need to store the complete set of data, i.e., it does not need to look at any data item more than once. <small><span class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Snoopy67|Snoopy67]] ([[User talk:Snoopy67|talk]] • [[Special:Contributions/Snoopy67|contribs]]) 21:30, 17 July 2011 (UTC)</span></small><!-- Template:Unsigned --> <!--Autosigned by SineBot-->
:The Welford algorithm as written calculates the sample variance in the finalize routine, and nowhere in the description does it specifically call that out. <!-- Template:Unsigned --><small class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:192.91.171.34|192.91.171.34]] ([[User talk:192.91.171.34#top|talk]] • [[Special:Contributions/192.91.171.34|contribs]]) </small>
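For concreteness, a minimal Python sketch of the algorithm under discussion: one pass, one sample at a time, O(1) memory, with the sample variance computed only in the finalize step (function and variable names are illustrative).

<syntaxhighlight lang="python">
# Sketch of Welford's online algorithm: each sample is seen exactly once
# and the data set is never stored.
def welford(samples):
    n, mean, M2 = 0, 0.0, 0.0
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)   # note: uses the *updated* mean
    # finalize: sample variance, undefined for fewer than two samples
    return mean, (M2 / (n - 1)) if n > 1 else float("nan")

print(welford([2.0, 4.0, 6.0]))  # (4.0, 4.0)
</syntaxhighlight>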
===Parallel algorithm===
Is there an error in <math>M_{2,AB}</math>?
Chan et al. in their paper <ref name=":0">{{Citation
| last1 = Chan | first1 = Tony F. | author1-link = Tony F. Chan
| last2 = Golub | first2 = Gene H. | author2-link = Gene H. Golub
| last3 = LeVeque | first3 = Randall J.
| contribution = Updating Formulae and a Pairwise Algorithm for Computing Sample Variances.
| title = Technical Report STAN-CS-79-773
| publisher = Department of Computer Science, Stanford University
| year = 1979
| contribution-url =http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf }}.</ref> has another relation for combining arbitrary sets <math>A</math> and <math>B</math>:
<math>M_{2,AB} = M_{2,A} + M_{2,B} + \frac{n_A}{n_B n_{AB}} \cdot \left( \frac{n_B}{n_A} \bar x_A - \bar x_B \right)^2 </math>, which yields <math>M_{2,AB} = M_{2,A} + M_{2,B} + \frac{1}{n_{AB}} \cdot \left( \bar x_A - \bar x_B \right)^2 </math> if <math>n_A</math> and <math>n_B</math> are equal.
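For comparison, the usual combination formula <math>M_{2,AB} = M_{2,A} + M_{2,B} + \delta^2 \frac{n_A n_B}{n_{AB}}</math> with <math>\delta = \bar x_B - \bar x_A</math> can be checked numerically against a direct two-pass computation (a Python sketch; helper names are illustrative):

<syntaxhighlight lang="python">
# Direct two-pass statistics for one partition: (n, mean, M2).
def stats(xs):
    n = len(xs)
    mean = sum(xs) / n
    return n, mean, sum((x - mean) ** 2 for x in xs)

# Combine two partitions' (n, mean, M2) triples pairwise.
def combine(a, b):
    (na, ma, M2a), (nb, mb, M2b) = a, b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    return n, mean, M2a + M2b + delta * delta * na * nb / n

A, B = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0, 40.0]
merged, direct = combine(stats(A), stats(B)), stats(A + B)
print(all(abs(x - y) < 1e-9 for x, y in zip(merged, direct)))
</syntaxhighlight>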
=== Algorithm IV ===
: Btw, I don't understand your comment at the beginning since the problematic division is outside the loop. [[User:McKay|McKay]] ([[User talk:McKay|talk]]) 01:02, 10 April 2009 (UTC)
::: It should not say "else variance = 0". That implies the sample variance is 0 when ''n'' = 1. That is incorrect. The sample variance is undefined in that case. [[User:Michael Hardy|Michael Hardy]] ([[User talk:Michael Hardy|talk]]) 01:18, 10 April 2009 (UTC)
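A finalize step reflecting Michael Hardy's point would report the sample variance as undefined (e.g. NaN) rather than 0 when <math>n = 1</math> (a minimal sketch; the function name is illustrative):

<syntaxhighlight lang="python">
import math

# Report the sample variance only when it is defined, i.e. n > 1;
# for a single sample it is undefined, not zero.
def finalize(M2, n):
    return M2 / (n - 1) if n > 1 else float("nan")

print(math.isnan(finalize(0.0, 1)))  # True: undefined for one sample
print(finalize(8.0, 3))              # 4.0
</syntaxhighlight>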
{{reflist-talk}}
== Easiest online algorithm yet. ==
== Pseudocode is valid Python, but gets tripped up by integer division ==
Would anyone object to a note pointing out that the pseudocode can be run verbatim in a Python interpreter, but that in Python < 3.0 you'll need a <code>from __future__ import division</code> to avoid integer division?
::I think it would be inappropriate. The whole idea of pseudocode is that it's not language specific. -- [[User:RoySmith|RoySmith]] [[User Talk:RoySmith|(talk)]] 18:52, 25 December 2010 (UTC)
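The pitfall being discussed, concretely (under Python 3 or with the <code>__future__</code> import, <code>/</code> is true division; Python 2's old behavior survives as <code>//</code>):

<syntaxhighlight lang="python">
# In Python 2, 1 / 2 == 0, silently truncating every division in the
# pseudocode; true division gives 0.5, and floor division (//) keeps
# the old integer behavior explicitly.
n = 2
print(1 / n)    # 0.5 under true division
print(1 // n)   # 0, the floor-division result
</syntaxhighlight>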
== Weighted Incremental Algorithm Alternative Incorrect ==
<syntaxhighlight lang="python">
def weighted_incremental_variance(dataWeightPairs):
...
M2 = M2 + sumweight * delta * R # Alternatively, "M2 = M2 + weight * delta * (x−mean)"
...
</syntaxhighlight>
I did the math to see if the stated update formula is indeed correct with the alternative mentioned in the comment; I have found that it is not. As such, I propose the alternative comment be removed.
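For reference, here is a runnable sketch of the primary update in question (West's formulation, with <code>M2 += sumweight * delta * R</code>), checked against a direct weighted computation; names are illustrative.

<syntaxhighlight lang="python">
# Weighted incremental mean/variance (West's formulation).
def weighted_incremental(pairs):
    sumweight, mean, M2 = 0.0, 0.0, 0.0
    for x, w in pairs:
        temp = w + sumweight
        delta = x - mean
        R = delta * w / temp
        mean += R
        M2 += sumweight * delta * R
        sumweight = temp
    return mean, M2 / sumweight  # population-style weighted variance

pairs = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0)]
mean, var = weighted_incremental(pairs)

# Direct two-pass check.
wsum = sum(w for _, w in pairs)
direct_mean = sum(w * x for x, w in pairs) / wsum
direct_var = sum(w * (x - direct_mean) ** 2 for x, w in pairs) / wsum
print(abs(mean - direct_mean) < 1e-12 and abs(var - direct_var) < 1e-12)
</syntaxhighlight>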
Revision of {{oldid|Algorithms for calculating variance|270157362|12 February 2009}} further "corrected" titles by eliminating roman numerals. Today's revision of {{oldid|Algorithms for calculating variance|707525590}} replaced roman numerical references from the text according to the above map. [[User:Ale2006|ale]] ([[User talk:Ale2006|talk]]) 10:32, 29 February 2016 (UTC)
== Issues with the section called "Welford's online algorithm" ==
1. I don't see the point of these equations:
<math>s_n^2 = \frac{n-2}{n-1}\,s_{n-1}^2 + \frac{(x_n - \bar x_{n-1})^2}{n} = s_{n-1}^2 + \frac{(x_n - \bar x_{n-1})^2}{n} - \frac{s_{n-1}^2}{n-1}, \quad n > 1</math>
<math>\sigma_n^2 = \frac{(n-1)\,\sigma_{n-1}^2 + (x_n - \bar x_{n-1})(x_n - \bar x_n)}{n} = \sigma_{n-1}^2 + \frac{(x_n - \bar x_{n-1})(x_n - \bar x_n) - \sigma_{n-1}^2}{n}.</math>
They are just scaled versions of the equation that follows. I recommend removal.
2. The comment "These formulas suffer from numerical instability, as they repeatedly subtract a small number from a big number which scales with n." is confusing. It looks to me like some of the formulas repeatedly add a small number to a large number. In this respect they are no worse than the equation that follows.
3. The advantage of the <math>M_{2,n}</math> recurrence is that it is simpler and more computationally efficient because there are fewer multiplications and no divisions.
4. The last two equations should be
<math>s_n^2 = \frac{M_{2,n}}{n-1}, \qquad \sigma_n^2 = \frac{M_{2,n}}{n}</math>
[[User:ProfRB|ProfRB]] ([[User talk:ProfRB|talk]]) 18:51, 22 March 2019 (UTC)
:2. I think the comment "These formulas suffer from numerical instability, as they repeatedly subtract a small number from a big number which scales with n" is actually wrong. Maybe there is an '''accuracy''' issue, but I don't see why there should be an instability here. A reference would be most welcome. [[User:Natchouf|Natchouf]] ([[User talk:Natchouf|talk]]) 14:08, 20 March 2023 (UTC)
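On points 1 and 2: a quick numerical check (Python sketch, names illustrative) that the <math>\sigma_n^2</math> recurrence tracks <math>M_{2,n}/n</math> exactly, apart from roundoff, so the two formulations agree step by step:

<syntaxhighlight lang="python">
# Run the sigma^2 recurrence and Welford's M2 recurrence side by side;
# in exact arithmetic sigma2 == M2 / n at every step.
xs = [1.0, 4.0, 2.0, 8.0, 5.0]
mean, sigma2, M2 = xs[0], 0.0, 0.0
for n, x in enumerate(xs[1:], start=2):
    old_mean = mean
    mean = old_mean + (x - old_mean) / n
    sigma2 += ((x - old_mean) * (x - mean) - sigma2) / n
    M2 += (x - old_mean) * (x - mean)
print(abs(sigma2 - M2 / len(xs)) < 1e-12)
</syntaxhighlight>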
== Typo in "Computing shifted data" section ==
The second shown formula in this section does not compute the population variance <math>\sigma^2</math>, it rather computes the sample variance <math>s^2</math>. This can easily be seen when comparing the (wrong) formula divisor <math>n-1</math> with the respective divisor <math>n</math> as used in the section "Naive algorithm" just above (where capital <math>N</math> is used instead of <math>n</math>).
The comments in the corresponding code snippet below make the situation a bit clearer. Maybe one could write the code explicitly as
<syntaxhighlight lang="text">
...
// for sample variance use
variance = (Ex2 - Ex**2 / n) / (n - 1)
// for population variance use
// variance = (Ex2 - Ex**2 / n) / n
...
</syntaxhighlight>
[[Special:Contributions/141.249.133.134|141.249.133.134]] ([[User talk:141.249.133.134|talk]]) 06:17, 10 April 2024 (UTC)
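A runnable sketch of the shifted-data computation with both divisors made explicit (the shift <code>K</code> is taken as the first data point; function and flag names are illustrative):

<syntaxhighlight lang="python">
# Shifted-data (computed-shift) variance: shifting by K near the mean
# reduces catastrophic cancellation in Ex2 - Ex**2 / n.
def shifted_variance(data, sample=True):
    K = data[0]                       # any value inside the data range works
    n = len(data)
    Ex = sum(x - K for x in data)
    Ex2 = sum((x - K) ** 2 for x in data)
    num = Ex2 - Ex ** 2 / n
    return num / (n - 1) if sample else num / n   # sample vs population

data = [4.0, 7.0, 13.0, 16.0]
print(shifted_variance(data, sample=True))    # 30.0 (divisor n - 1)
print(shifted_variance(data, sample=False))   # 22.5 (divisor n)
</syntaxhighlight>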