{{WikiProject banner shell|
{{WikiProject Mathematics| importance = mid }}
}}
== Online algorithm in testing yields horrible results when mean=~0 ==
Almost all the tests I've seen of these algorithms add some unrealistically large constant (e.g. 10^6 or larger) to the dataset to demonstrate that the algorithm suggested on this page is indeed better. I naively used this algorithm in my own work, to horrible effect. My dataset consists of a large number of discrete values, typically -1, 0, or 1, with an average usually between -1 and 1. I wrote the following simple test program to demonstrate the difference in results between what I'll call METHOD1 (using a running sum and sum of squares of the dataset) and METHOD2 (using a running computation of the average and the variance, which the current wiki strongly recommends).
<syntaxhighlight lang="cpp">
#include <iostream>
#include <iomanip>
// ...
return 0;
}
</syntaxhighlight>
And here is sample output:
<syntaxhighlight lang="text">
numtrials float avg 2 float avg 1 double avg 2 double avg 1 float std 2 float std 1 double std 2 double std 1
1000000 0.948275 0.950115 0.950115 0.950115 0.218147 0.217707 0.217707 0.217707
...
38000000 0.816234 0.441506 0.950036 0.950036 0.195328 0.496567 0.217871 0.217871
39000000 0.816234 0.430185 0.950032 0.950032 0.194878 0.495102 0.217878 0.217878
</syntaxhighlight>
Note how the computed average using floats and method 2 fails to six-digit accuracy before even 1 million trials, while method 1 using floats reproduces the double results all the way out to 17 million trials. <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/173.219.85.45|173.219.85.45]] ([[User talk:173.219.85.45|talk]]) 19:41, 19 October 2012 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->
:Method 1's sum += sample is exact (with sample values 1 and 0) until it has no effect, so method 1 fails catastrophically after about 18 million. Method 2's average += delta / numtrials is generally imprecise, and you can't expect single precision to be great here for numtrials in 6 or 7 digits, but it's not too far off on average, so even with single floats and these large numbers of samples the estimated variances are not too unreasonable. [[User:Dicklyon|Dicklyon]] ([[User talk:Dicklyon|talk]]) 21:18, 22 May 2018 (UTC)
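:: The precision cliff described above can be sketched directly (an illustrative demo using NumPy's float32, not code from the article):

```python
import numpy as np

# Sketch of the failure mode described above: float32 carries a 24-bit
# significand, so once the running sum reaches 2**24 (16,777,216) the gap
# between adjacent representable values is 2.0 and adding 1.0 rounds away.
# Hence "sum += sample" stops accumulating unit samples near 17 million.
total = np.float32(2 ** 24)
after = np.float32(total + np.float32(1.0))
increment_lost = bool(after == total)   # True: the +1.0 had no effect
```

This is why method 1's running sum freezes at about 18 million samples while its earlier partial sums were exact.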
==Untitled==
The Welford algorithm *is* an on-line algorithm, because it needs to see only one data item at a time! It does *not* need to store the complete set of data, i.e., it does not need to look at any data item more than once. <small><span class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Snoopy67|Snoopy67]] ([[User talk:Snoopy67|talk]] • [[Special:Contributions/Snoopy67|contribs]]) 21:30, 17 July 2011 (UTC)</span></small><!-- Template:Unsigned --> <!--Autosigned by SineBot-->
:The Welford algorithm as written calculates the sample variance in the finalize routine, and nowhere in the description does it specifically call that out. <!-- Template:Unsigned --><small class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:192.91.171.34|192.91.171.34]] ([[User talk:192.91.171.34#top|talk]] • [[Special:Contributions/192.91.171.34|contribs]]) </small>
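:: For reference, a minimal sketch of Welford's update (function name illustrative): each data item is seen exactly once, only three scalars are stored, and the sample variance appears only in the finalize step, as noted above.

```python
# A minimal sketch of Welford's online algorithm: one pass over the data,
# keeping only the count, the running mean, and the sum of squared
# deviations M2. The sample variance is computed at the end.
def welford(samples):
    n, mean, M2 = 0, 0.0, 0.0
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)   # second factor uses the updated mean
    # finalize step: the sample variance divides M2 by n - 1 (undefined for n < 2)
    sample_variance = M2 / (n - 1) if n > 1 else float('nan')
    return mean, sample_variance
```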
===Parallel algorithm===
Is there an error in <math>M_{2,AB}</math>?
Chan et al. in their paper <ref name=":0">{{Citation
| last1 = Chan | first1 = Tony F. | author1-link = Tony F. Chan
| last2 = Golub | first2 = Gene H. | author2-link = Gene H. Golub
| last3 = LeVeque | first3 = Randall J.
| contribution = Updating Formulae and a Pairwise Algorithm for Computing Sample Variances.
| title = Technical Report STAN-CS-79-773
| publisher = Department of Computer Science, Stanford University
| year = 1979
| contribution-url =http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf }}.</ref> has another relation for combining arbitrary sets <math>A</math> and <math>B</math>:
<math>M_{2,AB} = M_{2,A} + M_{2,B} + \frac{n_A}{n_B n_{AB}} \cdot \left( \frac{n_B}{n_A} \bar x_A - \bar x_B \right)^2 </math>, which yields <math>M_{2,AB} = M_{2,A} + M_{2,B} + \frac{1}{n_{AB}} \cdot \left( \bar x_A - \bar x_B \right)^2 </math> when <math>n_A</math> and <math>n_B</math> are equal.
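: For comparison, the combination formula as usually quoted from Chan et al. (function name illustrative) can be checked numerically against a direct single-set computation:

```python
# Sketch of the pairwise combination usually quoted from Chan et al.:
# merge the (count, mean, M2) summaries of two disjoint sets A and B
# into the summary of their union.
def combine(n_a, mean_a, M2_a, n_b, mean_b, M2_b):
    n_ab = n_a + n_b
    delta = mean_b - mean_a
    mean_ab = mean_a + delta * n_b / n_ab
    # M2 gains a cross term proportional to the squared mean difference
    M2_ab = M2_a + M2_b + delta * delta * n_a * n_b / n_ab
    return n_ab, mean_ab, M2_ab
```

With A = {1, 2} (mean 1.5, M2 = 0.5) and B = {3, 4, 5} (mean 4, M2 = 2) this reproduces the mean 3 and M2 = 10 of {1, ..., 5}, which a direct sum of squared deviations also gives.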
=== Algorithm IV ===
: Btw, I don't understand your comment at the beginning since the problematic division is outside the loop. [[User:McKay|McKay]] ([[User talk:McKay|talk]]) 01:02, 10 April 2009 (UTC)
::: It should not say "else variance = 0". That implies the sample variance is 0 when ''n'' = 1. That is incorrect. The sample variance is undefined in that case. [[User:Michael Hardy|Michael Hardy]] ([[User talk:Michael Hardy|talk]]) 01:18, 10 April 2009 (UTC)
{{reflist-talk}}
== Easiest online algorithm yet. ==
== Pseudocode is valid Python, but gets tripped up by integer division ==
Would anyone object to a note pointing out that the pseudocode can be run verbatim in a Python interpreter, but that in Python < 3.0 you'll need a <code>from __future__ import division</code> to get correct results?
::I think it would be inappropriate. The whole idea of pseudocode is that it's not language specific. -- [[User:RoySmith|RoySmith]] [[User Talk:RoySmith|(talk)]] 18:52, 25 December 2010 (UTC)
: [[User:Michael Hardy|Michael Hardy]] ([[User talk:Michael Hardy|talk]]) 19:28, 25 December 2010 (UTC)
::*Thanks for the pointer to [[Bessel's correction]], that explains this better than I've ever understood it before. And, yes, my intent in asking the question was not so much for my own knowledge (although that's part of it), but as a suggestion that something along these lines be included in this article. -- [[User:RoySmith|RoySmith]] [[User Talk:RoySmith|(talk)]] 03:28, 28 December 2010 (UTC)
:I am confused by the distinction of sampling from a population of size N vs sampling from a smaller subset n<N. The formulas to compute the variance, and its sample estimate, have exactly the same form (for finite N and n). This probably results from some authors referring to "population variance" when actually meaning "variance of the distribution" from where the independent samples originate (for example see Navidi, William (2006) Statistics for Engineers and Scientists, McGraw-Hill, pg 14.).
: The formulas in the article implicitly assume the samples to be i.i.d. random variables ([[Independent_and_identically_distributed_random_variables]]). The variance of such a population of size K (either n or N) involves dividing by K. An unbiased estimate of this variance involves dividing by K if the mean of the distribution is known, or by K-1 if the mean is also estimated from the data.
:This is consistent with the articles:
::* [[Bessel's correction]]
::* [[Variance#Sample_variance]]
:And the references:
::* Knuth, Donald Ervin, Ronald L. Graham, and Oren Patashnik. ''Concrete Mathematics''. Addison-Wesley (1989), p. 373.
::* Cramér, Harald. ''Mathematical Methods of Statistics''. Princeton University Press (1961), p. 351.
: I believe the main article should be corrected to reflect the probabilistic nature of the population (i.i.d samples) and drop mention of N vs n.
[[User:Mpbrigham|Mpbrigham]] ([[User talk:Mpbrigham|talk]]) 02:35, 26 January 2017 (UTC)
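: The bias claim above can be checked exactly on a toy case (all names illustrative): enumerating every size-2 i.i.d. sample drawn with replacement from a tiny population, the K-1 divisor recovers the divide-by-K population variance on average, while the K divisor underestimates it.

```python
from itertools import product

# Enumerate every size-2 i.i.d. sample (i.i.d. = drawing with replacement)
# from a tiny population and average the two candidate variance estimates
# exactly, illustrating Bessel's correction.
pop = [0.0, 1.0, 2.0]
pop_mean = sum(pop) / len(pop)
pop_var = sum((x - pop_mean) ** 2 for x in pop) / len(pop)   # divide by K

def est_var(xs, ddof):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

samples = list(product(pop, repeat=2))
avg_unbiased = sum(est_var(s, 1) for s in samples) / len(samples)
avg_biased = sum(est_var(s, 0) for s in samples) / len(samples)
# avg_unbiased equals pop_var exactly; avg_biased falls short of it
```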
== Weighted variance pseudocode correctness ==
== Weighted Incremental Algorithm Alternative Incorrect ==
<syntaxhighlight lang="python">
def weighted_incremental_variance(dataWeightPairs):
...
M2 = M2 + sumweight * delta * R # Alternatively, "M2 = M2 + weight * delta * (x−mean)"
...
</syntaxhighlight>
I did the math to see if the stated update formula is indeed correct with the alternative mentioned in the comment; I have found that it is not. As such, I propose the alternative comment be removed.
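: For anyone re-deriving this, the West-style weighted update can be sanity-checked numerically against a direct two-pass weighted computation; a sketch with illustrative names (not the article's exact code):

```python
# One-pass weighted variance in West's formulation, checked against a
# direct two-pass computation over (value, weight) pairs.
def weighted_variance_online(pairs):
    wsum, mean, M2 = 0.0, 0.0, 0.0
    for x, w in pairs:
        wsum += w
        delta = x - mean
        mean += (w / wsum) * delta
        M2 += w * delta * (x - mean)     # second factor uses the updated mean
    return M2 / wsum                     # biased (frequency-weight) variance

def weighted_variance_direct(pairs):
    wsum = sum(w for _, w in pairs)
    mean = sum(w * x for x, w in pairs) / wsum
    return sum(w * (x - mean) ** 2 for x, w in pairs) / wsum
```

Running both on the same (value, weight) pairs makes it easy to test any proposed variant of the M2 update line.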
Revision of {{oldid|Algorithms for calculating variance|270157362|12 February 2009}} further "corrected" titles by eliminating roman numerals. Today's revision of {{oldid|Algorithms for calculating variance|707525590}} replaced roman numerical references from the text according to the above map. [[User:Ale2006|ale]] ([[User talk:Ale2006|talk]]) 10:32, 29 February 2016 (UTC)
== Issues with the section called "Welford's online algorithm" ==
1. I don't see the point of these equations:
:<math>s_n^2 = \frac{n-2}{n-1}\, s_{n-1}^2 + \frac{(x_n - \bar x_{n-1})^2}{n} = s_{n-1}^2 + \frac{(x_n - \bar x_{n-1})^2}{n} - \frac{s_{n-1}^2}{n-1}, \quad n > 1</math>
:<math>\sigma_n^2 = \frac{(n-1)\,\sigma_{n-1}^2 + (x_n - \bar x_{n-1})(x_n - \bar x_n)}{n} = \sigma_{n-1}^2 + \frac{(x_n - \bar x_{n-1})(x_n - \bar x_n) - \sigma_{n-1}^2}{n}.</math>
They are just scaled versions of the equation that follows. I recommend removal.
2. The comment "These formulas suffer from numerical instability, as they repeatedly subtract a small number from a big number which scales with n." is confusing. It looks to me like some of the formulas repeatedly add a small number to a large number. In this respect they are no worse than the equation that follows.
3. The advantage of the <math>M_{2,n}</math> recurrence is that it is simpler and more computationally efficient, because it has fewer multiplications and no divisions.
4. The last two equations should be
:<math>\begin{align} s_n^2 &= \frac{M_{2,n}}{n-1} \\[4pt] \sigma_n^2 &= \frac{M_{2,n}}{n} \end{align}</math>
[[User:ProfRB|ProfRB]] ([[User talk:ProfRB|talk]]) 18:51, 22 March 2019 (UTC)
:2. I think the comment "These formulas suffer from numerical instability, as they repeatedly subtract a small number from a big number which scales with n" is actually wrong. Maybe there is an '''accuracy''' issue, but I don't see why there should be an instability here. A reference would be most welcome. [[User:Natchouf|Natchouf]] ([[User talk:Natchouf|talk]]) 14:08, 20 March 2023 (UTC)
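:: Regarding point 1, the scaling relationship is easy to check numerically; here is a sketch (illustrative names, double precision) of the s_n^2 recurrence, whose result can be compared against M2/(n-1) from the M2 recurrence:

```python
# Sketch of the s_n^2 recurrence from point 1, maintained together with
# the running mean of the first n - 1 samples:
#   s_n^2 = (n-2)/(n-1) * s_{n-1}^2 + (x_n - mean_{n-1})^2 / n
def sample_var_recurrence(xs):
    mean, s2 = xs[0], 0.0
    for n, x in enumerate(xs[1:], start=2):
        s2 = (n - 2) / (n - 1) * s2 + (x - mean) ** 2 / n
        mean += (x - mean) / n            # update mean_{n-1} to mean_n
    return s2
```

On small data this agrees with M2/(n-1) to rounding error, consistent with the claim that the two forms are scaled versions of each other.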
== Typo in "Computing shifted data" section ==
The second formula shown in this section does not compute the population variance <math>\sigma^2</math>; rather, it computes the sample variance <math>s^2</math>. This can easily be seen by comparing the (wrong) formula's divisor <math>n-1</math> with the respective divisor <math>n</math> used in the section "Naive algorithm" just above (where capital <math>N</math> is used instead of <math>n</math>).
The comments in the corresponding code snippet below make the situation a bit clearer. Maybe one could write the code explicitly as
<syntaxhighlight lang="python">
...
# for sample variance use
variance = (Ex2 - Ex**2 / n) / (n - 1)
# for population variance use
# variance = (Ex2 - Ex**2 / n) / n
...
</syntaxhighlight>
[[Special:Contributions/141.249.133.134|141.249.133.134]] ([[User talk:141.249.133.134|talk]]) 06:17, 10 April 2024 (UTC)
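: A complete version of that proposal might look like this sketch (function and flag names illustrative; K is an arbitrary shift, ideally near the data's mean):

```python
# Shifted-data variance with both divisors made explicit. Shifting by K
# keeps the accumulated sums small, which reduces cancellation error.
def shifted_data_variance(data, sample=True):
    K = data[0]
    n = Ex = Ex2 = 0
    for x in data:
        n += 1
        Ex += x - K
        Ex2 += (x - K) ** 2
    if sample:
        return (Ex2 - Ex ** 2 / n) / (n - 1)   # sample variance s^2
    return (Ex2 - Ex ** 2 / n) / n             # population variance sigma^2
```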