Talk:Algorithms for calculating variance: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Add topic

Revision as of 21:27, 27 October 2019 edit Fr.dror (talk \| contribs) 253 edits mNo edit summary ← Previous edit		Latest revision as of 11:22, 23 December 2024 edit undo Graham87 (talk \| contribs) Autopatrolled, Event coordinators, Extended confirmed users, Page movers, Importers, Rollbackers, Temporary account IP viewers 297,887 edits m 2 revisions imported: import old edits from "Algorithms for calculating variance/Talk" in the August 2001 database dump
(7 intermediate revisions by 7 users not shown)
Line 1: {{WikiProject ~~Statistics~~banner shell\| class = ~~start~~ Start\| ~~importance = mid }}~~ {{WikiProject Statistics\| importance = mid }} ~~{{maths rating \|frequentlyviewed = yes \| class = start \| importance = mid \| field = probability and statistics}}~~ {{WikiProject Mathematics\| importance = mid }} }} == Online algorithm in testing yields horrible results when mean=~0 == Line 7 ⟶ 8: Most all the tests I've seen of these algorithms add some unrealistic constant (i.e. 10^6 or larger) to the dataset to demonstrate that the suggested algorithm on this page is indeed better. I naively used this algorithm in my own work, to horrible effect. My dataset consists of a large number of discrete values, perhaps with the values -1, 0, or 1, and with an average usually between -1 and 1. I wrote the following simple test program to demonstrate the difference in results between what I'll call METHOD1 (using a running sum and sum of squares of the dataset) and METHOD2 (using a running computation of the average and the variance, which the current wiki strongly recommends). <~~source~~syntaxhighlight lang="cpp"> #include <iostream> #include <iomanip> Line 78 ⟶ 79: return 0; } </syntaxhighlight> ~~</source>~~ And here sample output: <~~source~~syntaxhighlight lang="text"> numtrials float avg 2 float avg 1 double avg 2 double avg 1 float std 2 float std 1 double std 2 double std 1 1000000 0.948275 0.950115 0.950115 0.950115 0.218147 0.217707 0.217707 0.217707 Line 123 ⟶ 124: 38000000 0.816234 0.441506 0.950036 0.950036 0.195328 0.496567 0.217871 0.217871 39000000 0.816234 0.430185 0.950032 0.950032 0.194878 0.495102 0.217878 0.217878 </syntaxhighlight> ~~</source>~~ Note how the computed average using float's and method 2 fails to six digits accuracy before even 1 million trials, while method 1 using floats reproduces the double results all the way out to 17 million trials. <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures\|unsigned]] comment added by [[Special:Contributions/173.219.85.45\|173.219.85.45]] ([[User talk:173.219.85.45\|talk]]) 19:41, 19 October 2012 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot--> Line 244 ⟶ 245: :The Welford algorithm as written calculates the sample variance in the finalize routine, and nowhere in the description does it specifically call that out. <!-- Template:Unsigned --><small class="autosigned">— Preceding [[Wikipedia:Signatures\|unsigned]] comment added by [[User:192.91.171.34\|192.91.171.34]] ([[User talk:192.91.171.34#top\|talk]] • [[Special:Contributions/192.91.171.34\|contribs]]) </small> ===Parallel algorithm=== Is there an error in <math>M_{2,AB}</math>? Chan et al. in its paper <ref name=":0">{{Citation \| last1 = Chan \| first1 = Tony F. \| author1-link = Tony F. Chan \| last2 = Golub \| first2 = Gene H. \| author2-link = Gene H. Golub \| last3 = LeVeque \| first3 = Randall J. \| contribution = Updating Formulae and a Pairwise Algorithm for Computing Sample Variances. \| title = Technical Report STAN-CS-79-773 \| publisher = Department of Computer Science, Stanford University \| year = 1979 \| contribution-url =http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf }}.</ref> has another relation for combining arbitrary sets <math>A</math> and <math>B</math>: <math>M_{2,AB} = M_{2,A} + M_{2,B} + \frac{n_A}{n_B n_{AB}} \cdot \left( \frac{n_B}{n_A} \bar x_A - \bar x_B \right)^2 </math>, which yelds to <math>M_{2,AB} = M_{2,A} + M_{2,B} + \frac{1}{n_{AB}} \cdot \left( \bar x_A - \bar x_B \right)^2 </math> if <math>n_A</math> and <math>n_B</math> are equal. === Algorithm IV === Line 343 ⟶ 357: : Btw, I don't understand your comment at the beginning since the problematic division is outside the loop. [[User:McKay\|McKay]] ([[User talk:McKay\|talk]]) 01:02, 10 April 2009 (UTC) ::: It should not say "else variance = 0". That implies the sample variance is 0 when ''n'' = 1. That is incorrect. The sample variance is undefined in that case. [[User:Michael Hardy\|Michael Hardy]] ([[User talk:Michael Hardy\|talk]]) 01:18, 10 April 2009 (UTC) {{reflist-talk}} == Easiest online algorithm yet. == Line 393 ⟶ 409: == Pseudocode is valid Python, but gets tripped up by integer division == Would anyone object to a note pointing out that the pseudocode can be run verbatim in a Python intepreter, but that in Python < 3.0 you'll need a <~~source~~syntaxhighlight lang="python">from __future__ import division</~~source~~syntaxhighlight> statement. (Can you tell I just got caught out by this? :D) [[User:Alexsdutton\|Alexsdutton]] ([[User talk:Alexsdutton\|talk]]) 15:05, 23 December 2010 (UTC) ::I think it would be inappropriate. The whole idea of pseudocode is that it's not language specific. -- [[User:RoySmith\|RoySmith]] [[User Talk:RoySmith\|(talk)]] 18:52, 25 December 2010 (UTC) Line 492 ⟶ 508: == Weighted Incremental Algorithm Alternative Incorrect == <~~source~~syntaxhighlight lang="python"> def weighted_incremental_variance(dataWeightPairs): ... M2 = M2 + sumweight * delta * R # Alternatively, "M2 = M2 + weight * delta * (x−mean)" ... </syntaxhighlight> ~~</source>~~ I did the math to see if the stated update formula is indeed correct with the alternative mentioned in the comment; I have found that it is not. As such, I propose the alternative comment be removed. Line 592 ⟶ 608: {\displaystyle {\begin{aligned}s_{n}^{2}&={\frac {M_{2,n}}{n-1}}\\[4pt]\sigma _{n}^{2}&={\frac {M_{2,n}}{n}}\end{aligned}}} [[User:ProfRB\|ProfRB]] ([[User talk:ProfRB\|talk]]) 18:51, 22 March 2019 (UTC) :2. I think the comment "These formulas suffer from numerical instability, as they repeatedly subtract a small number from a big number which scales with n" is actually wrong. Maybe there is an '''accuracy''' issue, but I don't see why there should be an instability here. A reference would be most welcome. [[User:Natchouf\|Natchouf]] ([[User talk:Natchouf\|talk]]) 14:08, 20 March 2023 (UTC) == Typo in "Computing shifted data" section == The second shown formula in this section does not compute the population variance <math>\sigma^2</math>, it rather computes the sample variance <math>s^2</math>. This can easily be seen when comparing the (wrong) formula divisor <math>n-1</math> with the respective divisor <math>n</math> as used in the section "Naive algorithm" just above (where capital <math>N</math> is used instead of <math>n</math>). The comments in the corresponding code snippet below makes the situation a bit clearer. Maybe one could write the code explicitly as ... // for sample variance use variance = (Ex2 - Ex2 / n) / (n - 1) // for population variance use // variance = (Ex2 - Ex2 / n) / n ... [[Special:Contributions/141.249.133.134\|141.249.133.134]] ([[User talk:141.249.133.134\|talk]]) 06:17, 10 April 2024 (UTC)