Talk:Determining the number of clusters in a data set

This is an old revision of this page, as edited by Bluedevil.knight (talk | contribs) at 19:14, 1 November 2012 (Elbow : not equivalent to the F test).


Additional updates coming

A colleague will be adding details to the "Elbow method" and "Information criteria" subsections shortly. -JohnMeier (talk) 15:10, 9 April 2009 (UTC)

Not that common problem

There are lots of alternative algorithms that do not require k to be specified beforehand. This is mostly a problem for k-means, k-medoids and the EM algorithm. Hardly any of the more recent algorithms have this parameter. --Chire2 (talk) 14:13, 7 May 2010 (UTC)

Any examples of such algorithms? Thanks. Talgalili (talk) 12:36, 20 June 2010 (UTC)

A well-known early example is the AutoClass algorithm, by Cheeseman et al. 1988, which applied a search-based method built around expectation maximization to find the maximum a posteriori distribution as a function of the number of classes. More modern approaches to this problem would equivalently apply the Bayesian information criterion to select k. Johnmark54 (talk) 15:26, 5 October 2011 (UTC)

Spectral Methods

Spectral methods automatically give k for many datasets. — Preceding unsigned comment added by 192.249.47.174 (talk) 15:38, 21 June 2012 (UTC)


A reference to such methods, please? — Preceding unsigned comment added by 152.16.225.159 (talk) 19:24, 31 October 2012 (UTC)

Information and text

The information-theoretic section is disproportionately long; it should be edited down to be commensurate with the others.

Also, I moved the heuristic about textual clustering down, as it is specialized and not of very general interest (compared to, say, the elbow method). — Preceding unsigned comment added by Bluedevil.knight (talk • contribs) 14:46, 1 November 2012 (UTC)

Elbow : not equivalent to the F test

The original says: "Percentage of variance explained is the ratio of the between-group variance to the total variance, also known as an F-test." This seems wrong. The F-test is not this ratio. The F-statistic is between-group variance over within-group variance, which does not give the percentage of variance explained. The percentage of total variance explained is between-group variance over total variance. If nobody refutes this, I will modify the original. Bluedevil.knight (talk) 19:13, 1 November 2012 (UTC)
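
To make the distinction concrete, here is a minimal sketch in Python that computes both quantities for a toy one-dimensional clustering. The function names and the sample data are hypothetical, chosen only for illustration, and the F-statistic is computed in the usual one-way ANOVA form (mean squares, i.e. sums of squares divided by their degrees of freedom), which is an assumption about what "variance" means in the sentence under discussion.

```python
from statistics import fmean


def sums_of_squares(groups):
    """Return (between_ss, within_ss, total_ss) for a list of 1-D groups."""
    all_points = [x for g in groups for x in g]
    grand_mean = fmean(all_points)
    # Between-group: squared distance of each group mean from the grand mean,
    # weighted by group size.
    between_ss = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group: squared distance of each point from its own group mean.
    within_ss = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    # Total: squared distance of each point from the grand mean.
    total_ss = sum((x - grand_mean) ** 2 for x in all_points)
    return between_ss, within_ss, total_ss


def explained_fraction(groups):
    """Fraction of variance explained: between-group over total."""
    between_ss, _, total_ss = sums_of_squares(groups)
    return between_ss / total_ss


def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    between_ss, within_ss, _ = sums_of_squares(groups)
    k = len(groups)                   # number of clusters
    n = sum(len(g) for g in groups)   # number of points
    return (between_ss / (k - 1)) / (within_ss / (n - k))


# Hypothetical toy data: two well-separated clusters.
clusters = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]
print(explained_fraction(clusters))  # between 0 and 1
print(f_statistic(clusters))         # unbounded above
```

On this toy data the explained fraction is 54/58 (about 0.93), while the F-statistic is 54. The first is bounded by 1 and the second is not, so the two cannot be the same ratio, although each can be derived from the other once k and n are known.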