Elbow method (clustering): Difference between revisions

Latest revision as of 17:59, 25 May 2025, compared with the revision as of 10:45, 16 February 2019 (30 intermediate revisions by 26 users not shown).
{{Short description|Heuristic used in computer science}}
[[Image:DataClustering ElbowCriterion.JPG|thumb|right|300px|Explained variance. The "elbow" is indicated by the red circle. The number of clusters chosen should therefore be 4.]]
In [[cluster analysis]], the '''elbow method''' is a [[heuristic]] used in [[determining the number of clusters in a data set]]. The method consists of plotting the [[explained variation]] as a function of the number of clusters and picking the [[elbow of the curve]] as the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of [[principal component]]s to describe a data set.

The method can be traced to speculation by [[Robert L. Thorndike]] in 1953.<ref>{{Cite journal
 | author = [[Robert L. Thorndike]]
 | title = Who Belongs in the Family?
 | journal = [[Psychometrika]]
 | volume = 18
 | issue = 4
 | pages = 267–276
 | date = December 1953
 | doi = 10.1007/BF02289263
 | s2cid = 120467216
}}</ref>

== Intuition ==
Using the "elbow" or "[[knee of a curve]]" as a cutoff point is a common heuristic in [[mathematical optimization]] to choose a point where [[diminishing returns]] are no longer worth the additional cost. In clustering, this means one should choose a number of clusters so that adding another cluster does not give much better modeling of the data.

The intuition is that increasing the number of clusters will naturally improve the fit (explain more of the variation), since there are more parameters (more clusters) to use, but that at some point this is [[over-fitting]], and the elbow reflects this. For instance, given data that actually consist of ''k'' labeled groups, such as ''k'' points sampled with noise, clustering with more than ''k'' clusters will "explain" more of the variation (since it can use smaller, tighter clusters), but this is over-fitting, since it subdivides the labeled groups into multiple clusters. The idea is that the first clusters will add much information (explain a lot of variation), since the data actually consist of that many groups (so these clusters are necessary), but once the number of clusters exceeds the actual number of groups in the data, the added information will drop sharply, because it is just subdividing the actual groups. Assuming this happens, there will be a sharp elbow in the graph of explained variation versus clusters: increasing rapidly up to ''k'' ([[under-fitting]] region), and then increasing slowly after ''k'' (over-fitting region).
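The behaviour described above can be illustrated with a minimal sketch. It assumes synthetic two-dimensional data drawn around three well-separated centres and uses a small hand-written k-means (Lloyd's algorithm with farthest-point initialisation). The helper name <code>kmeans_sse</code>, the data, and the seed are illustrative assumptions, not part of any cited source.

```python
import numpy as np

def kmeans_sse(X, k, n_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares.

    Initialisation is farthest-point: start from X[0], then repeatedly add
    the point farthest from all centres chosen so far (an assumption made
    here to keep the sketch deterministic).
    """
    centers = [X[0]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Synthetic data: three groups of 50 points, unit-variance noise.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in true_centers])

total_ss = ((X - X.mean(axis=0)) ** 2).sum()
ks = list(range(1, 7))
# Explained variation = 1 - (within-cluster SS / total SS).
explained = [1.0 - kmeans_sse(X, k) / total_ss for k in ks]
gains = np.diff(explained)  # marginal gain from each additional cluster

for k, e in zip(ks, explained):
    print(k, round(e, 3))
```

On data of this kind the marginal gains are large up to three clusters and small afterwards, which is the "elbow". Real data rarely has such clean separation, so the curve is usually much less clear-cut.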
== Criticism ==
The elbow method is considered both subjective and unreliable. In many practical applications, the choice of an "elbow" is highly ambiguous as the plot does not contain a sharp elbow.<ref>See, e.g., {{Cite journal
 | first1 = David J. | last1 = Ketchen, Jr
 | first2 = Christopher L. | last2 = Shook
 | title = The application of cluster analysis in Strategic Management Research: An analysis and critique
 | journal = [[Strategic Management Journal]]
 | volume = 17
 | issue = 6
 | pages = 441–458
 | year = 1996
 | url = http://www3.interscience.wiley.com/cgi-bin/fulltext/17435/PDFSTART
 | doi = 10.1002/(SICI)1097-0266(199606)17:6<441::AID-SMJ819>3.0.CO;2-G
 | url-access = subscription
}}</ref> This can even hold in cases where all other methods for [[determining the number of clusters in a data set]] (as mentioned in that article) agree on the number of clusters.
[[File:Elbow in Inertia on uniform data.png|thumb|alt=Plot of the sum of squared errors (SSE) as k increases, following a typical 1/k shape.|Example of the typical "elbow" pattern used for choosing the number of clusters, emerging even on uniform data.]]
Even on uniform random data (with no meaningful clusters) the curve approximately follows ''1/k'', where ''k'' is the number-of-clusters parameter, causing users to see an "elbow" and mistakenly choose some "optimal" number of clusters.<ref name=":0" /> Because the two axes (the number of clusters and the remaining variance) have no semantic relationship, various attempts to capture the elbow by "slope" are ill-defined and sensitive to the parameter range.<ref name=":0">{{Cite journal |last=Schubert |first=Erich |date=2023-07-05 |title=Stop using the elbow criterion for k-means and how to choose the number of clusters instead |url=https://doi.org/10.1145/3606274.3606278 |journal=ACM SIGKDD Explorations Newsletter |volume=25 |issue=1 |pages=36–42 |doi=10.1145/3606274.3606278 |issn=1931-0145 |arxiv=2212.12189 }}</ref> Increasing the maximum number of clusters can change the location of the perceived "elbow", and in many cases alternative heuristics such as the [[Calinski–Harabasz index|variance-ratio criterion]] or the [[Silhouette (clustering)|average silhouette width]] are considered to be more reliable.<ref name=":0" /> Even with such measures, however, the results may depend heavily on the data preprocessing (feature selection and scaling), and users may arrive at very different clustering results on the same data.

== Measures of variation ==
There are various measures of "[[explained variation]]" used in the elbow method. Most commonly, variation is quantified by ''[[variance]]'', and the ratio used is the ratio of between-group variance to the total variance.
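As a small worked sketch of this ratio, the following computes the between-group and total sums of squares for a given labelling. Sums of squares are used in place of variances, since the shared normalisation cancels in the ratio. The function name <code>explained_variation</code> and the four-point data set are hypothetical, chosen so the result can be checked by hand.

```python
import numpy as np

def explained_variation(X, labels):
    """Return between-group SS divided by total SS for a labelling of X."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    total_ss = ((X - grand_mean) ** 2).sum()

    between_ss = 0.0
    within_ss = 0.0
    for g in np.unique(labels):
        members = X[labels == g]
        centroid = members.mean(axis=0)
        # Each group contributes its size times its centroid's squared offset.
        between_ss += len(members) * ((centroid - grand_mean) ** 2).sum()
        within_ss += ((members - centroid) ** 2).sum()

    # Sanity check of the decomposition: total = between + within.
    assert np.isclose(total_ss, between_ss + within_ss)
    return between_ss / total_ss

# Two tight groups on a line: between SS = 100, within SS = 4, total SS = 104.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
ratio = explained_variation(X, [0, 0, 1, 1])
print(ratio)  # 100 / 104
```

Because the total splits exactly into between-group and within-group parts, this ratio equals one minus the within-group share, so it rises toward 1 as clusters become tighter.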
Alternatively, one uses the ratio of between-group variance to within-group variance, which is the one-way [[ANOVA]] [[F-test statistic|''F''-test statistic]].<ref>See, e.g., Figure 6 in
* {{Cite journal
 | first1 = Cyril | last1 = Goutte
 | first2 = Peter | last2 = Toft
 | first3 = Egill | last3 = Rostrup
 | first4 = Finn Årup | last4 = Nielsen
 | first5 = Lars Kai | last5 = Hansen
 | title = On Clustering fMRI Time Series
 | journal = [[NeuroImage]]
 | date = March 1999
 | volume = 9
 | issue = 3
 | pages = 298–310
 | doi = 10.1006/nimg.1998.0391
 | pmid = 10075900
 | citeseerx = 10.1.1.29.2679
 | s2cid = 14147564
}}</ref>

== See also ==
* [[Determining the number of clusters in a data set]]
* [[Scree plot]]

== References ==
{{reflist}}

[[Category:Clustering criteria]]
{{comp-sci-stub}}