Elbow method (clustering): Difference between revisions

Revision as of 14:54, 8 February 2024 by Chire (edit summary: Adding the requested image, indeed this is easy to generate.). Latest revision as of 17:59, 25 May 2025 by OAbot (minor edit; edit summary: Open access bot: url-access updated in citation with #oabot.).
(3 intermediate revisions by 3 users not shown)
\| url = http://www3.interscience.wiley.com/cgi-bin/fulltext/17435/PDFSTART \| doi = 10.1002/(SICI)1097-0266(199606)17:6<441::AID-SMJ819>3.0.CO;2-G \| url-access = subscription }}{{dead link\|date=February 2019\|bot=medic}}{{cbignore\|bot=medic}}</ref> This can even hold in cases where all other methods for [[determining the number of clusters in a data set]] (as mentioned in that article) agree on the number of clusters.

[[File:"Elbow" in Inertia on uniform data.png\|thumb\|alt=Plot of the sum of squared errors (SSE) as k increases, following a typical 1/k shape.\|Example of the typical "elbow" pattern used for choosing the number of clusters even emerging on uniform data.]]

Even on uniform random data (with no meaningful clusters), the curve approximately follows ''1/k'', where ''k'' is the number of clusters, so users may see an "elbow" and mistakenly choose some "optimal" number of clusters.<ref name=":0" /> Because the two axes (the number of clusters and the remaining variance) have no semantic relationship, various attempts to capture the elbow by "slope" are ill-defined and sensitive to the parameter range.<ref name=":0">{{Cite journal \|last=Schubert \|first=Erich \|date=2023-07-05 \|title=Stop using the elbow criterion for k-means and how to choose the number of clusters instead \|url=https://doi.org/10.1145/3606274.3606278 \|journal=ACM SIGKDD Explorations Newsletter \|volume=25 \|issue=1 \|pages=36–42 \|doi=10.1145/3606274.3606278 \|issn=1931-0145\|arxiv=2212.12189 }}</ref> Increasing the maximum number of clusters can change the location of the perceived "elbow", and in many cases alternative heuristics such as the [[Calinski–Harabasz index\|variance ratio criterion]] or the [[Silhouette (clustering)\|average silhouette width]] are considered to be more reliable.<ref name=":0" /> But even with such measures, the results may depend heavily on the data preprocessing (feature selection and scaling), and users may arrive at very different clustering results on the same data.

== Measures of variation ==
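
The within-cluster sum of squared errors (SSE) on which the elbow plot is based can be computed directly. The following is a minimal sketch, not a reference implementation: it uses a plain NumPy version of Lloyd's algorithm with random initialization, and the helper name <code>kmeans_sse</code>, the sample size, the seeds, and the range of ''k'' are all illustrative assumptions. It prints the SSE for uniform random 2D data, which contains no meaningful clusters, next to SSE(1)/''k'' for comparison.

```python
import numpy as np


def kmeans_sse(X, k, n_iter=50, seed=0):
    """Run basic Lloyd's k-means and return the final sum of squared errors.

    Hypothetical helper for illustration only: single random
    initialization, fixed iteration count, no convergence check.
    """
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as initial centers (indexing yields a copy).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every center: shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # SSE: squared distance of each point to its nearest final center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


# Uniform random data in the unit square: no cluster structure by construction.
data_rng = np.random.default_rng(42)
X = data_rng.uniform(0.0, 1.0, size=(2000, 2))

ks = list(range(1, 11))
sse = [kmeans_sse(X, k) for k in ks]

# Columns: k, SSE(k), and SSE(1)/k as a reference curve.
for k, s in zip(ks, sse):
    print(f"k={k:2d}  SSE={s:8.2f}  SSE(1)/k={sse[0] / k:8.2f}")
```

Plotting <code>sse</code> against <code>ks</code> gives the kind of curve shown in the figure above. Because a single random initialization is used, individual values may vary with the seed.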