Nearest-neighbor chain algorithm: Difference between revisions

Line 9:
==Background==
[[File:Hierarchical clustering diagram.png|thumb|upright=1.35|A hierarchical clustering of six points. The points to be clustered are at the top of the diagram, and the nodes below them represent clusters.]]
Many problems in [[data analysis]] concern [[Cluster analysis|clustering]], grouping data items into clusters of closely related items. [[Hierarchical clustering]] is a version of cluster analysis in which the clusters form a hierarchy or tree-like structure rather than a strict partition of the data items. In some cases, this type of clustering may be used to carry out cluster analysis at multiple different scales simultaneously. In others, the data to be analyzed naturally has an unknown tree structure, and the goal is to recover that structure through the analysis. Both kinds of analysis can be seen, for instance, in the application of hierarchical clustering to [[Taxonomy (biology)|biological taxonomy]]. In this application, different living things are grouped into clusters at different scales or levels of similarity ([[Taxonomic rank|species, genus, family, etc.]]). This analysis simultaneously gives a multi-scale grouping of the organisms of the present age, and aims to accurately reconstruct the [[branching process]] or [[Phylogenetic tree|evolutionary tree]] that in past ages produced these organisms.<ref>{{citation
| last = Gordon | first = Allan D.
| editor1-last = Arabie | editor1-first = P.
Line 36:
| arxiv = cs.DS/9912014
| issue = 1
| journal = ACM Journal of Experimental Algorithmics
| pages = 1–23
| publisher = ACM
Line 100:
For the same reason, the total time used by the algorithm outside of these distance calculations is {{math|O(''n''<sup>2</sup>)}}.<ref name="murtagh-tcj"/>
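The stack-based procedure analyzed above can be sketched as follows. This is an illustrative sketch, not code from any cited source: the function name <code>nn_chain</code> and the representation of clusters as frozensets of point indices are choices made here for clarity, and the distance function is assumed to be reducible with no tied distances, so that mutual nearest neighbors are well defined.

```python
def nn_chain(dist, n):
    """Sketch of the nearest-neighbor chain algorithm.

    dist(A, B) gives the distance between clusters A and B, each a
    frozenset of point indices.  Assumes dist is reducible and tie-free.
    Returns the sequence of merges performed.
    """
    active = {frozenset([i]) for i in range(n)}  # current clusters
    stack = []                                   # the chain of clusters
    merges = []
    while len(active) > 1:
        if not stack:
            stack.append(next(iter(active)))     # start a new chain anywhere
        top = stack[-1]
        # Find the nearest active neighbor of the cluster atop the stack.
        nearest = min((c for c in active if c != top),
                      key=lambda c: dist(top, c))
        if len(stack) >= 2 and nearest == stack[-2]:
            # top and nearest are mutual nearest neighbors: merge them.
            stack.pop()
            stack.pop()
            active.remove(top)
            active.remove(nearest)
            active.add(top | nearest)
            merges.append((top, nearest))
        else:
            stack.append(nearest)                # extend the chain
    return merges
```

Each iteration either lengthens the chain or performs a merge and shortens it, which is what bounds the total work at {{math|O(''n''<sup>2</sup>)}} distance evaluations.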
 
Since the only data structures are the set of active clusters and the stack containing a subset of the active clusters, the space required is linear in the number of input points.<ref name="murtagh-tcj"/> However, an implementation that stores a pairwise distance matrix can be faster, as each distance is then computed only once, and the Lance–Williams equations can be used to compute the distances from a merged cluster to all other clusters cheaply, at the cost of increasing the memory requirement to quadratic. For the special case of squared Euclidean distance, cluster centers can replace the sets of data points, which also reduces the number of distance computations while keeping the memory usage linear.
By using index structures to accelerate nearest-neighbor searches, instead of scanning all other clusters, the algorithm can run on many real data sets in subquadratic time with linear memory; however, the theoretical worst-case complexity does not improve and remains quadratic.
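The Lance–Williams update mentioned above expresses the distance from any cluster ''k'' to a merged cluster ''i'' ∪ ''j'' in terms of three already-stored distances, so each entry of the distance matrix can be updated in constant time after a merge. The following sketch is illustrative only (the function names are not from any particular library); the coefficients shown are the standard ones for single and complete linkage.

```python
def lance_williams(d_ki, d_kj, d_ij, a_i, a_j, beta, gamma):
    """Generic Lance-Williams recurrence: distance from cluster k
    to the merged cluster i-union-j, given d(k,i), d(k,j), d(i,j)."""
    return a_i * d_ki + a_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)

def single_linkage_update(d_ki, d_kj, d_ij):
    # alpha_i = alpha_j = 1/2, beta = 0, gamma = -1/2
    # reduces algebraically to min(d_ki, d_kj).
    return lance_williams(d_ki, d_kj, d_ij, 0.5, 0.5, 0.0, -0.5)

def complete_linkage_update(d_ki, d_kj, d_ij):
    # gamma = +1/2 instead gives max(d_ki, d_kj).
    return lance_williams(d_ki, d_kj, d_ij, 0.5, 0.5, 0.0, 0.5)
```

Because the update reads only matrix entries, no access to the original data points is needed once the matrix is built, which is why this variant trades quadratic memory for cheaper distance computations.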
 
==Correctness==
Line 201 ⟶ 200:
 
===Centroid distance===
Another distance measure commonly used in agglomerative clustering is the distance between the centroids of pairs of clusters, also known as the weighted group method.<ref name="mirkin"/><ref name="lance-williams"/> It can be computed in constant time per distance calculation. However, it is not reducible: if the input forms the set of three points of an [[equilateral triangle]], merging two of these points into a larger cluster moves that cluster's centroid closer to the third point, so an inter-cluster distance decreases, violating reducibility. Therefore, the nearest-neighbor chain algorithm will not necessarily find the same clustering as the greedy algorithm. Nevertheless, {{harvtxt|Murtagh|1983}} writes that the nearest-neighbor chain algorithm provides "a good [[heuristic]]" for the centroid method.<ref name="murtagh-tcj"/>
A different algorithm by {{harvtxt|Day|Edelsbrunner|1984}} can be used to find the greedy clustering in {{math|''O''(''n''<sup>2</sup>)}} time for this distance measure.<ref name="day-edels"/>
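The equilateral-triangle example can be checked numerically. The following snippet is illustrative only (the variable names are chosen here, not taken from any source): it merges two corners of a unit equilateral triangle and confirms that the merged centroid lies closer to the third corner than either original point did.

```python
import math

# Unit equilateral triangle: every pairwise distance is 1.
a, b = (0.0, 0.0), (1.0, 0.0)
c = (0.5, math.sqrt(3) / 2)

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Under centroid distance, the merged cluster {a, b} is represented
# by the midpoint of a and b.
merged = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

# dist(merged, c) = sqrt(3)/2, roughly 0.866, which is less than the
# original unit distances: the merge decreased an inter-cluster
# distance, so reducibility fails for centroid distance.
violation = dist(merged, c) < min(dist(a, c), dist(b, c))
```

This is exactly the failure mode that lets the nearest-neighbor chain algorithm diverge from the greedy clustering for this distance measure.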