Nearest-neighbor chain algorithm: Difference between revisions

Line 1:
{{Short description|Stack-based method for clustering}}
{{good article}}
In the theory of [[cluster analysis]], the '''nearest-neighbor chain algorithm''' is an [[algorithm]] that can speed up several methods for [[agglomerative hierarchical clustering]]. These are methods that take a collection of points as input, and create a hierarchy of clusters of points by repeatedly merging pairs of smaller clusters to form larger clusters. The clustering methods that the nearest-neighbor chain algorithm can be used for include [[Ward's method]], [[complete-linkage clustering]], and [[single-linkage clustering]]; these all work by repeatedly merging the closest two clusters but use different definitions of the distance between clusters. The cluster distances for which the nearest-neighbor chain algorithm works are called ''reducible'' and are characterized by a simple inequality among certain cluster distances.
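
In symbols, a cluster distance {{math|''d''}} is ''reducible'' when merging two mutual nearest neighbors {{math|''A''}} and {{math|''B''}} can never bring the merged cluster closer to a third cluster {{math|''C''}} than the nearer of the two was before; this is a compact statement of the condition detailed later in the article:
<math display=block>d(A \cup B, C) \ge \min\bigl(d(A, C),\, d(B, C)\bigr).</math>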
 
Line 8 ⟶ 9:
==Background==
[[File:Hierarchical clustering diagram.png|thumb|upright=1.35|A hierarchical clustering of six points. The points to be clustered are at the top of the diagram, and the nodes below them represent clusters.]]
Many problems in [[data analysis]] concern [[Cluster analysis|clustering]], grouping data items into clusters of closely related items. [[Hierarchical clustering]] is a version of cluster analysis in which the clusters form a hierarchy or tree-like structure rather than a strict partition of the data items. In some cases, this type of clustering may be used as a way of performing cluster analysis at multiple different scales simultaneously. In others, the data to be analyzed naturally has an unknown tree structure and the goal is to recover that structure by performing the analysis. Both of these kinds of analysis can be seen, for instance, in the application of hierarchical clustering to [[Taxonomy (biology)|biological taxonomy]]. In this application, different living things are grouped into clusters at different scales or levels of similarity ([[Taxonomic rank|species, genus, family, etc]]). This analysis simultaneously gives a multi-scale grouping of the organisms of the present age, and aims to accurately reconstruct the [[branching process]] or [[Phylogenetic tree|evolutionary tree]] that in past ages produced these organisms.<ref>{{citation
| last = Gordon | first = Allan D.
| editor1-last = Arabie | editor1-first = P.
Line 35 ⟶ 36:
| arxiv = cs.DS/9912014
| issue = 1
| journal = Journal of Experimental Algorithmics
| pages = 1–23
| publisher = ACM
Line 41 ⟶ 42:
| url = http://www.jea.acm.org/2000/EppsteinDynamic/
| volume = 5
| year = 2000| doi = 10.1145/351827.351829 | bibcode = 1999cs.......12014E | s2cid = 1357701 }}.</ref><ref name="day-edels">{{citation
| last1 = Day | first1 = William H. E.
| last2 = Edelsbrunner | first2 = Herbert | author2-link = Herbert Edelsbrunner
Line 51 ⟶ 52:
| url = http://www.cs.duke.edu/~edels/Papers/1984-J-05-HierarchicalClustering.pdf
| volume = 1
| year = 1984| s2cid = 121201396}}.</ref> The nearest-neighbor chain algorithm uses less time and space than the greedy algorithm by merging pairs of clusters in a different order. In this way, it avoids the problem of repeatedly finding closest pairs. Nevertheless, for many types of clustering problem, it can be guaranteed to come up with the same hierarchical clustering as the greedy algorithm despite the different merge order.<ref name="murtagh-tcj"/>
 
==The algorithm==
[[File:Nearest-neighbor chain algorithm animated.gif|frame|300px|alt=Animated execution of Nearest-neighbor chain algorithm|Animation of the algorithm using Ward's distance. Black dots are points, grey regions are larger clusters, blue arrows point to nearest neighbors, and the red bar indicates the current chain. For visual simplicity, when a merge leaves the chain empty, it continues with the recently merged cluster.]]
Intuitively, the nearest neighbor chain algorithm repeatedly follows a chain of clusters {{math|''A'' → ''B'' → ''C'' → ...}} where each cluster is the nearest neighbor of the previous one, until reaching a pair of clusters that are mutual nearest neighbors.<ref name="murtagh-tcj">{{citation
| last = Murtagh | first = Fionn
Line 69 ⟶ 71:
| editor1-last = Abello | editor1-first = James M.
| editor2-last = Pardalos | editor2-first = Panos M.
| editor3-last = Resende | editor3-first = Mauricio G. C. | editor3-link = Mauricio Resende
| contribution = Clustering in massive data sets
| isbn = 978-1-4020-0489-6
Line 92 ⟶ 94:
 
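The following is a minimal sketch of this chain-following loop in Python. It is not taken from the cited sources: the function and variable names are illustrative, and it assumes Ward's distance in its centroid form (chosen here because Ward's distance is reducible, so the merges agree with the greedy algorithm).

<syntaxhighlight lang="python">
import math


def nn_chain(points):
    """Cluster ``points`` (tuples of coordinates) bottom-up with the
    nearest-neighbor chain algorithm; returns the merges in order."""
    # Each active cluster is a frozenset of point indices, mapped to
    # its centroid and its size.
    clusters = {frozenset([i]): (p, 1) for i, p in enumerate(points)}

    def ward(a, b):
        # Ward's merge cost in centroid form:
        # (|A||B| / (|A|+|B|)) * squared distance between centroids.
        (ca, na), (cb, nb) = clusters[a], clusters[b]
        d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return na * nb / (na + nb) * d2

    def nearest(c, prefer):
        # Nearest active neighbor of c; ties favor ``prefer`` so that
        # mutual nearest neighbors are recognized.
        best, best_d = None, math.inf
        for other in clusters:
            if other == c:
                continue
            d = ward(c, other)
            if d < best_d or (d == best_d and other == prefer):
                best, best_d = other, d
        return best

    merges = []
    chain = []  # the stack of clusters forming the current chain
    while len(clusters) > 1:
        if not chain:
            # Start a new chain from an arbitrary active cluster.
            chain.append(next(iter(clusters)))
        top = chain[-1]
        prev = chain[-2] if len(chain) > 1 else None
        nbr = nearest(top, prefer=prev)
        if nbr == prev:
            # top and prev are mutual nearest neighbors: merge them,
            # removing both from the chain and from the active set.
            chain.pop()
            chain.pop()
            (ca, na), (cb, nb) = clusters.pop(top), clusters.pop(nbr)
            centroid = tuple((na * x + nb * y) / (na + nb)
                             for x, y in zip(ca, cb))
            clusters[top | nbr] = (centroid, na + nb)
            merges.append((top, nbr))
        else:
            chain.append(nbr)  # extend the chain to the nearest neighbor
    return merges


# Example: six points on a line.
print(nn_chain([(0.0,), (1.0,), (3.0,), (6.0,), (10.0,), (15.0,)]))
</syntaxhighlight>

Representing each cluster by its centroid and size suffices for Ward's distance; other reducible distances, such as complete linkage, would instead be maintained through a formula of the Lance–Williams type.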
==Time and space analysis==
Each iteration of the loop performs a single search for the nearest neighbor of a cluster, and either adds one cluster to the stack or removes two clusters from it. Each cluster is only ever added once to the stack, because when it is removed again it is immediately made inactive and merged. There are a total of {{math|2''n'' &minus; 2}} clusters that ever get added to the stack: {{math|''n''}} single-point clusters in the initial set, and {{math|''n'' &minus; 2}} internal nodes other than the root in the binary tree representing the clustering. Therefore, the algorithm performs {{math|2''n'' &minus; 2}} pushing iterations and {{math|''n'' &minus; 1}} popping iterations.<ref name="murtagh-tcj"/>
 
Each of these iterations may spend time scanning as many as {{math|''n'' &minus; 1}} inter-cluster distances to find the nearest neighbor.
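Combining these counts, the loop runs at most {{math|(2''n'' &minus; 2) + (''n'' &minus; 1)}} times, and each iteration scans at most {{math|''n'' &minus; 1}} distances, giving a total running time of
<math display=block>(3n - 3)(n - 1) = O(n^2).</math>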
Line 198 ⟶ 200:
 
===Centroid distance===
Another distance measure commonly used in agglomerative clustering is the distance between the centroids of pairs of clusters, also known as the weighted group method.<ref name="mirkin"/><ref name="lance-williams"/> It can be calculated easily in constant time per distance calculation. However, it is not reducible. For instance, if the input forms the set of three points of an [[equilateral triangle]], merging two of these points into a larger cluster causes the inter-cluster distance to decrease, a violation of reducibility. Therefore, the nearest-neighbor chain algorithm will not necessarily find the same clustering as the greedy algorithm. Nevertheless, {{harvtxt|Murtagh|1983}} writes that the nearest-neighbor chain algorithm provides "a good [[heuristic]]" for the centroid method.<ref name="murtagh-tcj"/>
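To see the failure concretely, take an equilateral triangle with unit side length: after merging two of its vertices, the centroid of the merged pair is the midpoint of their shared edge, so its distance to the remaining vertex is the triangle's height,
<math display=block>\sqrt{1^2 - \left(\tfrac{1}{2}\right)^2} = \frac{\sqrt{3}}{2} \approx 0.866 < 1,</math>
strictly smaller than the distance before the merge.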
A different algorithm by {{harvtxt|Day|Edelsbrunner|1984}} can be used to find the greedy clustering in {{math|''O''(''n''<sup>2</sup>)}} time for this distance measure.<ref name="day-edels"/>