{{Short description|Stack-based method for clustering}}
{{good article}}
In the theory of [[cluster analysis]], the '''nearest-neighbor chain algorithm''' is an [[algorithm]] that can speed up several methods for [[agglomerative hierarchical clustering]]. These are methods that take a collection of points as input, and create a hierarchy of clusters of points by repeatedly merging pairs of smaller clusters to form larger clusters. The clustering methods that the nearest-neighbor chain algorithm can be used for include [[Ward's method]], [[complete-linkage clustering]], and [[single-linkage clustering]]; these all work by repeatedly merging the closest two clusters but use different definitions of the distance between clusters. The cluster distances for which the nearest-neighbor chain algorithm works are called ''reducible'' and are characterized by a simple inequality among certain cluster distances.
 
The main idea of the algorithm is to find pairs of clusters to merge by following [[Path (graph theory)|paths]] in the [[nearest neighbor graph]] of the clusters. Every such path will eventually terminate at a pair of clusters that are nearest neighbors of each other, and the algorithm chooses that pair of clusters as the pair to merge. In order to save work by re-using as much as possible of each path, the algorithm uses a [[Stack (abstract data type)|stack data structure]] to keep track of each path that it follows. By following paths in this way, the nearest-neighbor chain algorithm merges its clusters in a different order than methods that always find and merge the closest pair of clusters. However, despite that difference, it always generates the same hierarchy of clusters.
 
The nearest-neighbor chain algorithm constructs a clustering in time proportional to the square of the number of points to be clustered. This is also proportional to the size of its input, when the input is provided in the form of an explicit [[distance matrix]]. The algorithm uses an amount of memory proportional to the number of points, when it is used for clustering methods such as Ward's method that allow constant-time calculation of the distance between clusters. However, for some other clustering methods it uses a larger amount of memory in an auxiliary data structure with which it keeps track of the distances between pairs of clusters.
 
==Background==
[[File:Hierarchical clustering diagram.png|thumb|upright=1.35|A hierarchical clustering of six points. The points to be clustered are at the top of the diagram, and the nodes below them represent clusters.]]
Many problems in [[data analysis]] concern [[Cluster analysis|clustering]], grouping data items into clusters of closely related items. [[Hierarchical clustering]] is a version of cluster analysis in which the clusters form a hierarchy or tree-like structure rather than a strict partition of the data items. In some cases, this type of clustering may be performed as a way of performing cluster analysis at multiple different scales simultaneously. In others, the data to be analyzed naturally has an unknown tree structure and the goal is to recover that structure by performing the analysis. Both of these kinds of analysis can be seen, for instance, in the application of hierarchical clustering to [[Taxonomy (biology)|biological taxonomy]]. In this application, different living things are grouped into clusters at different scales or levels of similarity ([[Taxonomic rank|species, genus, family, etc]]). This analysis simultaneously gives a multi-scale grouping of the organisms of the present age, and aims to accurately reconstruct the [[branching process]] or [[Phylogenetic tree|evolutionary tree]] that in past ages produced these organisms.<ref>{{citation
| last = Gordon | first = Allan D.
| editor1-last = Arabie | editor1-first = P.
| editor2-last = Hubert | editor2-first = L. J.
| editor3-last = De Soete | editor3-first = G.
| contribution = Hierarchical clustering
| contribution-url = https://books.google.com/books?id=HbfsCgAAQBAJ&pg=PA65
| isbn = 9789814504539
| ___location = River Edge, NJ
| pages = 65–121
| publisher = World Scientific
| title = Clustering and Classification
| year = 1996}}.</ref>
 
The input to a clustering problem consists of a set of points.<ref name="murtagh-tcj"/> A ''cluster'' is any proper subset of the points, and a hierarchical clustering is a [[maximal element|maximal]] family of clusters with the property that any two clusters in the family are either nested or [[disjoint set|disjoint]].
Alternatively, a hierarchical clustering may be represented as a [[binary tree]] with the points at its leaves; the clusters of the clustering are the sets of points in subtrees descending from each node of the tree.<ref>{{citation|title=Clustering|volume=10|series=IEEE Press Series on Computational Intelligence|first1=Rui|last1=Xu|first2=Don|last2=Wunsch|publisher=John Wiley & Sons|year=2008|isbn=978-0-470-38278-3|page=31|contribution-url=https://books.google.com/books?id=kYC3YCyl_tkC&pg=PA31|contribution=3.1 Hierarchical Clustering: Introduction}}.</ref>
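
As an illustrative sketch (not taken from the sources, with made-up names), this tree representation can be made concrete using nested tuples; the function below recovers the clusters as the point sets of the subtrees:
<syntaxhighlight lang="python">
def subtree_clusters(node):
    # Returns (points in this subtree, set of clusters found within it).
    # A leaf is a single input point; an internal node is a pair of subtrees.
    if not isinstance(node, tuple):
        points = frozenset([node])
        return points, {points}
    left_points, left_clusters = subtree_clusters(node[0])
    right_points, right_clusters = subtree_clusters(node[1])
    points = left_points | right_points
    return points, left_clusters | right_clusters | {points}

# A hierarchical clustering of the four points 1, 2, 3, 4, as a binary tree:
tree = (((1, 2), 3), 4)
_, clusters = subtree_clusters(tree)
# clusters contains {1}, {2}, {3}, {4}, {1, 2}, {1, 2, 3}, and {1, 2, 3, 4}
</syntaxhighlight>
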
The distance or dissimilarity should be symmetric: the distance between two points does not depend on which of them is considered first.
However, unlike the distances in a [[metric space]], it is not required to satisfy the [[triangle inequality]].<ref name="murtagh-tcj"/>
 
Next, the dissimilarity function is extended from pairs of points to pairs of clusters. Different clustering methods perform this extension in different ways. For instance, in the [[single-linkage clustering]] method, the distance between two clusters is defined to be the minimum distance between any two points from each cluster. Given this distance between clusters, a hierarchical clustering may be defined by a [[greedy algorithm]] that initially places each point in its own single-point cluster and then repeatedly forms a new cluster by merging the [[closest pair]] of clusters.<ref name="murtagh-tcj"/>
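
As a minimal sketch (in Python, with illustrative names not taken from the sources), this greedy procedure for single linkage can be written as follows; it re-scans all pairs of clusters at every step, which is the work that the nearest-neighbor chain algorithm is designed to avoid:
<syntaxhighlight lang="python">
from itertools import combinations

def single_linkage_distance(cluster_a, cluster_b, dist):
    # Single linkage: the minimum dissimilarity over all pairs of points,
    # one from each cluster.
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

def greedy_agglomerative(points, dist):
    # Start with one single-point cluster per input point.
    clusters = [frozenset([p]) for p in points]
    merges = []  # the hierarchy, recorded as a sequence of merged pairs
    while len(clusters) > 1:
        # Find the closest pair of current clusters and merge it.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_linkage_distance(pair[0], pair[1], dist))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
        merges.append((a, b))
    return merges
</syntaxhighlight>
Because every step re-examines all pairs of current clusters, this naive approach does substantially more work than the nearest-neighbor chain algorithm described below.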
 
| arxiv = cs.DS/9912014
| issue = 1
 | journal = ACM Journal of Experimental Algorithmics
| pages = 1–23
| publisher = ACM
| url = http://www.jea.acm.org/2000/EppsteinDynamic/
| volume = 5
| year = 2000| doi = 10.1145/351827.351829 | bibcode = 1999cs.......12014E | s2cid = 1357701 }}.</ref><ref name="day-edels">{{citation
| last1 = Day | first1 = William H. E.
| last2 = Edelsbrunner | first2 = Herbert | author2-link = Herbert Edelsbrunner
| url = http://www.cs.duke.edu/~edels/Papers/1984-J-05-HierarchicalClustering.pdf
| volume = 1
 | year = 1984| s2cid = 121201396}}.</ref> The nearest-neighbor chain algorithm uses a smaller amount of time and space than the greedy algorithm by merging pairs of clusters in a different order. In this way, it avoids the problem of repeatedly finding closest pairs. Nevertheless, for many types of clustering problem, it can be guaranteed to come up with the same hierarchical clustering as the greedy algorithm despite the different merge order.<ref name="murtagh-tcj"/>
 
==The algorithm==
[[File:Nearest-neighbor chain algorithm animated.gif|frame|300px|alt=Animated execution of Nearest-neighbor chain algorithm|Animation of the algorithm using Ward's distance. Black dots are points, grey regions are larger clusters, blue arrows point to nearest neighbors, and the red bar indicates the current chain. For visual simplicity, when a merge leaves the chain empty, it continues with the recently merged cluster.]]
Intuitively, the nearest neighbor chain algorithm repeatedly follows a chain of clusters {{math|''A'' → ''B'' → ''C'' → ...}} where each cluster is the nearest neighbor of the previous one, until reaching a pair of clusters that are mutual nearest neighbors.<ref name="murtagh-tcj">{{citation
| last = Murtagh | first = Fionn
| year = 1983
| url = http://www.multiresolutions.com/strule/old-articles/Survey_of_hierarchical_clustering_algorithms.pdf
| doi = 10.1093/comjnl/26.4.354| doi-access = free
}}.</ref>
 
In more detail, the algorithm performs the following steps:<ref name="murtagh-tcj"/><ref name="murtagh-hmds">{{citation
| editor1-last = Abello | editor1-first = James M.
| editor2-last = Pardalos | editor2-first = Panos M.
| editor3-last = Resende | editor3-first = Mauricio G. C. | editor3-link = Mauricio Resende
| contribution = Clustering in massive data sets
| isbn = 978-1-4020-0489-6
Line 62 ⟶ 78:
| series = Massive Computing
| title = Handbook of massive data sets
| contribution-url = https://books.google.com/books?id=_VI0LITp3ecC&pg=PA513
| volume = 4
 | year = 2002| bibcode = 2002hmds.book.....A
}}.</ref>
*Initialize the set of active clusters to consist of {{mvar|n}} one-point clusters, one for each input point.
*Let {{mvar|S}} be a [[Stack (data structure)|stack data structure]], initially empty, the elements of which will be active clusters.
**Otherwise, if {{mvar|D}} is not already in {{mvar|S}}, push it onto {{mvar|S}}.
 
When it is possible for one cluster to have multiple equal nearest neighbors, then the algorithm requires a consistent tie-breaking rule. For instance, one may assign arbitrary index numbers to all of the clusters,
and then select (among the equal nearest neighbors) the one with the smallest index number. This rule prevents certain kinds of inconsistent behavior in the algorithm; for instance, without such a rule, the neighboring cluster {{mvar|D}} might occur earlier in the stack than as the predecessor of {{mvar|C}}.<ref>For this tie-breaking rule, and an example of how tie-breaking is needed to prevent cycles in the nearest neighbor graph, see {{citation|contribution=Figure&nbsp;20.7|page=244|title=Algorithms in Java, Part 5: Graph Algorithms|first=Robert|last=Sedgewick|authorlink=Robert Sedgewick (computer scientist)|edition=3rd|publisher=Addison-Wesley|year=2004|isbn=0-201-36121-3}}.</ref>
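
The steps above can be summarized in the following Python sketch (the names are illustrative and not from the sources); it assumes a reducible cluster distance supplied as a function of two sets of points, and uses cluster index numbers for the tie-breaking rule just described:
<syntaxhighlight lang="python">
def nearest_neighbor_chain(points, cluster_distance):
    # Active clusters, each a frozenset of points, keyed by a unique index
    # so that ties between equal nearest neighbors can be broken consistently.
    active = {i: frozenset([p]) for i, p in enumerate(points)}
    next_index = len(points)
    stack = []   # the chain of clusters, stored as a stack of indices
    merges = []  # output: the pairs of clusters merged, in order
    while len(active) > 1:
        if not stack:
            # Start a new chain from an arbitrary active cluster.
            stack.append(next(iter(active)))
        top = stack[-1]
        # Nearest neighbor of the cluster on top of the stack,
        # breaking ties by choosing the smallest index.
        neighbor = min((j for j in active if j != top),
                       key=lambda j: (cluster_distance(active[top], active[j]), j))
        if len(stack) > 1 and neighbor == stack[-2]:
            # The top two clusters are mutual nearest neighbors: merge them.
            stack.pop()
            stack.pop()
            a, b = active.pop(top), active.pop(neighbor)
            merges.append((a, b))
            active[next_index] = a | b
            next_index += 1
        else:
            # Otherwise, extend the chain by pushing the neighbor onto the stack.
            stack.append(neighbor)
    return merges
</syntaxhighlight>
Calling this function with, for example, the single-linkage distance yields the same hierarchy of clusters as the greedy method, although the merges may be found in a different order; efficient implementations also cache or update inter-cluster distances rather than recomputing them from the points each time.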
 
| title = A general theory of classificatory sorting strategies. I. Hierarchical systems
| volume = 9
 | year = 1967| doi-access = free
}}.</ref>
:<math>d(A\cup B,C) = \frac{n_A+n_C}{n_A+n_B+n_C} d(A,C) + \frac{n_B+n_C}{n_A+n_B+n_C} d(B,C) - \frac{n_C}{n_A+n_B+n_C} d(A,B).</math>
Distance update formulas such as this one are called formulas "of Lance–Williams type" after the work of {{harvtxt|Lance|Williams|1967}}.
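
As an illustration (a sketch, with hypothetical names, not taken from the sources), this update can be computed in constant time from the previously stored distances and cluster sizes:
<syntaxhighlight lang="python">
def ward_update(d_ac, d_bc, d_ab, n_a, n_b, n_c):
    # Lance–Williams update for Ward's method: the distance from the merged
    # cluster A ∪ B to a third cluster C, computed from the three old
    # distances d(A,C), d(B,C), d(A,B) and the cluster sizes.
    n = n_a + n_b + n_c
    return ((n_a + n_c) * d_ac + (n_b + n_c) * d_bc - n_c * d_ab) / n
</syntaxhighlight>
Applying this update to every remaining cluster after each merge keeps all current inter-cluster distances available without returning to the original points.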
 
===Centroid distance===
Another distance measure commonly used in agglomerative clustering is the distance between the centroids of pairs of clusters, also known as the weighted group method.<ref name="mirkin"/><ref name="lance-williams"/> It can be calculated easily in constant time per distance calculation. However, it is not reducible. For instance, if the input forms the set of three points of an [[equilateral triangle]], merging two of these points into a larger cluster causes the inter-cluster distance to decrease, a violation of reducibility. Therefore, the nearest-neighbor chain algorithm will not necessarily find the same clustering as the greedy algorithm. Nevertheless, {{harvtxt|Murtagh|1983}} writes that the nearest-neighbor chain algorithm provides "a good [[heuristic]]" for the centroid method.<ref name="murtagh-tcj"/>
A different algorithm by {{harvtxt|Day|Edelsbrunner|1984}} can be used to find the greedy clustering in {{math|''O''(''n''<sup>2</sup>)}} time for this distance measure.<ref name="day-edels"/>
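
Concretely (an illustrative calculation), if the three points are at unit distance from each other, then merging two of them places the centroid of the new cluster at the midpoint of their connecting segment, whose distance to the remaining point is
:<math>\sqrt{1^2-\left(\tfrac12\right)^2}=\tfrac{\sqrt3}{2}\approx 0.87,</math>
smaller than the original unit distance from either merged point to the third point, which is the decrease in inter-cluster distance mentioned above.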
 
===Distances sensitive to merge order===
The above presentation explicitly disallowed distances sensitive to merge order. Indeed, allowing such distances can cause problems. In particular, there exist order-sensitive cluster distances which satisfy reducibility, but for which the above algorithm will return a hierarchy with suboptimal costs. Therefore, when cluster distances are defined by a recursive formula (as some of the ones discussed above are), care must be taken that they do not use the hierarchy in a way which is sensitive to merge order.<ref>{{citation
 | last= Müllner
 | first=Daniel
 | arxiv=1109.2378
 | title=Modern hierarchical, agglomerative clustering algorithms
 | year=2011
 | bibcode=2011arXiv1109.2378M
}}.</ref>
 
==History==
{{reflist|colwidth=30em}}
 
[[Category:Cluster analysis algorithms]]