Revision as of 17:37, 15 December 2016 by Tessaract2 (edit summary: Added a wikilink)
Revision as of 18:48, 15 December 2016 by David Eppstein (edit summary: I disagree with these moves. Will elaborate in GA review.)
Line 1:
In the theory of [[cluster analysis]], the '''nearest-neighbor chain algorithm''' is an [[algorithm]] that can speed up several methods for [[agglomerative hierarchical clustering]]. These are methods that take a collection of points as input, and create a hierarchy of clusters of points by repeatedly merging pairs of smaller clusters to form larger clusters. The clustering methods that the nearest-neighbor chain algorithm can be used for include [[Ward's method]], [[complete-linkage clustering]], and [[single-linkage clustering]]; these all work by repeatedly merging the closest two clusters but use different definitions of the distance between clusters. The cluster distances for which the nearest-neighbor chain algorithm works are called ''reducible'' and are characterized by a simple inequality among certain cluster distances.

The main idea of the algorithm is to find pairs of clusters to merge by following [[Path (graph theory)|paths]] in the [[nearest neighbor graph]] of the clusters. Every such path will eventually terminate at a pair of clusters that are nearest neighbors of each other, and the algorithm chooses that pair of clusters as the pair to merge. In order to save work by re-using as much as possible of each path, the algorithm uses a [[Stack (abstract data type)|stack data structure]] to keep track of each path that it follows. By following paths in this way, the nearest-neighbor chain algorithm merges its clusters in a different order than methods that always find and merge the closest pair of clusters. However, despite that difference, it always generates the same hierarchy of clusters.

The nearest-neighbor chain algorithm constructs a clustering in time proportional to the square of the number of points to be clustered. This is also proportional to the size of its input, when the input is provided in the form of an explicit [[distance matrix]].
The algorithm uses an amount of memory proportional to the number of points, when it is used for clustering methods such as Ward's method that allow constant-time calculation of the distance between clusters. However, for some other clustering methods it uses a larger amount of memory in an auxiliary data structure with which it keeps track of the distances between pairs of clusters.

==Background==
[[File:Hierarchical clustering diagram.png|thumb|upright=1.35|A hierarchical clustering of six points. The points to be clustered are at the top of the diagram, and the nodes below them represent clusters.]]
The input to a clustering problem consists of a set of points.<ref name="murtagh-tcj"/> A ''cluster'' is any proper subset of the points, and a hierarchical clustering is a [[maximal element|maximal]] family of clusters with the property that any two clusters in the family are either nested or [[disjoint set|disjoint]]. Alternatively, a hierarchical clustering may be represented as a [[binary tree]] with the points at its leaves; the clusters of the clustering are the sets of points in subtrees descending from each node of the tree.<ref>{{citation|title=Clustering|volume=10|series=IEEE Press Series on Computational Intelligence|first1=Rui|last1=Xu|first2=Don|last2=Wunsch|publisher=John Wiley & Sons|year=2008|isbn=978-0-470-38278-3|page=31|contribution-url=https://books.google.com/books?id=kYC3YCyl_tkC&pg=PA31|contribution=3.1 Hierarchical Clustering: Introduction}}.</ref>
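
The stack-based path following described in the lead can be illustrated with a minimal Python sketch. It is not taken from the article or its sources: the function names <code>nn_chain</code> and <code>complete_linkage</code>, the representation of clusters as frozensets of point indices, and the choice of complete linkage are all assumptions made for illustration. For simplicity the sketch recomputes each cluster distance from the point distances, so it does not achieve the quadratic running time stated above.

```python
from itertools import product
from math import inf


def complete_linkage(a, b, dist):
    """Complete-linkage distance: the largest point distance between a and b."""
    return max(dist[i][j] for i, j in product(a, b))


def nn_chain(dist):
    """Return the merges (cluster, cluster, distance) found by following
    nearest-neighbor paths. `dist` is a symmetric matrix of point distances."""
    active = {frozenset([i]) for i in range(len(dist))}
    stack = []   # current path in the nearest-neighbor graph
    merges = []
    while len(active) > 1:
        if not stack:
            # Start a new path from an arbitrary (here: deterministic) cluster.
            stack.append(min(active, key=sorted))
        top = stack[-1]
        prev = stack[-2] if len(stack) > 1 else None
        # Find the nearest neighbor of the top cluster, breaking ties in
        # favor of its predecessor on the stack so that the path terminates.
        best, best_d = None, inf
        for c in active:
            if c == top:
                continue
            d = complete_linkage(top, c, dist)
            if d < best_d or (d == best_d and c == prev):
                best, best_d = c, d
        if best == prev:
            # top and prev are mutual nearest neighbors: merge them and keep
            # the rest of the path on the stack for re-use.
            stack.pop()
            stack.pop()
            active -= {top, prev}
            active.add(top | prev)
            merges.append((top, prev, best_d))
        else:
            stack.append(best)
    return merges


# Usage: five points on a line, with absolute differences as distances.
points = [0, 1, 5, 6, 20]
dist = [[abs(p - q) for q in points] for p in points]
for a, b, d in nn_chain(dist):
    print(sorted(a), sorted(b), d)
```

Each recorded merge corresponds to one internal node of the binary tree described above. The merges may be found in a different order than a method that always merges the globally closest pair would find them.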

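The inequality that characterizes reducible cluster distances, mentioned in the lead, is commonly stated as follows for any three clusters. The symbols <code>A</code>, <code>B</code>, <code>C</code> and <code>d</code> are notation chosen here for illustration and do not appear in this revision's text.

```latex
d(A \cup B,\, C) \;\ge\; \min\bigl(d(A, C),\; d(B, C)\bigr)
```

In words, merging two clusters never produces a cluster that is closer to a third cluster than the nearer of the two merged clusters was. This is why a pair of mutual nearest neighbors can be merged as soon as it is found without changing the resulting hierarchy.
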
Nearest-neighbor chain algorithm: Difference between revisions