Nearest-neighbor chain algorithm: Difference between revisions

Content deleted Content added
History: more context
Background: illo; ce
Line 6:
 
==Background==
[[File:Hierarchical clustering diagram.png|thumb|upright=1.35|A hierarchical clustering of six points. The points to be clustered are at the top of the diagram, and the nodes below them represent clusters.]]
The input to a clustering problem consists of a set of points. A ''cluster'' is any proper subset of the points, and a hierarchical clustering is a [[maximal element|maximal]] family of clusters with the property that any two clusters in the family are either nested or [[disjoint set|disjoint]].
Alternatively, a hierarchical clustering may be represented as a [[binary tree]] with the points at its leaves; the clusters of the clustering are the sets of points in subtrees descending from each node of the tree.
Line 11 ⟶ 12:
In agglomerative clustering methods, the input also includes a distance function defined on the points, or a numerical measure of their dissimilarity that is symmetric (insensitive to the ordering within each pair of points) but (unlike a distance) may not satisfy the triangle inequality. Depending on the method, this dissimilarity function can be extended in several different ways to pairs of clusters; for instance, in the [[single-linkage clustering]] method, the distance between two clusters is defined to be the minimum distance between any two points from each cluster. Given this distance between clusters, a hierarchical clustering may be defined by a [[greedy algorithm]] that initially places each point in its own single-point cluster and then repeatedly merges the [[closest pair]] of clusters.<ref name="murtagh-tcj"/>
 
However, knownKnown methods for repeatedly finding the closest pair of clusters in a dynamic set of clusters either require superlinear space to maintain a [[data structure]] that can find closest pairs quickly, or they take greater than linear time to find each closest pair.<ref name="e-jea">{{citation
| last = Eppstein | first = David | authorlink = David Eppstein
| arxiv = cs.DS/9912014
Line 31 ⟶ 32:
| url = http://www.cs.duke.edu/~edels/Papers/1984-J-05-HierarchicalClustering.pdf
| volume = 1
| year = 1984}}.</ref> The nearest-neighbor chain algorithm uses a smaller amount of time and space than the greedy algorithm by merging pairs of clusters in a different order. HoweverIn this way, it avoids the problem of repeatedly finding closest pairs. Nevertheless, for many types of clustering problem, it can be guaranteed to come up with the same hierarchical clustering as the greedy algorithm despite the different merge order.
 
==The algorithm==