==Time and space analysis==
Each iteration of the loop performs a single search for the nearest neighbor of a cluster, and either adds one cluster to the stack or removes two clusters from it. Every cluster is only ever added once to the stack, because when it is removed again it is immediately made inactive and merged. There are a total of {{math|2''n'' − 2}} clusters that ever get added to the stack: {{math|''n''}} single-point clusters in the initial set, and {{math|''n'' − 2}} internal nodes other than the root in the binary tree representing the clustering. Therefore, the algorithm performs {{math|2''n'' − 2}} pushing iterations and {{math|''n'' − 1}} popping iterations.<ref name="murtagh-tcj"/>
Each of these iterations may spend time scanning as many as {{math|''n'' − 1}} inter-cluster distances to find the nearest neighbor.
The total number of distance calculations it makes is therefore less than {{math|3''n''<sup>2</sup>}}.
For the same reason, the total time used by the algorithm outside of these distance calculations is {{math|O(''n''<sup>2</sup>)}}.<ref name="murtagh-tcj"/>
Since the only data structure is the set of active clusters and the stack containing a subset of the active clusters, the space required is linear in the number of input points.<ref name="murtagh-tcj"/>
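The push and pop counts above can be checked with a short instrumented implementation. The following Python sketch is illustrative only (it is not taken from the cited sources); it uses complete linkage, a reducible dissimilarity, as the cluster distance, and counts pushing and popping iterations, which for {{mvar|n}} input points come to at most {{math|2''n'' − 2}} and exactly {{math|''n'' − 1}} respectively.

<syntaxhighlight lang="python">
def cluster_distance(a, b, points):
    # complete linkage: largest pairwise distance between the two clusters
    return max(abs(points[i] - points[j]) for i in a for j in b)

def nn_chain(points):
    active = [frozenset([i]) for i in range(len(points))]   # current clusters
    stack, merges = [], []                                   # chain of clusters, list of merges
    pushes = pops = 0
    while len(active) > 1:
        if not stack:
            stack.append(active[0])                          # start a new chain from any active cluster
            pushes += 1
            continue
        top = stack[-1]
        # scan at most n - 1 other active clusters for the nearest neighbor of the top
        others = [c for c in active if c != top]
        nearest = min(others, key=lambda c: cluster_distance(top, c, points))
        if len(stack) > 1 and cluster_distance(top, stack[-2], points) <= cluster_distance(top, nearest, points):
            # the top and the cluster below it are mutual nearest neighbors: pop both and merge
            prev = stack[-2]
            stack.pop(); stack.pop()
            pops += 1
            active.remove(top); active.remove(prev)
            active.append(top | prev)
            merges.append((top, prev))
        else:
            stack.append(nearest)                            # extend the chain by one cluster
            pushes += 1
    return merges, pushes, pops

merges, pushes, pops = nn_chain([0.0, 1.0, 3.0, 7.0])        # n = 4 points on a line
print(pushes, pops)                                          # 6 3, i.e. 2n - 2 and n - 1
</syntaxhighlight>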
==Correctness==
will in general perform its merges in a different order than the nearest-neighbor chain algorithm. Both of these properties depend on the specific choice of how to measure the distance between clusters.
The correctness of this algorithm relies on a property of its distance function called ''reducibility'', identified by {{harvtxt|Bruynooghe|1977}} in connection with an earlier clustering method that used mutual nearest neighbor pairs but not chains of nearest neighbors.<ref name="b77">{{citation|first=Michel|last=Bruynooghe|title=Méthodes nouvelles en classification automatique de données taxinomiques nombreuses|journal=Statistique et Analyse des Données|volume=3|pages=24–42|year=1977|url=http://www.numdam.org/item?id=SAD_1977__2_3_24_0}}.</ref> A distance function {{mvar|d}} on clusters is defined to be reducible if, for every three clusters {{mvar|A}}, {{mvar|B}} and {{mvar|C}} in the greedy hierarchical clustering such that {{mvar|A}} and {{mvar|B}} are mutual nearest neighbors, the following inequality holds:<ref name="murtagh-tcj"/>
:{{math|''d''(''A'' ∪ ''B'', ''C'') ≥ min(''d''(''A'',''C''), ''d''(''B'',''C''))}}.
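As a small numeric illustration (the points and helper functions below are chosen for this example and are not taken from the cited sources), the following sketch checks the inequality for a triple of singleton clusters in which {{mvar|A}} and {{mvar|B}} are mutual nearest neighbors: complete linkage satisfies it, while centroid distance violates it on the same triple and is therefore not reducible.

<syntaxhighlight lang="python">
from math import dist

def complete_linkage(x, y):
    # largest pairwise distance between points of the two clusters
    return max(dist(p, q) for p in x for q in y)

def centroid_distance(x, y):
    # distance between the centroids (coordinate-wise means) of the two clusters
    cx = [sum(c) / len(x) for c in zip(*x)]
    cy = [sum(c) / len(y) for c in zip(*y)]
    return dist(cx, cy)

A, B, C = [(-1, 0)], [(1, 0)], [(0, 3)]      # A and B are mutual nearest neighbors
for d in (complete_linkage, centroid_distance):
    lhs = d(A + B, C)                        # d(A ∪ B, C)
    rhs = min(d(A, C), d(B, C))
    print(d.__name__, lhs >= rhs)            # True for complete linkage, False for centroid distance
</syntaxhighlight>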
If a distance function has the reducibility property, then merging two clusters {{mvar|C}} and {{mvar|D}} can only cause the nearest neighbor of a third cluster {{mvar|E}} to change if that nearest neighbor was {{mvar|C}} or {{mvar|D}}. This has two important consequences for the nearest neighbor chain algorithm. First, it can be shown using this property that, at each step of the algorithm, the clusters on the stack {{mvar|S}} form a valid chain of nearest neighbors, because whenever a nearest neighbor becomes invalidated it is immediately removed from the stack.<ref name="murtagh-tcj"/>
Second, and even more importantly, it follows from this property that, if two clusters {{mvar|C}} and {{mvar|D}} both belong to the greedy hierarchical clustering, and are mutual nearest neighbors at any point in time, then they will be merged by the greedy clustering, for they must remain mutual nearest neighbors until they are merged. It follows that each mutual nearest neighbor pair found by the nearest neighbor chain algorithm is also a pair of clusters found by the greedy algorithm, and therefore that the nearest neighbor chain algorithm computes exactly the same clustering (although in a different order) as the greedy algorithm.<ref name="murtagh-tcj"/>
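This equivalence can be checked on a small example. The sketch below is again illustrative and reuses the hypothetical <code>nn_chain</code> and <code>cluster_distance</code> helpers from the sketch in the previous section; it runs the greedy algorithm directly and verifies that both procedures create the same set of merged clusters, even though the merges may be performed in a different order.

<syntaxhighlight lang="python">
from itertools import combinations

def greedy_clustering(points):
    # repeatedly merge the globally closest pair of active clusters
    active = [frozenset([i]) for i in range(len(points))]
    created = []
    while len(active) > 1:
        a, b = min(combinations(active, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], points))
        active.remove(a); active.remove(b)
        active.append(a | b)
        created.append(a | b)
    return created

points = [0.0, 1.0, 3.0, 7.0]
chain_clusters = {a | b for a, b in nn_chain(points)[0]}
print(chain_clusters == set(greedy_clustering(points)))   # True: the same hierarchy of clusters
</syntaxhighlight>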
==Application to specific clustering distances==