Nearest-neighbor chain algorithm: Difference between revisions

Content deleted Content added
increasing link information, agglomerative vs divisive hierarchical clustering
expand lead; split off history
Line 1:
In the theory of [[cluster analysis]], the '''nearest-neighbor chain algorithm''' is a method that can be used to perform several types of [[agglomerative hierarchical clustering]], usingin anwhich amounta ofhierarchy memoryof thatclusters is linearcreated inby therepeatedly numbermerging pairs of pointssmaller clusters to form larger clusters. In particular it can be clusteredused for [[Ward's method]], [[complete-linkage clustering]], and an[[single-linkage amountclustering]], ofwhich timeall linearwork inby merging the numberclosest oftwo distinctclusters distancesunder betweendifferent pairsdefinitions of points.<refthe name="murtagh-hmds">{{citationdistance between clusters.
 
| last = Murtagh | first = Fionn
| year = 2002}}.</ref> The main idea of the algorithm is to find pairs of clusters to merge by following paths in the [[nearest neighbor graph]] of the clusters until the paths terminate in pairs of mutual nearest neighbors. TheThat algorithmpair wasof developedclusters andis implementedthen inchosen 1982as the pair to merge by J.the Palgorithm. Benzécri<ref>{{citationThe resulting system of merges occurs in a different order than in a naive implementation of the same clustering algorithms, but can be shown to generate the same hierarchy of clusters.
| editor1-last = Abello | editor1-first = James M.
 
| editor2-last = Pardalos | editor2-first = Panos M.
The nearest-neighbor chain algorithm uses an amount of memory that is linear in the number of points to be clustered, and determines a clustering in time quadratic in the number of points (linear in the input size, when the input is given as an explicit distance matrix).
| editor3-last = Resende | editor3-first = Mauricio G. C.
| contribution = Clustering in massive data sets
| isbn = 978-1-4020-0489-6
| pages = 513–516
| publisher = Springer
| series = Massive Computing
| title = Handbook of massive data sets
| url = http://books.google.com/books?id=_VI0LITp3ecC&pg=PA513
| volume = 4
| year = 2002}}.</ref> The main idea of the algorithm is to find pairs of clusters to merge by following paths in the [[nearest neighbor graph]] of the clusters until the paths terminate in pairs of mutual nearest neighbors. The algorithm was developed and implemented in 1982 by J. P. Benzécri<ref>{{citation
| last = Benzécri | first = J.-P.
| issue = 2
| journal = Les Cahiers de l'Analyse des Données
| pages = 209–218
| title = Construction d'une classification ascendante hiérarchique par la recherche en chaîne des voisins réciproques
| url = http://www.numdam.org/item?id=CAD_1982__7_2_209_0
| volume = 7
| year = 1982}}.</ref> and J. Juan,<ref>{{citation
| last = Juan | first = J.
| issue = 2
| journal = Les Cahiers de l'Analyse des Données
| pages = 219–225
| title = Programme de classification hiérarchique par l'algorithme de la recherche en chaîne des voisins réciproques
| url = http://www.numdam.org/item?id=CAD_1982__7_2_219_0
| volume = 7
| year = 1982}}.</ref> based on earlier methods that constructed hierarchical clusterings using mutual nearest neighbor pairs without taking advantage of nearest neighbor chains.<ref name="b77"/><ref>{{citation
| last = de Rham | first = C.
| issue = 2
| journal = Les Cahiers de l'Analyse des Données
| pages = 135–144
| title = La classification hiérarchique ascendante selon la méthode des voisins réciproques
| url = http://www.numdam.org/item?id=CAD_1980__5_2_135_0
| volume = 5
| year = 1980}}.</ref>
 
==Background==
Line 77 ⟶ 44:
| doi = 10.1093/comjnl/26.4.354}}.</ref>
 
More formally, the algorithm performs the following steps:<ref name="murtagh-hmdstcj"/><ref name="murtagh-tcjhmds"/>{{citation
| last = Murtagh | first = Fionn
| editor1-last = Abello | editor1-first = James M.
| editor2-last = Pardalos | editor2-first = Panos M.
| editor3-last = Resende | editor3-first = Mauricio G. C.
| contribution = Clustering in massive data sets
| isbn = 978-1-4020-0489-6
| pages = 513–516
| publisher = Springer
| series = Massive Computing
| title = Handbook of massive data sets
| url = http://books.google.com/books?id=_VI0LITp3ecC&pg=PA513
| volume = 4
| year = 2002}}.</ref>
*Initialize the set of active clusters to consist of {{mvar|n}} one-point clusters, one for each input point.
*Let {{mvar|S}} be a [[Stack (data structure)|stack data structure]], initially empty, the elements of which will be active clusters.
Line 190 ⟶ 170:
| year=2011 }}.</ref> Following the earlier discussion of the value of defining cluster distances recursively (so that [[memoization]] can be used
to greatly speed up distance computations), care must be taken with recursively defined distances so that they are not using the hierarchy in a way which is sensitive to merge order.
 
==History==
The algorithm was developed and implemented in 1982 by J. P. Benzécri<ref>{{citation
| last = Benzécri | first = J.-P.
| issue = 2
| journal = Les Cahiers de l'Analyse des Données
| pages = 209–218
| title = Construction d'une classification ascendante hiérarchique par la recherche en chaîne des voisins réciproques
| url = http://www.numdam.org/item?id=CAD_1982__7_2_209_0
| volume = 7
| year = 1982}}.</ref> and J. Juan,.<ref>{{citation
| last = Juan | first = J.
| issue = 2
| journal = Les Cahiers de l'Analyse des Données
| pages = 219–225
| title = Programme de classification hiérarchique par l'algorithme de la recherche en chaîne des voisins réciproques
| url = http://www.numdam.org/item?id=CAD_1982__7_2_219_0
| volume = 7
| year = 1982}}.</ref> They based this algorithm on earlier methods that constructed hierarchical clusterings using mutual nearest neighbor pairs without taking advantage of nearest neighbor chains.<ref name="b77"/><ref>{{citation
| last = de Rham | first = C.
| issue = 2
| journal = Les Cahiers de l'Analyse des Données
| pages = 135–144
| title = La classification hiérarchique ascendante selon la méthode des voisins réciproques
| url = http://www.numdam.org/item?id=CAD_1980__5_2_135_0
| volume = 5
| year = 1980}}.</ref>
 
==References==