{{Machine learning|Clustering}}
 
In [[data mining]] and [[statistics]], '''hierarchical clustering'''<ref name="HC">{{cite book |first=Frank |last=Nielsen |title=Introduction to HPC with MPI for Data Science |year=2016 |publisher=Springer |isbn=978-3-319-21903-5 |pages=195–211 |chapter=8. Hierarchical Clustering |url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> (also called '''hierarchical cluster analysis''' or '''HCA''') is a method of [[cluster analysis]] that seeks to build a [[hierarchy]] of clusters. Strategies for hierarchical clustering generally fall into two categories:
* '''Agglomerative''': Agglomerative clustering, often referred to as a "[[Top-down and bottom-up design|bottom-up]]" approach, begins with each data point as an individual cluster. At each step, the algorithm merges the two most similar clusters based on a chosen distance metric (e.g., Euclidean distance) and linkage criterion (e.g., single-linkage, complete-linkage).<ref name=":4">{{Cite journal |last1=Murtagh |first1=Fionn |last2=Contreras |first2=Pedro |date=2012 |title=Algorithms for hierarchical clustering: an overview |url=https://wires.onlinelibrary.wiley.com/doi/10.1002/widm.53 |journal=WIREs Data Mining and Knowledge Discovery |language=en |volume=2 |issue=1 |pages=86–97 |doi=10.1002/widm.53 |issn=1942-4795|url-access=subscription }}</ref> This process continues until all data points are combined into a single cluster or a stopping criterion is met. Agglomerative methods are more commonly used due to their simplicity and computational efficiency for small to medium-sized datasets.<ref>{{Cite journal |last=Mojena |first=R. |date=1977-04-01 |title=Hierarchical grouping methods and stopping rules: an evaluation |url=https://academic.oup.com/comjnl/article-lookup/doi/10.1093/comjnl/20.4.359 |journal=The Computer Journal |language=en |volume=20 |issue=4 |pages=359–363 |doi=10.1093/comjnl/20.4.359 |issn=0010-4620}}</ref> A worked sketch of this approach follows the list below.
* '''Divisive''': Divisive clustering, known as a "[[Top-down and bottom-up design|top-down]]" approach, starts with all data points in a single cluster and recursively splits the cluster into smaller ones. At each step, the algorithm selects a cluster and divides it into two or more subsets, often using a criterion such as maximizing the distance between the resulting clusters. Divisive methods are less common but can be useful when the goal is to identify large, distinct clusters first.
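
The following is a minimal sketch of the agglomerative approach, assuming the SciPy library is available; the data points and the choice of Euclidean distance with complete linkage are illustrative assumptions, not requirements of the method.

<syntaxhighlight lang="python">
# Minimal sketch of agglomerative (bottom-up) hierarchical clustering with SciPy.
# The data points, metric, and linkage criterion are illustrative choices.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
                   [5.5, 5.2], [9.0, 1.0], [9.2, 1.3]])

# Each observation starts in its own cluster; 'linkage' repeatedly merges the
# two closest clusters under the chosen linkage criterion.
Z = linkage(points, method="complete", metric="euclidean")

# Cut the hierarchy into a flat partition with three clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# The merge history is commonly displayed as a dendrogram.
dendrogram(Z)
plt.show()
</syntaxhighlight>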
 
In general, the merges and splits are determined in a [[greedy algorithm|greedy]] manner. The results of hierarchical clustering<ref name="HC" /> are usually presented in a [[dendrogram]].
 
Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a [[distance matrix|matrix of distances]]. On the other hand, with the exception of single-linkage distance, none of the algorithms (apart from exhaustive search in <math>\mathcal{O}(2^n)</math>) can be guaranteed to find the optimum solution.{{cn|date=October 2024}}
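
For illustration, a hierarchy can be built from a precomputed distance matrix alone, without access to the original observations. The sketch below assumes SciPy is available; the symmetric matrix of pairwise distances is an arbitrary example.

<syntaxhighlight lang="python">
# Sketch: hierarchical clustering from a distance matrix only (no raw observations).
# The symmetric matrix of pairwise distances below is an arbitrary example.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

D = np.array([[0.0, 1.0, 4.0, 5.0],
              [1.0, 0.0, 3.0, 4.5],
              [4.0, 3.0, 0.0, 1.5],
              [5.0, 4.5, 1.5, 0.0]])

# SciPy expects the condensed (upper-triangular) form of the distance matrix.
condensed = squareform(D)

# Any linkage criterion can be applied to the precomputed distances.
Z = linkage(condensed, method="average")
print(fcluster(Z, t=2, criterion="maxclust"))
</syntaxhighlight>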
 
== Complexity ==
The standard algorithm for '''hierarchical agglomerative clustering''' (HAC) has a [[time complexity]] of <math>\mathcal{O}(n^3)</math> and requires <math>\Omega(n^2)</math> memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: '''SLINK'''<!--boldface per WP:R#PLA--><ref name="SLINK">{{cite journal | author=R. Sibson | title=SLINK: an optimally efficient algorithm for the single-link cluster method | journal=The Computer Journal | volume=16 | issue=1 | pages=30–34 | year=1973 | publisher=British Computer Society | url=http://www.cs.gsu.edu/~wkim/index_files/papers/sibson.pdf | doi=10.1093/comjnl/16.1.30| doi-access=free }}</ref> for [[Single-linkage clustering|single-linkage]] and CLINK<ref name="CLINK">{{cite journal | author=D. Defays | title=An efficient algorithm for a complete-link method | journal=The Computer Journal | volume=20 | issue=4 | pages=364–6 | year=1977 | publisher=British Computer Society | doi=10.1093/comjnl/20.4.364| doi-access= }}</ref> for [[complete-linkage clustering]]. With a [[heap (data structure)|heap]], the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math>, an improvement on the aforementioned bound of <math>\mathcal{O}(n^3)</math>, at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable. Methods exist which use [[quadtree]]s that demonstrate <math>\mathcal{O}(n^2)</math> total running time with <math>\mathcal{O}(n)</math> space.<ref name=DE>{{Cite journal |last=Eppstein |first=David |date=2001-12-31 |title=Fast hierarchical clustering and other applications of dynamic closest pairs |url=https://dl.acm.org/doi/10.1145/351827.351829 |journal=ACM Journal of Experimental Algorithmics |volume=5 |pages=1–es |doi=10.1145/351827.351829 |issn=1084-6654|arxiv=cs/9912014 }}</ref>
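
The cubic running time of the standard algorithm can be seen in a direct implementation, sketched below for single linkage. This naive version is given only as an illustration (it is not the SLINK algorithm): it performs <math>n-1</math> merges and rescans the inter-cluster distance matrix before each one.

<syntaxhighlight lang="python">
# Naive agglomerative clustering: n-1 merges, each preceded by an O(n^2) scan of
# the distance matrix, giving O(n^3) time overall. Illustrative sketch only;
# SLINK achieves O(n^2) for single linkage with a more careful representation.
import numpy as np

def naive_single_linkage(D):
    """Cluster from a full symmetric distance matrix; return the merge history."""
    D = np.asarray(D, dtype=float).copy()
    np.fill_diagonal(D, np.inf)
    active = list(range(len(D)))
    merges = []
    while len(active) > 1:
        # O(n^2) scan for the closest pair of active clusters.
        d, i, j = min((D[a, b], a, b) for a in active for b in active if a < b)
        merges.append((i, j, d))
        # Single linkage: the distance from the merged cluster to any other
        # cluster is the minimum of the two previous distances (O(n) update).
        for k in active:
            if k != i and k != j:
                D[i, k] = D[k, i] = min(D[i, k], D[j, k])
        active.remove(j)
    return merges

D = np.array([[0, 2, 9], [2, 0, 5], [9, 5, 0]], dtype=float)
print(naive_single_linkage(D))   # [(0, 1, 2.0), (0, 2, 5.0)]
</syntaxhighlight>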
 
Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as [[k-means clustering|''k''-means]].
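
One such heuristic, in the spirit of bisecting ''k''-means, is sketched below under the assumption that scikit-learn is available; splitting the largest remaining cluster with 2-means and stopping at a fixed number of clusters are illustrative choices rather than a standard specification.

<syntaxhighlight lang="python">
# Sketch of divisive (top-down) clustering via recursive 2-means splits.
# Splitting the largest cluster and stopping at a fixed count are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def divisive_kmeans(X, n_clusters):
    clusters = [np.arange(len(X))]          # start with everything in one cluster
    while len(clusters) < n_clusters:
        # Choose the largest cluster to split next (a simple heuristic).
        idx = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.random.RandomState(0).rand(20, 2)
for c in divisive_kmeans(X, 4):
    print(sorted(c.tolist()))
</syntaxhighlight>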
|<math>d(i\cup j, k) = d(m_{i\cup j}, m_k)</math> where <math>m_{i\cup j} = \tfrac{1}{2}\left(m_i + m_j\right)</math>
|-
|Versatile linkage clustering<ref name=":6">{{cite journal | doi=10.1007/s00357-019-09339-z | last1=Fernández | first1=Alberto | last2=Gómez | first2=Sergio | title=Versatile linkage: a family of space-conserving strategies for agglomerative hierarchical clustering | journal=Journal of Classification | volume=37 | year=2020 | issue=3 | pages=584–597| arxiv=1906.09222 | s2cid=195317052 }}</ref>
| <math>\sqrt[p]{\frac{1}{|A|\cdot|B|} \sum_{a \in A }\sum_{ b \in B} d(a,b)^p}, p\neq 0</math>
|-
|[[Ward's method|Ward linkage]],<ref name="wards method">{{cite journal |last=Ward |first=Joe H. |year=1963 |title=Hierarchical Grouping to Optimize an Objective Function |journal=Journal of the American Statistical Association |volume=58 |issue=301 |pages=236–244 |doi=10.2307/2282967 |jstor=2282967 |mr=0148188}}</ref> Minimum Increase of Sum of Squares (MISSQ)<ref name=":0">{{Citation |last=Podani |first=János |title=New combinatorial clustering methods |date=1989 |url=https://doi.org/10.1007/978-94-009-2432-1_5 |work=Numerical syntaxonomy |pages=61–77 |editor-last=Mucina |editor-first=L. |place=Dordrecht |publisher=Springer Netherlands |language=en |doi=10.1007/978-94-009-2432-1_5 |isbn=978-94-009-2432-1 |access-date=2022-11-04 |editor2-last=Dale |editor2-first=M. B.|url-access=subscription }}</ref>
|<math>\frac{|A|\cdot|B|}{|A\cup B|} \lVert \mu_A - \mu_B \rVert ^2
= \sum_{x\in A\cup B} \lVert x - \mu_{A\cup B} \rVert^2
- \sum_{x\in A} \lVert x - \mu_{A} \rVert^2
- \sum_{x\in B} \lVert x - \mu_{B} \rVert^2</math>
|-
|Minimum Sum Increase Medoid linkage
|<math>\min_{m\in A\cup B} \sum_{y\in A\cup B} d(m,y)
- \min_{m\in A} \sum_{y\in A} d(m,y)
- \min_{m\in B} \sum_{y\in B} d(m,y)</math>
|-
|Medoid linkage<ref>{{Cite conference |last1=Miyamoto |first1=Sadaaki |last2=Kaizu |first2=Yousuke |last3=Endo |first3=Yasunori |date=2016 |title=Hierarchical and Non-Hierarchical Medoid Clustering Using Asymmetric Similarity Measures |url=https://ieeexplore.ieee.org/document/7801678 |conference=2016 Joint 8th International Conference on Soft Computing and Intelligent Systems (SCIS) and 17th International Symposium on Advanced Intelligent Systems (ISIS) |pages=400–403 |doi=10.1109/SCIS-ISIS.2016.0091}}</ref><ref>{{Cite conference |date=2016 |title=Visual Clutter Reduction through Hierarchy-based Projection of High-dimensional Labeled Data| conference=Graphics Interface |url=https://graphicsinterface.org/wp-content/uploads/gi2016-14.pdf | first1=Dominik|last1=Herr|first2=Qi|last2=Han|first3=Steffen|last3=Lohmann| first4=Thomas |last4=Ertl |access-date=2022-11-04 |website=Graphics Interface |language=en-CA |doi=10.20380/gi2016.14}}</ref>
|<math>d(m_A, m_B)</math> where <math>m_A</math>, <math>m_B</math> are the medoids of the two clusters being merged
|-
|}

Other linkage criteria include:
* The probability that candidate clusters spawn from the same distribution function (V-linkage).
* The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage).<ref>{{Cite book|last1=Zhang|first1=Wei|last2=Wang|first2=Xiaogang|last3=Zhao|first3=Deli|last4=Tang|first4=Xiaoou|title=Computer Vision – ECCV 2012 |chapter=Graph Degree Linkage: Agglomerative Clustering on a Directed Graph |date=2012|editor-last=Fitzgibbon|editor-first=Andrew|editor2-last=Lazebnik|editor2-first=Svetlana|editor2-link= Svetlana Lazebnik |editor3-last=Perona|editor3-first=Pietro|editor4-last=Sato|editor4-first=Yoichi|editor5-last=Schmid|editor5-first=Cordelia|series=Lecture Notes in Computer Science|language=en|publisher=Springer Berlin Heidelberg|volume=7572|pages=428–441|doi=10.1007/978-3-642-33718-5_31|isbn=9783642337185|arxiv=1208.5092|bibcode=2012arXiv1208.5092Z|s2cid=14751}} See also: https://github.com/waynezhanghk/gacluster</ref>
* The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.<ref>{{cite journal |first1=W. |last1=Zhang |first2=D. |last2=Zhao |first3=X. |last3=Wang |title=Agglomerative clustering via maximum incremental path integral |journal=Pattern Recognition |volume=46 |issue=11 |pages=3056–65 |date=2013 |doi=10.1016/j.patcog.2013.04.013 |citeseerx=10.1.1.719.5355 |bibcode=2013PatRe..46.3056Z}}</ref><ref>{{cite book |last1=Zhao |first1=D. |last2=Tang |first2=X. |chapter=Cyclizing clusters via zeta function of a graph |chapter-url= |title=NIPS'08: Proceedings of the 21st International Conference on Neural Information Processing Systems |date=2008 |isbn=9781605609492 |pages=1953–60 |publisher=Curran |citeseerx=10.1.1.945.1649}}</ref><ref>{{cite journal |first1=Y. |last1=Ma |first2=H. |last2=Derksen |first3=W. |last3=Hong |first4=J. |last4=Wright |title=Segmentation of Multivariate Mixed Data via Lossy Data Coding and Compression |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=29 |issue=9 |pages=1546–62 |date=2007 |doi=10.1109/TPAMI.2007.1085 |pmid=17627043 |bibcode=2007ITPAM..29.1546M |hdl=2142/99597 |s2cid=4591894 |hdl-access=free }}</ref>
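
As an illustration of how such inter-cluster distances are evaluated, the sketch below implements two of the formulas from the table above, versatile linkage and medoid linkage, for clusters given as arrays of points. The use of Euclidean point distance and the sample clusters are assumptions for the example.

<syntaxhighlight lang="python">
# Direct evaluation of two inter-cluster distances listed above:
# versatile linkage (a power mean of all pairwise distances, p != 0) and
# medoid linkage. Euclidean point distance is an illustrative assumption.
import numpy as np

def versatile_linkage(A, B, p):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # all pairwise distances
    return np.mean(d ** p) ** (1.0 / p)                         # p = 1 gives average linkage

def medoid(C):
    # Medoid: the member minimising the sum of distances to the other members.
    d = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1)
    return C[np.argmin(d.sum(axis=1))]

def medoid_linkage(A, B):
    return np.linalg.norm(medoid(A) - medoid(B))

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0]])
print(versatile_linkage(A, B, p=2), medoid_linkage(A, B))
</syntaxhighlight>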
 
== Agglomerative clustering example ==