In [[data mining]] and [[statistics]], '''hierarchical clustering'''<ref name="HC">{{cite book |first=Frank |last=Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer |isbn=978-3-319-21903-5 |pages=195–211
|chapter=8. Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> (also called '''hierarchical cluster analysis''' or '''HCA''') is a method of [[cluster analysis]] that seeks to build a [[hierarchy]] of clusters. Strategies for hierarchical clustering generally fall into two categories:
* '''Agglomerative''': Agglomerative clustering, often referred to as a "bottom-up" approach, begins with each data point as an individual cluster. At each step, the algorithm merges the two most similar clusters based on a chosen distance metric (e.g., Euclidean distance) and linkage criterion (e.g., single-linkage, complete-linkage)<ref name=":4">{{Cite journal |last=Murtagh |first=Fionn |last2=Contreras |first2=Pedro |date=2012 |title=Algorithms for hierarchical clustering: an overview |url=https://wires.onlinelibrary.wiley.com/doi/10.1002/widm.53 |journal=WIREs Data Mining and Knowledge Discovery |language=en |volume=2 |issue=1 |pages=86–97 |doi=10.1002/widm.53 |issn=1942-4795|url-access=subscription }}</ref>. This process continues until all data points are combined into a single cluster or a stopping criterion is met. Agglomerative methods are more commonly used due to their simplicity and computational efficiency on small to medium-sized datasets (see the sketch following this list).<ref>{{Cite journal |last=Mojena |first=R. |date=1977-04-01 |title=Hierarchical grouping methods and stopping rules: an evaluation |url=https://academic.oup.com/comjnl/article-lookup/doi/10.1093/comjnl/20.4.359 |journal=The Computer Journal |language=en |volume=20 |issue=4 |pages=359–363 |doi=10.1093/comjnl/20.4.359 |issn=0010-4620}}</ref>
* '''Divisive''': Divisive clustering, known as a "top-down" approach, starts with all data points in a single cluster and recursively splits clusters into smaller ones. At each step, the algorithm selects a cluster and divides it into two or more subsets, often using a criterion such as maximizing the distance between the resulting clusters. Divisive methods are less common but can be useful when the goal is to identify large, distinct clusters first.
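
For illustration, the agglomerative procedure can be sketched in Python. The following is a minimal sketch, assuming the SciPy library and an arbitrary set of six toy points (neither is prescribed by this article); it builds the merge hierarchy under Euclidean distance with complete linkage, then cuts it into two flat clusters.

<syntaxhighlight lang="python">
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six toy 2-D points forming two visually separated groups
# (illustrative data, not from the article).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [4.0, 4.0], [4.1, 4.2], [3.9, 4.1]])

# Bottom-up merging: each row of Z records one merge step, namely
# the two cluster indices joined, their linkage distance, and the
# size of the newly formed cluster.
Z = linkage(X, method='complete', metric='euclidean')

# Stop the hierarchy at two clusters and read off flat labels.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 2 2 2]
</syntaxhighlight>

Swapping <code>method='complete'</code> for <code>'single'</code>, <code>'average'</code>, or <code>'ward'</code> selects a different linkage criterion from the table below.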
| <math>\sqrt[p]{\frac{1}{|A|\cdot|B|} \sum_{a \in A }\sum_{ b \in B} d(a,b)^p}, p\neq 0</math>
|-
|[[Ward's method|Ward linkage]],<ref name="wards method">{{cite journal |last=Ward |first=Joe H. |year=1963 |title=Hierarchical Grouping to Optimize an Objective Function |journal=Journal of the American Statistical Association |volume=58 |issue=301 |pages=236–244 |doi=10.2307/2282967 |jstor=2282967 |mr=0148188}}</ref> Minimum Increase of Sum of Squares (MISSQ)<ref name=":0">{{Citation |last=Podani |first=János |title=New combinatorial clustering methods |date=1989 |url=https://doi.org/10.1007/978-94-009-2432-1_5 |work=Numerical syntaxonomy |pages=61–77 |editor-last=Mucina |editor-first=L. |place=Dordrecht |publisher=Springer Netherlands |language=en |doi=10.1007/978-94-009-2432-1_5 |isbn=978-94-009-2432-1 |access-date=2022-11-04 |editor2-last=Dale |editor2-first=M. B.|url-access=subscription }}</ref>
|<math>\frac{|A|\cdot|B|}{|A\cup B|} \lVert \mu_A - \mu_B \rVert ^2
= \sum_{x\in A\cup B} \lVert x - \mu_{A\cup B} \rVert^2
- \sum_{x\in A} \lVert x - \mu_A \rVert^2
- \sum_{x\in B} \lVert x - \mu_B \rVert^2</math>
|-
|Minimum Sum Increase Medoid linkage
|<math>\min_{m\in A\cup B} \sum_{y\in A\cup B} d(m,y)
- \min_{m\in A} \sum_{y\in A} d(m,y)
- \min_{m\in B} \sum_{y\in B} d(m,y)</math>
|-
|Medoid linkage<ref>{{Cite conference |last1=Miyamoto |first1=Sadaaki |last2=Kaizu |first2=Yousuke |last3=Endo |first3=Yasunori |date=2016 |title=Hierarchical and Non-Hierarchical Medoid Clustering Using Asymmetric Similarity Measures |url=https://ieeexplore.ieee.org/document/7801678 |conference=2016 Joint 8th International Conference on Soft Computing and Intelligent Systems (SCIS) and 17th International Symposium on Advanced Intelligent Systems (ISIS) |pages=400–403 |doi=10.1109/SCIS-ISIS.2016.0091|url-access=subscription }}</ref><ref>{{Cite conference |date=2016 |title=Visual Clutter Reduction through Hierarchy-based Projection of High-dimensional Labeled Data| conference=Graphics Interface |url=https://graphicsinterface.org/wp-content/uploads/gi2016-14.pdf | first1=Dominik|last1=Herr|first2=Qi|last2=Han|first3=Steffen|last3=Lohmann| first4=Thomas |last4=Ertl |access-date=2022-11-04 |website=Graphics Interface |language=en-CA |doi=10.20380/gi2016.14}}</ref>
|<math>d(m_A, m_B)</math> where <math>m_A</math>, <math>m_B</math> are the medoids of the previous clusters
|}
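
The linkage formulas above can be checked numerically. The following is a minimal sketch in Python with NumPy, using two arbitrarily chosen toy clusters; it verifies the Ward/MISSQ identity and evaluates the generalized average linkage and the medoid linkage.

<syntaxhighlight lang="python">
import numpy as np

# Two toy clusters (illustrative data, not from the article).
A = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
B = np.array([[5.0, 5.0], [7.0, 5.0]])
AB = np.vstack([A, B])

def sse(C):
    """Sum of squared distances of the points in C to their centroid."""
    return np.sum(np.linalg.norm(C - C.mean(axis=0), axis=1) ** 2)

# Ward / MISSQ: the weighted squared centroid distance equals the
# increase in total within-cluster sum of squares caused by the merge.
lhs = len(A) * len(B) / len(AB) * np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)) ** 2
rhs = sse(AB) - sse(A) - sse(B)
print(np.isclose(lhs, rhs))  # True

# Generalized average linkage: p-th power mean of all pairwise
# distances, p != 0; p = 1 recovers ordinary average linkage (UPGMA).
def generalized_average(A, B, p=1):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return np.mean(d ** p) ** (1.0 / p)

# Medoid linkage: distance between the clusters' medoids, a medoid
# being the member minimizing the summed distance to its own cluster.
def medoid(C):
    d = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    return C[np.argmin(d.sum(axis=1))]

print(generalized_average(A, B, p=2))         # power-mean linkage
print(np.linalg.norm(medoid(A) - medoid(B)))  # medoid linkage
</syntaxhighlight>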