Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a [[distance matrix|matrix of distances]]. On the other hand, except for the special case of single-linkage distance, none of the algorithms (other than exhaustive search in <math>\mathcal{O}(2^n)</math>) can be guaranteed to find the optimum solution.
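For instance, most implementations only need to be handed the pairwise dissimilarities. The following is a minimal illustrative sketch (assuming Python with SciPy; the toy data, Manhattan distance, and average linkage are arbitrary choices, not prescribed by the method):

<syntaxhighlight lang="python">
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Toy observations; any dissimilarity measure can be used here.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])

d = pdist(X, metric='cityblock')   # condensed matrix of pairwise distances
Z = linkage(d, method='average')   # agglomerative merge tree (dendrogram data)

labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)                      # e.g. [1 1 2 2]
</syntaxhighlight>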
== Complexity ==
The standard algorithm for '''hierarchical agglomerative clustering''' (HAC) has a [[time complexity]] of <math>\mathcal{O}(n^3)</math> and requires <math>\Omega(n^2)</math> memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: '''SLINK'''<!--boldface per WP:R#PLA--><ref name="SLINK">{{cite journal | author=R. Sibson | title=SLINK: an optimally efficient algorithm for the single-link cluster method | journal=The Computer Journal | volume=16 | issue=1 | pages=30–34 | year=1973 | publisher=British Computer Society | url=http://www.cs.gsu.edu/~wkim/index_files/papers/sibson.pdf | doi=10.1093/comjnl/16.1.30| doi-access=free }}</ref> for [[Single-linkage clustering|single-linkage]] and CLINK<ref name="CLINK">{{cite journal | author=D. Defays | title=An efficient algorithm for a complete-link method | journal=The Computer Journal | volume=20 | issue=4 | pages=364–6 | year=1977 | publisher=British Computer Society | doi=10.1093/comjnl/20.4.364| doi-access= }}</ref> for [[complete-linkage clustering]]. With a [[heap (data structure)|heap]], the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math>, an improvement on the aforementioned bound of <math>\mathcal{O}(n^3)</math>, at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable. Methods exist which use [[
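To illustrate where the cubic bound comes from, here is a rough sketch of the standard agglomerative procedure (not one of the optimized algorithms above; single linkage and a precomputed full distance matrix are assumed for simplicity, and the names are illustrative only). Each of the <math>n-1</math> merges rescans an <math>\mathcal{O}(n^2)</math> distance matrix:

<syntaxhighlight lang="python">
import numpy as np

def naive_single_linkage(D):
    """Sketch of the standard agglomerative procedure on a full, symmetric
    n x n distance matrix D.  Each of the n-1 merges scans the whole matrix,
    giving the O(n^3) time / O(n^2) memory behaviour discussed above.
    Returns the merges as (cluster_a, cluster_b, distance)."""
    D = np.array(D, dtype=float)
    active = list(range(len(D)))          # indices of clusters still alive
    merges = []
    while len(active) > 1:
        # O(n^2) scan for the closest pair of active clusters
        a, b = min(((x, y) for i, x in enumerate(active) for y in active[i + 1:]),
                   key=lambda p: D[p[0], p[1]])
        merges.append((a, b, D[a, b]))
        # single linkage: the merged cluster keeps the minimum distances
        for c in active:
            if c != a and c != b:
                D[a, c] = D[c, a] = min(D[a, c], D[b, c])
        active.remove(b)                  # cluster b is absorbed into a
    return merges
</syntaxhighlight>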
Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as [[k-means clustering|''k''-means]].
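One common shortcut of this kind is bisecting ''k''-means: each step splits one existing cluster in two with a flat 2-means run rather than searching all possible splits. A minimal sketch (assuming Python with scikit-learn; the rule "always split the largest cluster until ''k'' clusters exist" is just one possible stopping criterion):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Top-down (divisive) clustering sketch: repeatedly split the
    largest cluster with 2-means until k clusters remain."""
    clusters = [X]
    while len(clusters) < k:
        # pick the largest remaining cluster and split it in two
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        part = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(part)
        clusters += [part[labels == 0], part[labels == 1]]
    return clusters
</syntaxhighlight>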
## Find the current cluster with 2 or more objects that has the largest diameter: <math>C_* = \arg\max_{C\in \mathcal{C}} \max_{i_1,i_2\in C} \delta(i_1,i_2)</math>
## Find the object in this cluster with the most dissimilarity to the rest of the cluster: <math>i^* = \arg\max_{i\in C_*} \frac{1}{|C_*|-1}\sum_{j\in C_*\setminus\{i\}} \delta(i,j)</math>
## Pop <math>i^*</math> from its old cluster <math>C_*</math> and put it into a new cluster <math>C_\textrm{new}</math>.
## As long as <math>C_*</math> is not empty, keep migrating objects from <math>C_*</math> to add them to <math>C_\textrm{new}</math>. To choose which objects to migrate, do not consider only dissimilarity to <math>C_*</math>, but also adjust for dissimilarity to the splinter group: let <math>i^* = \arg\max_{i\in C_*} D(i)</math> where we define <math>D(i) = \frac{1}{|C_*|-1}\sum_{j\in C_*\setminus\{i\}} \delta(i,j) - \frac{1}{|C_\textrm{new}|}\sum_{j\in C_\textrm{new}} \delta(i,j)</math>, then either stop iterating when <math>D(i^*) < 0</math>, or migrate <math>i^*</math> (a code sketch of this splitting step follows the list).
## Add <math>C_\textrm{new}</math> to <math>\mathcal{C}</math>.
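A rough sketch of the splitting step described in the list above (assuming Python with NumPy, a full pairwise dissimilarity matrix for <math>\delta</math>, and a cluster given as a list of object indices; function and variable names are illustrative only):

<syntaxhighlight lang="python">
import numpy as np

def diana_split(delta, cluster):
    """One splitting step as described in the list above.

    delta is a full pairwise dissimilarity matrix and `cluster` a list of
    object indices.  The object with the largest average dissimilarity to
    the rest seeds the splinter group C_new; objects keep migrating from
    the old cluster while D(i*) stays positive."""
    old = list(cluster)
    # seed C_new with the object most dissimilar, on average, to the rest
    avg = [delta[i, [j for j in old if j != i]].mean() for i in old]
    new = [old.pop(int(np.argmax(avg)))]
    while len(old) > 1:
        # D(i): mean dissimilarity to the rest of the old cluster
        #       minus mean dissimilarity to the splinter group C_new
        scores = [delta[i, [j for j in old if j != i]].mean()
                  - delta[i, new].mean()
                  for i in old]
        best = int(np.argmax(scores))
        if scores[best] < 0:           # no object prefers the splinter group
            break
        new.append(old.pop(best))
    return old, new                    # (what remains of C_*, C_new)
</syntaxhighlight>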