Hierarchical clustering: Difference between revisions

Moved second half of the lead to a section.
|chapter=8. Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a [[dendrogram]].
 
Except for the special case of single-linkage, none of the algorithms (apart from exhaustive search in <math>\mathcal{O}(2^n)</math>) can be guaranteed to find the optimum solution.
 
== Complexity ==
Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a [[distance matrix|matrix of distances]].
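
As a concrete illustration, hierarchical clustering can be run on a distance matrix alone, with no underlying observations. The following minimal sketch assumes SciPy and a toy matrix, neither of which is prescribed by the article:

<syntaxhighlight lang="python">
# A minimal sketch assuming SciPy (the article prescribes no library):
# agglomerative clustering driven purely by a matrix of pairwise
# distances, with no access to the original observations.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Any valid distance matrix will do: symmetric with a zero diagonal.
D = np.array([[ 0.0,  2.0,  6.0, 10.0],
              [ 2.0,  0.0,  5.0,  9.0],
              [ 6.0,  5.0,  0.0,  4.0],
              [10.0,  9.0,  4.0,  0.0]])

# linkage() takes the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method='average')

# Cut the resulting dendrogram into two flat clusters: prints [1 1 2 2].
print(fcluster(Z, t=2, criterion='maxclust'))
</syntaxhighlight>
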
The standard algorithm for '''hierarchical agglomerative clustering''' (HAC) has a [[time complexity]] of <math>\mathcal{O}(n^3)</math> and requires <math>\Omega(n^2)</math> memory, which makes it too slow for even medium-sized data sets. However, for some special cases, optimally efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: '''SLINK'''<!--boldface per WP:R#PLA--><ref name="SLINK">{{cite journal | author=R. Sibson | title=SLINK: an optimally efficient algorithm for the single-link cluster method | journal=The Computer Journal | volume=16 | issue=1 | pages=30–34 | year=1973 | publisher=British Computer Society | url=http://www.cs.gsu.edu/~wkim/index_files/papers/sibson.pdf | doi=10.1093/comjnl/16.1.30| doi-access=free }}</ref> for [[Single-linkage clustering|single-linkage]] and CLINK<ref name="CLINK">{{cite journal | author=D. Defays | title=An efficient algorithm for a complete-link method | journal=The Computer Journal | volume=20 | issue=4 | pages=364–366 | year=1977 | publisher=British Computer Society | doi=10.1093/comjnl/20.4.364| doi-access=free }}</ref> for [[complete-linkage clustering]]. With a [[heap (data structure)|heap]], the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math>, an improvement on the aforementioned bound of <math>\mathcal{O}(n^3)</math>, at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable.
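
The cubic cost is easy to see in code: the procedure performs <math>n-1</math> merges, and each merge rescans the pairwise-distance table for the closest pair. The sketch below (pure Python, with an assumed single-linkage update rule) is illustrative, not a reference implementation:

<syntaxhighlight lang="python">
# A sketch of the standard O(n^3) agglomerative procedure: n - 1 merge
# steps, each performing an O(n^2) scan of the distance table. All
# names and the single-linkage update rule are illustrative assumptions.
def naive_hac(dist):
    """dist: symmetric list-of-lists (or array) of pairwise distances."""
    n = len(dist)
    active = set(range(n))                    # current cluster ids
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    while len(active) > 1:
        # O(n^2) scan for the closest pair of active clusters.
        a, b = min(d, key=d.get)
        merges.append((a, b, d[a, b]))
        # Single-linkage update: distance to the merged cluster is the
        # minimum of the distances to its two parts.
        for c in active - {a, b}:
            i, j = sorted((a, c))
            k, l = sorted((b, c))
            d[(i, j)] = min(d[(i, j)], d[(k, l)])
        active.remove(b)                      # cluster a absorbs cluster b
        d = {p: v for p, v in d.items() if b not in p}
    return merges                             # the n - 1 merge steps
</syntaxhighlight>

Keeping the candidate pairs in a heap replaces the linear-time minimum scan with logarithmic-time extraction, which is how the <math>\mathcal{O}(n^2 \log n)</math> variant mentioned above is obtained; the heap itself must hold up to <math>\mathcal{O}(n^2)</math> entries, which is the source of the extra memory cost.
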
 
Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as [[k-means clustering|''k''-means]].
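
For illustration, here is a top-down sketch that uses 2-means to choose each split. The library routine, the "split the largest cluster" policy, and all names are assumptions made for this example; the article only notes that ''k''-means-style heuristics are common:

<syntaxhighlight lang="python">
# A hedged sketch of divisive clustering: instead of the O(2^n)
# exhaustive search over splits, recursively bisect with 2-means.
import numpy as np
from scipy.cluster.vq import kmeans2

def divisive(X, n_clusters):
    clusters = [np.arange(len(X))]        # start: one cluster of all points
    while len(clusters) < n_clusters:
        # Heuristic policy (an assumption): always split the largest cluster.
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        _, labels = kmeans2(X[idx], 2, minit='++')
        # A degenerate 2-means split can yield an empty half; a robust
        # implementation would guard against that.
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

# Two well-separated Gaussian blobs split back apart: prints [20, 20].
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
print(sorted(len(c) for c in divisive(X, 2)))
</syntaxhighlight>
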
 
== Cluster linkage ==