Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a [[distance matrix|matrix of distances]]. On the other hand, except for the special case of single-linkage distance, none of the algorithms (other than exhaustive search in <math>\mathcal{O}(2^n)</math>) can be guaranteed to find the optimum solution.
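For instance, most implementations only need to be handed the pairwise dissimilarities. The following is a minimal illustrative sketch (assuming Python with SciPy; the toy data, Manhattan distance, and average linkage are arbitrary choices, not prescribed by the method):

<syntaxhighlight lang="python">
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Toy observations; any dissimilarity measure can be used here.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])

d = pdist(X, metric='cityblock')   # condensed matrix of pairwise distances
Z = linkage(d, method='average')   # agglomerative merge tree (dendrogram data)

labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)                      # e.g. [1 1 2 2]
</syntaxhighlight>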
== Complexity ==
The standard algorithm for '''hierarchical agglomerative clustering''' (HAC) has a [[time complexity]] of <math>\mathcal{O}(n^3)</math> and requires <math>\Omega(n^2)</math> memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: '''SLINK'''<!--boldface per WP:R#PLA--><ref name="SLINK">{{cite journal | author=R. Sibson | title=SLINK: an optimally efficient algorithm for the single-link cluster method | journal=The Computer Journal | volume=16 | issue=1 | pages=30–34 | year=1973 | publisher=British Computer Society | url=http://www.cs.gsu.edu/~wkim/index_files/papers/sibson.pdf | doi=10.1093/comjnl/16.1.30| doi-access=free }}</ref> for [[Single-linkage clustering|single-linkage]] and CLINK<ref name="CLINK">{{cite journal | author=D. Defays | title=An efficient algorithm for a complete-link method | journal=The Computer Journal | volume=20 | issue=4 | pages=364–6 | year=1977 | publisher=British Computer Society | doi=10.1093/comjnl/20.4.364| doi-access= }}</ref> for [[complete-linkage clustering]]. With a [[heap (data structure)|heap]], the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math>, an improvement on the aforementioned bound of <math>\mathcal{O}(n^3)</math>, at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable. Methods exist which use [[
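To illustrate where the cubic bound comes from, here is a rough sketch of the standard agglomerative procedure (not one of the optimized algorithms above; single linkage and a precomputed full distance matrix are assumed for simplicity, and the names are illustrative only). Each of the <math>n-1</math> merges rescans an <math>\mathcal{O}(n^2)</math> distance matrix:

<syntaxhighlight lang="python">
import numpy as np

def naive_single_linkage(D):
    """Sketch of the standard agglomerative procedure on a full, symmetric
    n x n distance matrix D.  Each of the n-1 merges scans the whole matrix,
    giving the O(n^3) time / O(n^2) memory behaviour discussed above.
    Returns the merges as (cluster_a, cluster_b, distance)."""
    D = np.array(D, dtype=float)
    active = list(range(len(D)))          # indices of clusters still alive
    merges = []
    while len(active) > 1:
        # O(n^2) scan for the closest pair of active clusters
        a, b = min(((x, y) for i, x in enumerate(active) for y in active[i + 1:]),
                   key=lambda p: D[p[0], p[1]])
        merges.append((a, b, D[a, b]))
        # single linkage: the merged cluster keeps the minimum distances
        for c in active:
            if c != a and c != b:
                D[a, c] = D[c, a] = min(D[a, c], D[b, c])
        active.remove(b)                  # cluster b is absorbed into a
    return merges
</syntaxhighlight>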
Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as [[k-means clustering|''k''-means]].
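One common shortcut of this kind is bisecting ''k''-means: each step splits one existing cluster in two with a flat 2-means run rather than searching all possible splits. A minimal sketch (assuming Python with scikit-learn; the rule "always split the largest cluster until ''k'' clusters exist" is just one possible stopping criterion):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Top-down (divisive) clustering sketch: repeatedly split the
    largest cluster with 2-means until k clusters remain."""
    clusters = [X]
    while len(clusters) < k:
        # pick the largest remaining cluster and split it in two
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        part = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(part)
        clusters += [part[labels == 0], part[labels == 1]]
    return clusters
</syntaxhighlight>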
## Find the current cluster with 2 or more objects that has the largest diameter: <math>C_* = \arg\max_{C\in \mathcal{C}} \max_{i_1,i_2\in C} \delta(i_1,i_2)</math>
## Find the object in this cluster with the most dissimilarity to the rest of the cluster: <math>i^* = \arg\max_{i\in C_*} \frac{1}{|C_*|-1}\sum_{j\in C_*\setminus\{i\}} \delta(i,j)</math>
## Pop <math>i^*</math> from its old cluster <math>C_*</math> and put it into a new cluster <math>C_\textrm{new}</math>.
## As long as <math>C_*</math> is not empty, keep migrating objects from <math>C_*</math> to add them to <math>C_\textrm{new}</math>. To choose which objects to migrate, do not consider only dissimilarity to <math>C_*</math>, but also adjust for dissimilarity to the splinter group: let <math>i^* = \arg\max_{i\in C_*} D(i)</math> where we define <math>D(i) = \frac{1}{|C_*|-1}\sum_{j\in C_*\setminus\{i\}} \delta(i,j) - \frac{1}{|C_\textrm{new}|}\sum_{j\in C_\textrm{new}} \delta(i,j)</math>, then either stop iterating when <math>D(i^*) < 0</math>, or migrate <math>i^*</math> (a code sketch of this splitting step follows the list).
## Add <math>C_\textrm{new}</math> to <math>\mathcal{C}</math>.
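A rough sketch of the splitting step described in the list above (assuming Python with NumPy, a full pairwise dissimilarity matrix for <math>\delta</math>, and a cluster given as a list of object indices; function and variable names are illustrative only):

<syntaxhighlight lang="python">
import numpy as np

def diana_split(delta, cluster):
    """One splitting step as described in the list above.

    delta is a full pairwise dissimilarity matrix and `cluster` a list of
    object indices.  The object with the largest average dissimilarity to
    the rest seeds the splinter group C_new; objects keep migrating from
    the old cluster while D(i*) stays positive."""
    old = list(cluster)
    # seed C_new with the object most dissimilar, on average, to the rest
    avg = [delta[i, [j for j in old if j != i]].mean() for i in old]
    new = [old.pop(int(np.argmax(avg)))]
    while len(old) > 1:
        # D(i): mean dissimilarity to the rest of the old cluster
        #       minus mean dissimilarity to the splinter group C_new
        scores = [delta[i, [j for j in old if j != i]].mean()
                  - delta[i, new].mean()
                  for i in old]
        best = int(np.argmax(scores))
        if scores[best] < 0:           # no object prefers the splinter group
            break
        new.append(old.pop(best))
    return old, new                    # (what remains of C_*, C_new)
</syntaxhighlight>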