Content deleted Content added
Source correctness and complexity |
Explain Lance–Williams nomenclature |
||
Line 126:
| year = 2011}}.</ref> Alternatively, this distance can be seen as the difference in [[:en:k-means clustering|k-means cost]] between the new cluster and the two old clusters.
Ward's distance is also reducible, as can be seen more easily from a different formula
| last1 = Lance | first1 = G. N.
| last2 = Williams | first2 = W. T. | author2-link = W. T. Williams
Line 137:
| year = 1967}}.</ref>
:<math>d(A\cup B,C) = \frac{n_A+n_C}{n_A+n_B+n_C} d(A,C) + \frac{n_B+n_C}{n_A+n_B+n_C} d(B,C) - \frac{n_C}{n_A+n_B+n_C} d(A,B).</math>
Distance update formulas such as this one are called formulas "of Lance–Williams type after the work of {{harvtxt|Lance|Williams|1967}}.
If <math>d(A,B)</math> is the smallest of the three distances on the right hand side (as would necessarily be true if <math>A</math> and <math>B</math> are mutual nearest-neighbors) then the negative contribution from its term is cancelled by the <math>n_C</math> coefficient of one of the two other terms, leaving a positive value added to the weighted average of the other two distances. Therefore, the combined distance is always at least as large as the minimum of <math>d(A,C)</math> and <math>d(B,C)</math>, meeting the definition of reducibility.
Line 142 ⟶ 143:
===Complete linkage and average distance===
[[Complete-linkage clustering|Complete-linkage]] or furthest-neighbor clustering is a form of agglomerative clustering that defines the dissimilarity between clusters to be the maximum distance between any two points from the two clusters. Similarly, average-distance clustering uses the average pairwise distance as the dissimilarity. Like Ward's distance, these two forms of clustering obey a formula of
Unlike Ward's method, these two forms of clustering do not have a constant-time method for computing distances between pairs of clusters. Instead it is possible to maintain an array of distances between all pairs of clusters. Whenever two clusters are merged, the formula can be used to compute the distance between the merged cluster and all other clusters. Maintaining this array over the course of the clustering algorithm takes time and space {{math|''O''(''n''<sup>2</sup>)}}. The nearest-neighbor chain algorithm may be used in conjunction with this array of distances to find the same clustering as the greedy algorithm for these cases. Its total time and space, using this array, is also {{math|''O''(''n''<sup>2</sup>)}}.<ref name="gm07"/>
|