==Background==
[[File:Hierarchical clustering diagram.png|thumb|upright=1.35|A hierarchical clustering of six points. The points to be clustered are at the top of the diagram, and the nodes below them represent clusters.]]
Many problems in [[data analysis]] concern [[Cluster analysis|clustering]], grouping data items into clusters of closely related items. [[Hierarchical clustering]] is a version of cluster analysis in which the clusters form a hierarchy or tree-like structure rather than a strict partition of the data items. In some cases, this type of clustering is used as a way of performing cluster analysis at multiple different scales simultaneously. In others, the data to be analyzed naturally has an unknown tree structure, and the goal is to recover that structure by performing the analysis. Both of these kinds of analysis can be seen, for instance, in the application of hierarchical clustering to [[Taxonomy (biology)|biological taxonomy]]. In this application, different living things are grouped into clusters at different scales or levels of similarity ([[Taxonomic rank|species, genus, family, etc.]]). This analysis simultaneously gives a multi-scale grouping of the organisms of the present age, and aims to accurately reconstruct the branching process or [[Phylogenetic tree|evolutionary tree]] that in past ages produced these organisms.<ref>{{citation
| last = Gordon | first = Allan D.
| editor1-last = Arabie | editor1-first = P.
| editor2-last = Hubert | editor2-first = L. J.
| editor3-last = De Soete | editor3-first = G.
| contribution = Hierarchical clustering
| contribution-url = https://books.google.com/books?id=HbfsCgAAQBAJ&pg=PA65
| isbn = 9789814504539
| ___location = River Edge, NJ
| pages = 65–121
| publisher = World Scientific
| title = Clustering and Classification
| year = 1996}}.</ref>
The input to a clustering problem consists of a set of points.<ref name="murtagh-tcj"/> A ''cluster'' is any proper subset of the points, and a hierarchical clustering is a [[maximal element|maximal]] family of clusters with the property that any two clusters in the family are either nested or [[disjoint set|disjoint]].
Alternatively, a hierarchical clustering may be represented as a [[binary tree]] with the points at its leaves; the clusters of the clustering are the sets of points in subtrees descending from each node of the tree.<ref>{{citation|title=Clustering|volume=10|series=IEEE Press Series on Computational Intelligence|first1=Rui|last1=Xu|first2=Don|last2=Wunsch|publisher=John Wiley & Sons|year=2008|isbn=978-0-470-38278-3|page=31|contribution-url=https://books.google.com/books?id=kYC3YCyl_tkC&pg=PA31|contribution=3.1 Hierarchical Clustering: Introduction}}.</ref>
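The defining property of a hierarchical clustering, that every two clusters are either nested or disjoint, can be tested directly. As a minimal illustrative sketch (not taken from the sources cited here), the following Python function represents each cluster as a set of points and checks all pairs:

```python
def is_hierarchical(clusters):
    """Return True if every pair of clusters is nested or disjoint."""
    clusters = [frozenset(c) for c in clusters]
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            # Valid pairs: a inside b, b inside a, or no shared points.
            if not (a <= b or b <= a or a.isdisjoint(b)):
                return False
    return True

# A valid hierarchy over six points: singletons plus nested merges,
# matching the binary-tree picture (each internal node = one cluster).
hierarchy = [{1}, {2}, {3}, {4}, {5}, {6},
             {1, 2}, {3, 4}, {1, 2, 3, 4}, {5, 6}, {1, 2, 3, 4, 5, 6}]
```

Two clusters such as {1, 2} and {2, 3} would fail the test, since they overlap without either containing the other.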
The distance or dissimilarity should be symmetric: the distance between two points does not depend on which of them is considered first.
However, unlike the distances in a [[metric space]], it is not required to satisfy the [[triangle inequality]].<ref name="murtagh-tcj"/>
Next, the dissimilarity function is extended from pairs of points to pairs of clusters. Different clustering methods perform this extension in different ways. For instance, in the [[single-linkage clustering]] method, the distance between two clusters is defined to be the minimum distance between any two points from each cluster. Given this distance between clusters, a hierarchical clustering may be defined by a [[greedy algorithm]] that initially places each point in its own single-point cluster and then repeatedly forms a new cluster by merging the [[closest pair]] of clusters.<ref name="murtagh-tcj"/>
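The greedy merging process described above can be sketched in a few lines of Python. This is an illustrative, deliberately naive implementation (the function name and representation are choices made here, not taken from the sources): it repeatedly finds the closest pair of clusters under the single-linkage distance and merges them, recording the sequence of merges that forms the tree.

```python
from itertools import combinations

def single_linkage(points, dist):
    """Greedy agglomerative clustering with single-linkage distances.

    Starts with each point in its own cluster and repeatedly merges the
    closest pair of clusters, where the distance between two clusters is
    the minimum dist(p, q) over points p, q drawn one from each cluster.
    Returns the list of merges performed, in order.
    """
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        # Closest pair of clusters under the single-linkage distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: min(dist(p, q)
                                        for p in pair[0] for q in pair[1]))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
        merges.append((a, b))
    return merges

# Example: four points on a line, with absolute difference as distance.
merges = single_linkage([0, 1, 10, 11], lambda p, q: abs(p - q))
```

On this example the algorithm first merges the two nearby pairs {0, 1} and {10, 11}, then joins them into a single cluster. Rescanning all pairs of clusters at every step makes this sketch slow for large inputs; practical implementations avoid the repeated closest-pair searches.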