Consensus clustering: Difference between revisions

 
==The Monti consensus clustering algorithm==
The Monti consensus clustering algorithm<ref>{{Cite journal|last1=Monti|first1=Stefano|last2=Tamayo|first2=Pablo|last3=Mesirov|first3=Jill|last4=Golub|first4=Todd|date=2003-07-01|title=Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data|journal=Machine Learning|language=en|volume=52|issue=1|pages=91–118|doi=10.1023/A:1023949509487|issn=1573-0565|doi-access=free}}</ref> is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, <math>K</math>. Given a dataset of <math>N</math> points to cluster, the algorithm resamples and clusters the data for each candidate <math>K</math>, and an <math>N \times N</math> consensus matrix is calculated, where each element represents the fraction of times two samples clustered together, out of the resampling iterations in which both were selected. A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or never clustering together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal <math>K</math>.
 
More specifically, given a set of points to cluster, <math>D=\{e_1,e_2,\ldots,e_N\}</math>, let <math>D^1,D^2,\ldots,D^H</math> be the list of <math>H</math> perturbed (resampled) datasets of the original dataset <math>D</math>, and let <math>M^h</math> denote the <math>N \times N</math> connectivity matrix resulting from applying a clustering algorithm to the dataset <math>D^h</math>. The entries of <math>M^h</math> are defined as follows:
<math>M^h(i,j)= \begin{cases} 1, & \text{if points } i \text{ and } j \text{ belong to the same cluster} \\ 0, & \text{otherwise} \end{cases}</math>
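
As an illustration of this definition, the following minimal Python sketch builds one connectivity matrix from the cluster labels of a single resampled dataset. The function name <code>connectivity_matrix</code> and its arguments are hypothetical, and the sketch assumes NumPy is available and that each resample is drawn without replacement, so every sampled index appears once.

```python
import numpy as np

def connectivity_matrix(labels, indices, n_points):
    """Build the N x N connectivity matrix for one resampled dataset.

    labels   -- cluster label of each sampled point (hypothetical input)
    indices  -- position of each sampled point in the original dataset
    n_points -- N, the size of the original dataset
    Assumes `indices` contains no duplicates (sampling without replacement).
    """
    labels = np.asarray(labels)
    indices = np.asarray(indices)
    M = np.zeros((n_points, n_points), dtype=int)
    # 1 where two sampled points received the same label, 0 otherwise.
    same = (labels[:, None] == labels[None, :]).astype(int)
    # Write the result back at the original positions; pairs involving
    # a point that was not sampled keep the value 0.
    M[np.ix_(indices, indices)] = same
    return M
```

For example, <code>connectivity_matrix([0, 0, 1], [0, 2, 3], 4)</code> marks points 0 and 2 as co-clustered and leaves every entry for the unsampled point 1 at zero.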
 
Let <math>I^h</math> be the <math>N \times N</math> indicator matrix whose <math>(i,j)</math>-th entry is equal to 1 if points <math>i</math> and <math>j</math> are both present in the perturbed dataset <math>D^h</math>, and 0 otherwise. The indicator matrix keeps track of which samples were selected during each resampling iteration, for use in the normalisation step. The consensus matrix <math>C</math> is defined as the normalised sum of the connectivity matrices of all the perturbed datasets, and a separate consensus matrix is calculated for every <math>K</math>.
 
<math>C(i,j)=\left ( \frac{\textstyle \sum_{h=1}^H M^h(i,j) \displaystyle}{\sum_{h=1}^H I^h(i,j)} \right )</math>
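
The normalised sum can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the reference implementation: the function names, the 80% subsampling fraction, the number of resamples, and the small k-means routine used as the base clusterer are all choices made for the example, and any clustering algorithm could be substituted. Pairs that are never sampled together have an undefined ratio; the sketch sets them to 0.

```python
import numpy as np

def simple_kmeans(X, k, rng, n_iter=20):
    """Tiny Lloyd-style k-means, used only as a stand-in base clusterer."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centres; keep the old centre if a cluster is empty.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def consensus_matrix(X, k, n_resamples=50, frac=0.8, seed=0):
    """Estimate C for one value of K by repeated subsampling."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    rng = np.random.default_rng(seed)
    m_sum = np.zeros((n, n))  # running sum of connectivity matrices M^h
    i_sum = np.zeros((n, n))  # running sum of indicator matrices I^h
    size = max(k, int(round(frac * n)))
    for _ in range(n_resamples):
        # Subsample without replacement (an assumption of this sketch).
        idx = rng.choice(n, size=size, replace=False)
        labels = simple_kmeans(X[idx], k, rng)
        same = labels[:, None] == labels[None, :]
        m_sum[np.ix_(idx, idx)] += same
        i_sum[np.ix_(idx, idx)] += 1
    # Element-wise ratio; pairs never sampled together are left at 0.
    return np.divide(m_sum, i_sum, out=np.zeros_like(m_sum), where=i_sum > 0)
```

Calling <code>consensus_matrix(X, k)</code> for each candidate <math>K</math> gives one matrix per value, and their stability can then be compared.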