Consensus clustering
==Over-interpretation potential of the Monti consensus clustering algorithm==
[[File:PACexplained.png|400px|thumb|PAC measure (proportion of ambiguous clustering) explained. Optimal K is the K with lowest PAC value.]]
Monti consensus clustering can be a powerful tool for identifying clusters, but it must be applied with caution, as shown by Şenbabaoğlu ''et al.''<ref name="SenbabaogluSREP" /> The Monti consensus clustering algorithm is able to claim apparent stability for chance partitionings of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study.<ref name=SenbabaogluSREP>{{cite journal|last=Şenbabaoğlu|first=Y.|author2=Michailidis, G. |author3=Li, J. Z. |title=Critical limitations of consensus clustering in class discovery|journal=Scientific Reports|date=2014|doi=10.1038/srep06207|volume=4|pages=6207|pmid=25158761|pmc=4145288|bibcode=2014NatSR...4E6207.}}</ref><ref name=SenbabaogluRXV>{{cite biorxiv|last=Şenbabaoğlu|first=Y.|author2=Michailidis, G. |author3=Li, J. Z. |title=A reassessment of consensus clustering for class discovery|date=Feb 2014|biorxiv=002642}}</ref> If clusters are not well separated, consensus clustering could lead one to conclude that apparent structure exists when there is none, or to declare cluster stability where it is subtle. Identifying false positive clusters is a common problem throughout cluster research,<ref name=":0">{{Cite journal|last=Liu|first=Yufeng|last2=Hayes|first2=David Neil|last3=Nobel|first3=Andrew|last4=Marron|first4=J. S.|date=2008-09-01|title=Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data|journal=Journal of the American Statistical Association|volume=103|issue=483|pages=1281–1293|doi=10.1198/016214508000000454|issn=0162-1459}}</ref> and has been addressed by methods such as SigClust<ref name=":0" /> and the GAP-statistic.<ref>{{Cite journal|last=Tibshirani|first=Robert|last2=Walther|first2=Guenther|last3=Hastie|first3=Trevor|date=2001|title=Estimating the number of clusters in a data set via the gap statistic|journal=Journal of the Royal Statistical Society: Series B (Statistical Methodology)|language=en|volume=63|issue=2|pages=411–423|doi=10.1111/1467-9868.00293|issn=1467-9868}}</ref> However, these methods rely on certain assumptions for the null model that may not always be appropriate.
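The resampling scheme at the heart of Monti consensus clustering can be reproduced with a short simulation. The following is a minimal sketch, not any published implementation: subsampled k-means runs are accumulated into a consensus matrix, here applied to a null dataset drawn from a single Gaussian (i.e. no true clusters). All function names are illustrative.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=25):
    # Plain Lloyd's k-means; returns one cluster label per row of X.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def consensus_matrix(X, k, n_resamples=100, frac=0.8, seed=0):
    # M[i, j] = (# resamples clustering i and j together)
    #         / (# resamples containing both i and j)
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    co_sampled = np.zeros((n, n))
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = kmeans_labels(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        co_sampled[np.ix_(idx, idx)] += 1.0
    return np.where(co_sampled > 0,
                    together / np.maximum(co_sampled, 1), 0.0)

# Null data: a single multivariate Gaussian, so any cluster
# structure suggested by the consensus matrix is spurious.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
M = consensus_matrix(X, k=2)
```

Even on such null data the heatmap of `M` for a forced `k=2` can show block structure, which is the over-interpretation hazard described above.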
 
To reduce the false positive potential in clustering samples (observations), Şenbabaoğlu ''et al.''<ref name="SenbabaogluSREP" /> recommend (1) doing a formal test of cluster strength using simulated unimodal data with the same [[feature (machine learning)|feature space]] correlation structure as in the empirical data, (2) not relying solely on the consensus matrix heatmap to declare the existence of clusters or to estimate the optimal <math>K</math>, and (3) applying the proportion of ambiguous clustering (PAC) as a simple yet powerful method to infer the optimal <math>K</math>.
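PAC measures the fraction of sample pairs whose consensus values are intermediate, i.e. pairs that are clustered together in some resamples but not others; the <math>K</math> with the lowest PAC is taken as optimal. A minimal sketch, assuming the commonly used ambiguity interval (0.1, 0.9) (`pac` is an illustrative name, not a library function):

```python
import numpy as np

def pac(M, u1=0.1, u2=0.9):
    # Proportion of ambiguous clustering: fraction of off-diagonal
    # consensus entries strictly between u1 and u2.
    vals = M[np.triu_indices_from(M, k=1)]  # each sample pair once
    return float(np.mean((vals > u1) & (vals < u2)))

# A perfectly stable consensus matrix (entries 0 or 1) has PAC = 0;
# a maximally ambiguous one (entries near 0.5) has PAC = 1.
stable = np.array([[1.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
ambiguous = np.full((3, 3), 0.5)
print(pac(stable), pac(ambiguous))  # → 0.0 1.0
```

In practice one computes PAC for each candidate <math>K</math> over a range (e.g. 2 to 10) and selects the value that minimises it.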