Revision as of 18:19, 18 April 2020 by CitationCleanerBot (m clean up, replaced: biorxiv=002642 → biorxiv=10.1101/002642; Tag: AWB)
Revision as of 02:02, 6 June 2020 by Citation bot (Add: s2cid, author pars. 1-1. Removed URL that duplicated unique identifier. Removed parameters. Some additions/deletions were actually parameter name changes. Activated by AManWithNoPlan; all pages linked from User:AManWithNoPlan/sandbox2; via #UCB_webform_linked)
Line 1:
'''Consensus clustering''' is an important elaboration of traditional [[cluster analysis]]. Consensus clustering, also called '''cluster ensembles'''<ref name=StrehlEnsembles>{{cite journal\|last=Strehl\|first=Alexander\|author2=Ghosh, Joydeep\|title=Cluster ensembles – a knowledge reuse framework for combining multiple partitions\|journal=Journal of Machine Learning Research (JMLR)\|date=2002\|volume=3\|pages=583–617\|url=http://www.jmlr.org/papers/volume3/strehl02a/strehl02a.pdf}}</ref> or aggregation of clusterings (or partitions), refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings.<ref name=RuizSurvey2011>{{cite journal\|last=VEGA-PONS\|first=SANDRO\|author2=RUIZ-SHULCLOPER, JOSÉ\|s2cid=4643842\|journal=International Journal of Pattern Recognition and Artificial Intelligence\|date=1 May 2011\|volume=25\|issue=3\|pages=337–372\|doi=10.1142/S0218001411008683\|title=A Survey of Clustering Ensemble Algorithms~~\|url=https://semanticscholar.org/paper/0d1b7d01fb2634b6160a96bbdd73f918ed3859cb~~}}</ref> Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm.
When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be [[NP-complete]],<ref name=Filkov2003>{{cite book\|last=Filkov\|first=Vladimir\|title=Integrating microarray data by consensus clustering\|journal=In Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence.\|year=2003\|pages=418–426\|doi=10.1109/TAI.2003.1250220\|isbn=978-0-7695-2038-4\|citeseerx=10.1.1.116.8271}}</ref> even when the number of input clusterings is three.<ref name=Bonizzoni2008>{{cite journal\|last=Bonizzoni\|first=Paola\|author2=Della Vedova, Gianluca\| author3= Dondi, Riccardo\| author4= Jiang, Tao\| title=On the Approximation of Correlation Clustering and Consensus Clustering\|journal=Journal of Computer and System Sciences\|volume=74\|number=5\|year=2008\|pages=671–696\|doi=10.1016/j.jcss.2007.06.024}}</ref> Consensus clustering for unsupervised learning is analogous to [[ensemble learning]] in supervised learning.

==Issues with existing clustering techniques==

Line 13:
==The Monti consensus clustering algorithm==
The Monti consensus clustering algorithm<ref>{{Cite journal\|~~last~~last1=Monti\|~~first~~first1=Stefano\|last2=Tamayo\|first2=Pablo\|last3=Mesirov\|first3=Jill\|last4=Golub\|first4=Todd\|date=2003-07-01\|title=Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data\|journal=Machine Learning\|language=en\|volume=52\|issue=1\|pages=91–118\|doi=10.1023/A:1023949509487\|issn=1573-0565\|doi-access=free}}</ref> is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, <math>K</math>. Given a dataset of <math>N</math> points to cluster, the algorithm resamples and clusters the data for each candidate <math>K</math>, and computes an <math>N \times N</math> consensus matrix in which each element is the fraction of times two samples clustered together.
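The resampling procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the reference implementation: the function names, the 80% subsampling fraction, the number of resamples, and the tiny one-dimensional k-means used as the base clusterer are all hypothetical choices made here. Each entry is normalized by the number of resamples in which both points were drawn, which is one common way to compute the "fraction of times" two samples clustered together.

```python
import random


def kmeans_labels(points, k, rng, iters=20):
    """Tiny Lloyd-style k-means on 1-D points (stand-in for any base clusterer)."""
    centers = rng.sample(points, k)  # initial centers: k distinct sampled points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        # Move each center to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def consensus_matrix(data, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of resamples in which each pair of points shared a cluster.

    n_resamples, frac and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    sampled = [[0] * n for _ in range(n)]   # times i and j were both drawn
    for _ in range(n_resamples):
        idx = rng.sample(range(n), max(k, int(frac * n)))  # subsample without replacement
        labels = kmeans_labels([data[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    # Pairs never drawn together get 0.0 rather than a division by zero.
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    toy = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]  # two well-separated groups
    for row in consensus_matrix(toy, k=2):
        print([round(x, 2) for x in row])
```

Running the sketch for several values of `k` and comparing the resulting matrices is the step that would feed a stability criterion; the choice of base clusterer and subsampling scheme is left open here.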
A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or not together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal <math>K</math>.

More specifically, given a set of points to cluster, <math>D=\{e_1,e_2,\ldots,e_N\}</math>, let <math>D^1,D^2,\ldots,D^H</math> be the list of <math>H</math> perturbed (resampled) datasets of the original dataset <math>D</math>, and let <math>M^h</math> denote the <math>N \times N</math> connectivity matrix resulting from applying a clustering algorithm to the dataset <math>D^h</math>. The entries of <math>M^h</math> are defined as follows:

Line 27:
==Over-interpretation potential of the Monti consensus clustering algorithm==
[[File:PACexplained.png\|400px\|thumb\|PAC measure (proportion of ambiguous clustering) explained. Optimal K is the K with lowest PAC value.]]
Monti consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution, as shown by Şenbabaoğlu ''et al.''<ref name="SenbabaogluSREP" /> It has been shown that the Monti consensus clustering algorithm is able to claim apparent stability of chance partitioning of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study.<ref name=SenbabaogluSREP>{{cite journal\|last=Şenbabaoğlu\|first=Y.\|author2=Michailidis, G. \|author3=Li, J. Z. \|title=Critical limitations of consensus clustering in class discovery\|journal=Scientific Reports\|date=2014\|doi=10.1038/srep06207\|volume=4\|pages=6207\|pmid=25158761\|pmc=4145288\|bibcode=2014NatSR...4E6207.}}</ref><ref name=SenbabaogluRXV>{{cite biorxiv\|last=Şenbabaoğlu\|first=Y.\|author2=Michailidis, G. \|author3=Li, J. Z. 
\|title=A reassessment of consensus clustering for class discovery\|date=Feb 2014\|biorxiv=10.1101/002642}}</ref> If clusters are not well separated, consensus clustering could lead one to conclude apparent structure when there is none, or declare cluster stability when it is subtle. Identifying false positive clusters is a common problem throughout cluster research,<ref name=":0">{{Cite journal\|~~last~~last1=Liu\|~~first~~first1=Yufeng\|last2=Hayes\|first2=David Neil\|last3=Nobel\|first3=Andrew\|last4=Marron\|first4=J. S.\|date=2008-09-01\|title=Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data\|journal=Journal of the American Statistical Association\|volume=103\|issue=483\|pages=1281–1293\|doi=10.1198/016214508000000454\|issn=0162-1459}}</ref> and has been addressed by methods such as SigClust<ref name=":0" /> and the GAP-statistic.<ref>{{Cite journal\|~~last~~last1=Tibshirani\|~~first~~first1=Robert\|last2=Walther\|first2=Guenther\|last3=Hastie\|first3=Trevor\|date=2001\|title=Estimating the number of clusters in a data set via the gap statistic\|journal=Journal of the Royal Statistical Society: Series B (Statistical Methodology)\|language=en\|volume=63\|issue=2\|pages=411–423\|doi=10.1111/1467-9868.00293\|issn=1467-9868}}</ref> However, these methods rely on certain assumptions for the null model that may not always be appropriate.

Şenbabaoğlu ''et al.''<ref name="SenbabaogluSREP" /> demonstrated that the original delta K metric used to decide <math>K</math> in the Monti algorithm performed poorly, and proposed a new, superior metric for measuring the stability of consensus matrices using their CDF curves. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, and the middle segment represents those with ambiguous assignments in different clustering runs.
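To make the three segments of the CDF curve concrete, the following Python sketch evaluates the empirical CDF of the consensus indices and reports the share of sample pairs in the lower, middle and upper segments. It is a hedged illustration only: the function names, the cut-offs 0.1 and 0.9, and the two toy matrices are assumptions introduced here, not taken from the cited papers.

```python
def pair_indices(cm):
    """Consensus indices of the distinct sample pairs (upper triangle only)."""
    n = len(cm)
    return [cm[i][j] for i in range(n) for j in range(i + 1, n)]


def cdf(values, c):
    """Empirical CDF: fraction of sample pairs with consensus index <= c."""
    return sum(v <= c for v in values) / len(values)


def segment_shares(cm, u1=0.1, u2=0.9):
    """Shares of pairs that are rarely, ambiguously, or almost always co-clustered.

    u1 and u2 are illustrative cut-offs close to 0 and 1.
    """
    vals = pair_indices(cm)
    lower = cdf(vals, u1)                                  # rarely together
    middle = sum(u1 < v < u2 for v in vals) / len(vals)    # ambiguous pairs
    upper = sum(v >= u2 for v in vals) / len(vals)         # almost always together
    return lower, middle, upper


# Toy consensus matrices (hypothetical): one perfectly stable, one fully ambiguous.
stable = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
ambiguous = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    print(segment_shares(stable))
    print(segment_shares(ambiguous))
```

In the stable toy matrix every pair falls in the lower or upper segment, so the CDF is flat between the two cut-offs; in the ambiguous one all pairs fall in the middle.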
The proportion of ambiguous clustering (PAC) measure quantifies this middle segment, and is defined as the fraction of sample pairs with consensus indices falling in the interval (u<sub>1</sub>, u<sub>2</sub>) ⊂ [0, 1], where u<sub>1</sub> is a value close to 0 and u<sub>2</sub> is a value close to 1 (for instance u<sub>1</sub>=0.1 and u<sub>2</sub>=0.9). A low value of PAC indicates a flat middle segment, and a low rate of discordant assignments across permuted clustering runs. One can therefore infer the optimal number of clusters as the <math>K</math> value having the lowest PAC.<ref name="SenbabaogluSREP" /><ref name="SenbabaogluRXV" />

Line 50:
'''1. Cluster-based similarity partitioning algorithm (CSPA)'''

In CSPA the similarity between two data-points is defined to be directly proportional to the number of constituent clusterings of the ensemble in which they are clustered together. The intuition is that the more similar two data-points are, the higher the chance that constituent clusterings will place them in the same cluster. CSPA is the simplest heuristic, but its computational and storage complexity are both quadratic in ''n''. [http://bioconductor.org/packages/release/bioc/html/SC3.html SC3] is an example of a CSPA type algorithm.<ref>{{Cite journal\|~~last~~last1=Kiselev\|~~first~~first1=Vladimir Yu\|last2=Kirschner\|first2=Kristina\|last3=Schaub\|first3=Michael T\|last4=Andrews\|first4=Tallulah\|last5=Yiu\|first5=Andrew\|last6=Chandra\|first6=Tamir\|last7=Natarajan\|first7=Kedar N\|last8=Reik\|first8=Wolf\|last9=Barahona\|first9=Mauricio\|last10=Green\|first10=Anthony R\|last11=Hemberg\|first11=Martin\|date=May 2017\|title=SC3: consensus clustering of single-cell RNA-seq data~~\|url=http://www.nature.com/articles/nmeth.4236~~\|journal=Nature Methods\|language=en\|volume=14\|issue=5\|pages=483–486\|doi=10.1038/nmeth.4236\|issn=1548-7091\|pmc=5410170\|pmid=28346451}}</ref>

The following two methods are computationally less expensive:

'''2. Hyper-graph partitioning algorithm (HGPA)'''

Consensus clustering: Difference between revisions