{{Short description|Method of result aggregation from multiple clustering algorithms}}
'''Consensus clustering''' is a method of aggregating (potentially conflicting) results from multiple [[clustering algorithm]]s. Also called '''cluster ensembles'''<ref name=StrehlEnsembles>{{cite journal|last1=Strehl|first1=Alexander|authorlink1=Alexander Strehl|author2=Ghosh, Joydeep|title=Cluster ensembles – a knowledge reuse framework for combining multiple partitions|journal=Journal of Machine Learning Research|date=2002|volume=3|pages=583–617|url=http://www.jmlr.org/papers/volume3/strehl02a/strehl02a.pdf|doi=10.1162/153244303321897735|s2cid=3068944|quote=This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. We first identify several application scenarios for the resultant 'knowledge reuse' framework that we call cluster ensembles. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared [[mutual information]]}}</ref> or aggregation of clusterings (or partitions), it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings.<ref name=RuizSurvey2011>{{cite journal|last=Vega-Pons|first=Sandro|author2=Ruiz-Shulcloper, José|s2cid=4643842|journal=International Journal of Pattern Recognition and Artificial Intelligence|date=1 May 2011|volume=25|issue=3|pages=337–372|doi=10.1142/S0218001411008683|title=A Survey of Clustering Ensemble Algorithms}}</ref> Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition and has been shown to be [[NP-complete]],<ref name=Filkov2003>{{cite book|last=Filkov|first=Vladimir|title=Proceedings. 15th IEEE International Conference on Tools with Artificial Intelligence|chapter=Integrating microarray data by consensus clustering|year=2003|pages=418–426|doi=10.1109/TAI.2003.1250220|isbn=978-0-7695-2038-4|citeseerx=10.1.1.116.8271|s2cid=1515525}}</ref> even when the number of input clusterings is three.<ref name=Bonizzoni2008>{{cite journal|last=Bonizzoni|first=Paola|author2=Della Vedova, Gianluca|author3=Dondi, Riccardo|author4=Jiang, Tao|title=On the Approximation of Correlation Clustering and Consensus Clustering|journal=Journal of Computer and System Sciences|volume=74|number=5|year=2008|pages=671–696|doi=10.1016/j.jcss.2007.06.024|doi-access=free}}</ref> Consensus clustering for unsupervised learning is analogous to [[ensemble learning]] in supervised learning.
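
When cast as the median partition problem, the objective can be stated explicitly. As a brief sketch, given input partitions <math>P_1, \ldots, P_m</math> of the same dataset and some distance <math>d</math> between partitions (for example, the Mirkin distance, which counts the pairs of elements placed together in one partition but apart in the other), the consensus partition is

:<math>P^{*} = \underset{P}{\operatorname{arg\,min}}\ \sum_{i=1}^{m} d(P, P_i),</math>

where the minimum runs over all partitions <math>P</math> of the dataset. The choice of distance here is illustrative, as various partition distances appear in the literature; this combinatorial optimization is the NP-complete problem referred to above.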
 
==Issues with existing clustering techniques==
==Over-interpretation potential of the Monti consensus clustering algorithm==
[[File:PACexplained.png|400px|thumb|PAC measure (proportion of ambiguous clustering) explained. Optimal K is the K with lowest PAC value.]]
Monti consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution, as shown by Şenbabaoğlu ''et al.''<ref name="SenbabaogluSREP" /> It has been shown that the Monti consensus clustering algorithm is able to claim apparent stability for chance partitionings of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study.<ref name=SenbabaogluSREP>{{cite journal|last=Şenbabaoğlu|first=Y.|author2=Michailidis, G.|author3=Li, J. Z.|title=Critical limitations of consensus clustering in class discovery|journal=Scientific Reports|date=2014|doi=10.1038/srep06207|volume=4|pages=6207|pmid=25158761|pmc=4145288|bibcode=2014NatSR...4E6207S}}</ref><ref name=SenbabaogluRXV>{{cite bioRxiv|last=Şenbabaoğlu|first=Y.|author2=Michailidis, G.|author3=Li, J. Z.|title=A reassessment of consensus clustering for class discovery|date=Feb 2014|biorxiv=10.1101/002642}}</ref> If clusters are not well separated, consensus clustering could lead one to conclude that there is apparent structure when there is none, or to declare cluster stability where it is subtle. Identifying false-positive clusters is a common problem throughout cluster research,<ref name=":0">{{Cite journal|last1=Liu|first1=Yufeng|last2=Hayes|first2=David Neil|last3=Nobel|first3=Andrew|last4=Marron|first4=J. S.|date=2008-09-01|title=Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data|journal=Journal of the American Statistical Association|volume=103|issue=483|pages=1281–1293|doi=10.1198/016214508000000454|s2cid=120819441|issn=0162-1459}}</ref> and has been addressed by methods such as SigClust<ref name=":0" /> and the GAP-statistic.<ref>{{Cite journal|last1=Tibshirani|first1=Robert|last2=Walther|first2=Guenther|last3=Hastie|first3=Trevor|date=2001|title=Estimating the number of clusters in a data set via the gap statistic|journal=Journal of the Royal Statistical Society, Series B (Statistical Methodology)|language=en|volume=63|issue=2|pages=411–423|doi=10.1111/1467-9868.00293|s2cid=59738652|issn=1467-9868|doi-access=free}}</ref> However, these methods rely on certain assumptions for the null model that may not always be appropriate.
 
Şenbabaoğlu ''et al.''<ref name="SenbabaogluSREP" /> demonstrated that the original delta K metric for deciding <math>K</math> in the Monti algorithm performed poorly, and proposed a new, superior metric for measuring the stability of consensus matrices using their CDF curves. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, and the middle segment represents those with ambiguous assignments in different clustering runs. The proportion of ambiguous clustering (PAC) score quantifies this middle segment; it is defined as the fraction of sample pairs with consensus indices falling in the interval (u<sub>1</sub>, u<sub>2</sub>) ⊂ [0, 1], where u<sub>1</sub> is a value close to 0 and u<sub>2</sub> is a value close to 1 (for instance, u<sub>1</sub> = 0.1 and u<sub>2</sub> = 0.9). A low PAC value indicates a flat middle segment and a low rate of discordant assignments across permuted clustering runs. One can therefore infer the optimal number of clusters as the <math>K</math> value with the lowest PAC.<ref name="SenbabaogluSREP" /><ref name="SenbabaogluRXV" />
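
A minimal sketch of the PAC computation, assuming the consensus matrix for a given <math>K</math> is available as a symmetric [[NumPy]] array (the function name and default interval bounds here are illustrative, not from a published implementation):

<syntaxhighlight lang="python">
import numpy as np

def pac_score(consensus_matrix, u1=0.1, u2=0.9):
    """Proportion of ambiguous clustering (PAC).

    Fraction of sample pairs whose consensus index falls strictly
    inside the ambiguous interval (u1, u2); lower values indicate a
    flatter middle segment of the CDF, i.e. more stable clustering.
    """
    # Each unordered sample pair appears exactly once in the upper triangle.
    n = consensus_matrix.shape[0]
    pair_values = consensus_matrix[np.triu_indices(n, k=1)]
    # PAC = CDF(u2) - CDF(u1): the mass of the CDF's middle segment.
    return np.mean((pair_values > u1) & (pair_values < u2))

# Given consensus matrices for several candidate K, pick the K with
# the lowest PAC, e.g.:
# best_k = min(matrices, key=lambda k: pac_score(matrices[k]))
</syntaxhighlight>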
==Related work==
#'''Clustering ensemble (Strehl and Ghosh)''': They considered various formulations for the problem, most of which reduce the problem to a [[hypergraph]] partitioning problem. In one of their formulations they considered the same graph as in the correlation clustering problem. The solution they proposed is to compute the best ''k''-partition of the graph, which does not take into account the penalty for merging two nodes that are far apart.<ref name=StrehlEnsembles/>
#'''Clustering aggregation (Fern and Brodley)''': They applied the clustering aggregation idea to a collection of [[soft clustering]]s they obtained by random projections. They used an agglomerative algorithm and did not penalize for merging dissimilar nodes.<ref>{{cite journal|author1=Fern, Xiaoli|author2=Brodley, Carla|year=2004|title=Cluster ensembles for high dimensional clustering: an empirical study|journal=J Mach Learn Res|volume=22|url=https://www.researchgate.net/publication/228476517}}</ref>
#'''Fred and Jain''': They proposed using a single-linkage algorithm to combine multiple runs of the ''k''-means algorithm, accumulating co-association evidence across runs (see the sketch after this list).<ref name="Fred Jain 2005 pp. 835–850">{{cite journal | last1=Fred | first1=Ana L.N. | last2=Jain | first2=Anil K. | title=Combining multiple clusterings using evidence accumulation | journal=IEEE Transactions on Pattern Analysis and Machine Intelligence | publisher=Institute of Electrical and Electronics Engineers (IEEE) | volume=27 | issue=6 | year=2005 | issn=0162-8828 | doi=10.1109/tpami.2005.113 | pages=835–850|pmid= 15943417| s2cid=10316033 |url=http://dataclustering.cse.msu.edu/papers/TPAMI-0239-0504.R1.pdf}}</ref>
#'''Dana Cristofor and Dan Simovici''': They observed the connection between clustering aggregation and clustering of [[categorical variable|categorical data]]. They proposed information-theoretic distance measures and [[genetic algorithm]]s for finding the best aggregation solution.<ref>{{cite journal|author=Dana Cristofor, Dan Simovici|title=Finding Median Partitions Using Information-Theoretical-Based Genetic Algorithms|journal=Journal of Universal Computer Science|volume=8|issue=2|pages=153–172|url=https://www.jucs.org/jucs_8_2/finding_median_partitions_using/Cristofor_D.pdf|date=February 2002|doi=10.3217/jucs-008-02-0153}}</ref>
#* Compete for Objects
#'''{{Proper name|sHBGF}}''': represents the ensemble as a [[bipartite graph]] with clusters and instances as nodes, and edges between the instances and the clusters they belong to.<ref>Solving cluster ensemble problems by bipartite graph partitioning, Xiaoli Zhang Fern and [[Carla Brodley]], Proceedings of the twenty-first international conference on Machine learning</ref> This approach can be trivially adapted to consider soft ensembles, since the graph partitioning algorithm METIS accepts weights on the edges of the graph to be partitioned. In sHBGF, the graph has ''n''&nbsp;+&nbsp;''t'' vertices, where ''t'' is the total number of underlying clusters.
#'''Bayesian consensus clustering (BCC)''': defines a fully [[Bayesian probability|Bayesian]] model for soft consensus clustering in which multiple source clusterings, defined by different input data or different probability models, are assumed to adhere loosely to a consensus clustering.<ref name=LockBCC>{{cite journal|last=Lock|first=E.F.|author2=Dunson, D.B.|title=Bayesian consensus clustering|journal=Bioinformatics|date=2013|doi=10.1093/bioinformatics/btt425|pmid=23990412|pmc=3789539|volume=29|number=20|pages=2610–2616|arxiv=1302.7280|bibcode=2013Bioin..29.2610L}}</ref> The full posteriors for the separate clusterings and for the consensus clustering are inferred simultaneously via [[Gibbs sampling]].
#'''Ensemble Clustering Fuzzification Means (ECF-Means)''': ECF-means is a clustering algorithm that combines the clustering results of an ensemble, obtained from different runs of a chosen algorithm ([[k-means]]), into a single final clustering configuration.<ref name=ZazzECF>{{cite journal|last=Zazzaro|first=Gaetano|author2=Martone, Angelo|title=ECF-means - Ensemble Clustering Fuzzification Means. A novel algorithm for clustering aggregation, fuzzification, and optimization|journal=IMM 2018: The Eighth International Conference on Advances in Information Mining and Management|date=2018|url=https://www.thinkmind.org/articles/immm_2018_2_10_50010.pdf}}</ref>
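
As a rough sketch of the evidence-accumulation approach of Fred and Jain described above, the following fragment builds a co-association matrix from repeated ''k''-means runs and extracts a consensus partition with single linkage. The use of scikit-learn and SciPy, and parameters such as the number of base runs and base clusters, are illustrative assumptions rather than the authors' implementation:

<syntaxhighlight lang="python">
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

def evidence_accumulation(X, n_runs=30, k_base=10, k_final=3, seed=0):
    """Consensus partition from a co-association matrix.

    Runs k-means n_runs times, records how often each pair of samples
    lands in the same cluster, then cuts a single-linkage dendrogram
    built on 1 - co-association into k_final clusters.
    """
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    for _ in range(n_runs):
        labels = KMeans(n_clusters=k_base, n_init=1,
                        random_state=rng.randint(2**31 - 1)).fit_predict(X)
        # 1 where a pair co-clusters in this run, 0 otherwise.
        coassoc += labels[:, None] == labels[None, :]
    coassoc /= n_runs
    # Turn the pairwise similarity into a condensed distance matrix.
    dist = 1.0 - coassoc
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="single")
    return fcluster(tree, t=k_final, criterion="maxclust")
</syntaxhighlight>

Here the number of base clusters is deliberately larger than the final number of clusters, in the spirit of splitting the data finely before letting the accumulated evidence merge natural groups back together.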
 
 
[[Category:Cluster analysis]]
[[Category:NP-complete problems]]