'''Consensus clustering''' is a method of aggregating (potentially conflicting) results from multiple [[clustering algorithm]]s. Also called '''cluster ensembles'''<ref name=StrehlEnsembles>{{cite journal|last1=Strehl|first1=Alexander|last2=Ghosh|first2=Joydeep|title=Cluster ensembles – a knowledge reuse framework for combining multiple partitions|journal=Journal of Machine Learning Research|year=2002|volume=3|pages=583–617}}</ref> or '''aggregation of clustering''', it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings.
==Issues with existing clustering techniques==
* Current clustering techniques do not address all the requirements adequately.
* Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity.
* The effectiveness of the method depends on the definition of "[[distance]]" (for distance-based clustering).
* If an obvious distance measure doesn't exist, we must "define" it, which is not always easy, especially in multidimensional spaces.
* The result of the clustering algorithm (which, in many cases, can be arbitrary itself) can be interpreted in different ways.
==Related work==
#'''Clustering aggregation (Fern and Brodley)''': They applied the clustering aggregation idea to a collection of [[soft clustering]]s they obtained by random projections. They used an agglomerative algorithm and did not penalize for merging dissimilar nodes.{{citation needed|date=July 2020}}
#'''Dana Cristofor and Dan Simovici''': They observed the connection between clustering aggregation and clustering of [[categorical data]]. They proposed information-theoretic distance measures, and [[genetic algorithm]]s for finding the best aggregation solution.{{citation needed|date=July 2020}}
== Hard ensemble clustering ==
=== Efficient consensus functions ===
#'''Cluster-based similarity partitioning algorithm (CSPA)''': In CSPA the similarity between two data-points is defined to be directly proportional to the number of constituent clusterings of the ensemble in which they are clustered together. The intuition is that the more similar two data-points are, the higher is the chance that constituent clusterings will place them in the same cluster. CSPA is the simplest heuristic, but its computational and storage complexity are both quadratic in ''n'' (see the sketch after this list). [http://bioconductor.org/packages/release/bioc/html/SC3.html SC3] is an example of a CSPA-type algorithm.<ref>{{Cite journal|last1=Kiselev|first1=Vladimir Yu|last2=Kirschner|first2=Kristina|last3=Schaub|first3=Michael T|last4=Andrews|first4=Tallulah|last5=Yiu|first5=Andrew|last6=Chandra|first6=Tamir|last7=Natarajan|first7=Kedar N|last8=Reik|first8=Wolf|last9=Barahona|first9=Mauricio|last10=Green|first10=Anthony R|last11=Hemberg|first11=Martin|date=May 2017|title=SC3: consensus clustering of single-cell RNA-seq data|journal=Nature Methods|language=en|volume=14|issue=5|pages=483–486|doi=10.1038/nmeth.4236|issn=1548-7091|pmc=5410170|pmid=28346451}}</ref> The following two methods are computationally less expensive:
#'''Hyper-graph partitioning algorithm (HGPA)''': The HGPA algorithm takes a very different approach to finding the consensus clustering than the previous method. The cluster ensemble problem is formulated as partitioning the hypergraph by cutting a minimal number of hyperedges. It makes use of [http://glaros.dtc.umn.edu/gkhome/metis/hmetis/overview hMETIS], a hypergraph partitioning package.
#'''Meta-clustering algorithm (MCLA)''': The meta-clustering algorithm (MCLA) is based on clustering clusters. First, it tries to solve the cluster correspondence problem and then uses voting to place data-points into the final consensus clusters. The cluster correspondence problem is solved by grouping the clusters identified in the individual clusterings of the ensemble. The clustering is performed using [http://glaros.dtc.umn.edu/gkhome/views/metis METIS] and spectral clustering.
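The quadratic cost of CSPA comes from materializing the full ''n''&nbsp;×&nbsp;''n'' co-association matrix. The following is a minimal sketch of that construction, assuming each base clustering is given as a vector of integer labels over the same ''n'' points; it is not the implementation used by SC3 or the original authors, all names are illustrative, and the final hierarchical-clustering step stands in for any similarity-based clusterer:

<syntaxhighlight lang="python">
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cspa_consensus(labelings, k):
    """Toy CSPA: cluster the n-by-n co-association matrix.

    labelings -- list of 1-D integer label arrays, one per base clustering
    k         -- desired number of consensus clusters
    """
    labelings = np.asarray(labelings)          # shape (r, n)
    r, n = labelings.shape
    # Similarity = fraction of base clusterings that co-cluster points i and j.
    S = np.zeros((n, n))
    for labels in labelings:
        S += labels[:, None] == labels[None, :]
    S /= r
    # Any similarity-based clusterer can consume S; here, average-linkage
    # hierarchical clustering on the distances 1 - S (condensed form).
    condensed = (1.0 - S)[np.triu_indices(n, 1)]
    return fcluster(linkage(condensed, method="average"),
                    t=k, criterion="maxclust")

# Three noisy base clusterings of six points:
base = [[0, 0, 0, 1, 1, 1],
        [0, 0, 1, 1, 1, 1],
        [1, 1, 1, 0, 0, 0]]
print(cspa_consensus(base, k=2))               # e.g. [1 1 1 2 2 2]
</syntaxhighlight>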
== Soft clustering ensembles ==
''Punera'' and ''Ghosh'' extended the idea of hard clustering ensembles to the soft clustering scenario. Each instance in a soft ensemble is represented by a concatenation of ''r'' posterior membership probability distributions obtained from the constituent clustering algorithms. We can define a distance measure between two instances using the [[Kullback–Leibler divergence|Kullback–Leibler (KL) divergence]], which calculates the "distance" between two probability distributions.<ref>Kunal Punera, Joydeep Ghosh. [https://web.archive.org/web/20081201150950/http://www.ideal.ece.utexas.edu/papers/2007/punera07softconsensus.pdf Consensus Based Ensembles of Soft Clusterings]</ref>
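As an illustration of this representation, the sketch below computes a KL-based distance between two instances of a soft ensemble. The symmetrization of the (asymmetric) KL divergence and all names are choices made for the example, not necessarily those of Punera and Ghosh:

<syntaxhighlight lang="python">
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between discrete distributions."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def soft_instance_distance(x, y, blocks):
    """Distance between two instances of a soft ensemble.

    x, y   -- concatenations of r posterior membership distributions
    blocks -- (start, end) index range of each constituent distribution
    KL is asymmetric, so each block's divergence is symmetrized here
    (an illustrative choice).
    """
    return sum(0.5 * (kl(x[lo:hi], y[lo:hi]) + kl(y[lo:hi], x[lo:hi]))
               for lo, hi in blocks)

# Two instances described by r = 2 base clusterings (3 and 2 clusters):
x = np.array([0.7, 0.2, 0.1, 0.9, 0.1])
y = np.array([0.1, 0.3, 0.6, 0.2, 0.8])
print(soft_instance_distance(x, y, blocks=[(0, 3), (3, 5)]))
</syntaxhighlight>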
#'''sCSPA''': extends CSPA by calculating a similarity matrix. Each object is visualized as a point in a multidimensional space, with each dimension corresponding to the probability of its belonging to a cluster. This technique first transforms the objects into a label-space and then interprets the dot product between the vectors representing the objects as their similarity.
#'''sMCLA''': extends MCLA by accepting soft clusterings as input. sMCLA's working can be divided into the following steps:
#* Construct Soft Meta-Graph of Clusters
#* Group the Clusters into Meta-Clusters
#* Collapse Meta-Clusters using Weighting
#* Compete for Objects
#'''sHBGF''': HBGF represents the ensemble as a bipartite graph with clusters and instances as nodes, and edges between the instances and the clusters they belong to.<ref>Solving cluster ensemble problems by bipartite graph partitioning, Xiaoli Zhang Fern and [[Carla Brodley]], Proceedings of the twenty-first international conference on Machine learning</ref> This approach can be trivially adapted to consider soft ensembles, since the graph partitioning algorithm METIS accepts weights on the edges of the graph to be partitioned. In sHBGF, the graph has ''n'' + ''t'' vertices, where ''t'' is the total number of underlying clusters (see the sketch after this list).
#'''Bayesian consensus clustering (BCC)''': defines a fully [[Bayesian probability|Bayesian]] model for soft consensus clustering in which multiple source clusterings, defined by different input data or different probability models, are assumed to adhere loosely to a consensus clustering.<ref name=LockBCC>{{cite journal|last=Lock|first=E.F.|author2=Dunson, D.B. |title=Bayesian consensus clustering|journal=Bioinformatics|date=2013|doi=10.1093/bioinformatics/btt425|pmid=23990412|pmc=3789539|volume=29|number=20|pages=2610–2616|arxiv=1302.7280|bibcode=2013arXiv1302.7280L}}</ref> The full posterior for the separate clusterings, and the consensus clustering, are inferred simultaneously via Gibbs sampling.
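The bipartite construction behind sHBGF is simple to state in code. The sketch below builds the (''n''&nbsp;+&nbsp;''t'')-vertex graph as a dense adjacency matrix with membership probabilities as edge weights; the partitioning step (e.g., with METIS) is omitted, and the dense-matrix representation is an illustrative simplification rather than the authors' implementation:

<syntaxhighlight lang="python">
import numpy as np

def shbgf_adjacency(memberships):
    """Build the sHBGF bipartite graph as an adjacency matrix.

    memberships -- list of r arrays, each of shape (n, k_q): the soft
    membership of every instance in base clustering q.  The graph has
    n instance vertices plus t cluster vertices, where t is the total
    number of clusters across the ensemble; edge weights are the
    membership probabilities.
    """
    n = memberships[0].shape[0]
    W = np.hstack(memberships)          # shape (n, t)
    t = W.shape[1]
    A = np.zeros((n + t, n + t))
    A[:n, n:] = W                       # instance -> cluster edges
    A[n:, :n] = W.T                     # symmetric: cluster -> instance
    return A  # pass to a graph partitioner such as METIS

# Two soft base clusterings of four points (2 and 3 clusters, t = 5):
m1 = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
m2 = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
               [0.1, 0.2, 0.7], [0.1, 0.1, 0.8]])
print(shbgf_adjacency([m1, m2]).shape)  # (9, 9): n + t = 4 + 5
</syntaxhighlight>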
== References ==
<references />
* Aristides Gionis, [[Heikki Mannila]], Panayiotis Tsaparas. [https://web.archive.org/web/20060828084525/http://www.cs.helsinki.fi/u/tsaparas/publications/aggregated-journal.pdf Clustering Aggregation]. 21st International Conference on Data Engineering (ICDE 2005)
* Hongjun Wang, Hanhuai Shan, Arindam Banerjee. [http://www.siam.org/proceedings/datamining/2009/SDM09_022_wangh.pdf Bayesian Cluster Ensembles]{{Dead link|date=November 2019 |bot=InternetArchiveBot |fix-attempted=yes }}, SIAM International Conference on Data Mining, SDM 09
* Nam Nguyen, Rich Caruana. Consensus Clusterings. Seventh IEEE International Conference on Data Mining.
* Alexander Topchy, Anil K. Jain, William Punch. [http://dataclustering.cse.msu.edu/papers/TPAMI-ClusteringEnsembles.pdf Clustering Ensembles: Models of Consensus and Weak Partitions]. IEEE International Conference on Data Mining, ICDM 03 & SIAM International Conference on Data Mining, SDM 04
[[Category:Cluster analysis]]