{{Short description|Method of result aggregation from multiple clustering algorithms}}
'''Consensus clustering''' is a method of aggregating (potentially conflicting) results from multiple [[clustering
==Issues with existing clustering techniques==
* Current clustering techniques do not address all the requirements adequately.
* Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity.
* The effectiveness of the method depends on the definition of "[[distance]]" (for distance-based clustering).
* If an obvious distance measure doesn't exist, we must "define" it, which is not always easy, especially in multidimensional spaces.
* The result of the clustering algorithm (which, in many cases, can itself be arbitrary) can be interpreted in different ways.
==The Monti consensus clustering algorithm==
The Monti consensus clustering algorithm<ref>{{Cite journal|last1=Monti|first1=Stefano|last2=Tamayo|first2=Pablo|last3=Mesirov|first3=Jill|last4=Golub|first4=Todd|date=2003-07-01|title=Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data|journal=Machine Learning|language=en|volume=52|issue=1|pages=91–118|doi=10.1023/A:1023949509487|issn=1573-0565|doi-access=free}}</ref> is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, <math>K</math>. Given a dataset of <math>N</math> total number of points to cluster, this algorithm works by resampling and clustering the data, for each <math>K</math> and a <math>
More specifically, given a set of points to cluster, <math>D=\{e_1,e_2,...e_N\}</math>, let <math>D^1,D^2,...,D^H</math> be the list of <math>H</math>
<math>M^h(i,j)= \begin{cases} 1, & \text{if}\text{ points i and j belong to the same cluster} \\ 0, & \text{otherwise} \end{cases}</math>
Let <math>I^h</math> be the <math>
<math>C(i,j)=\frac{\sum_{h=1}^H M^h(i,j)}{\sum_{h=1}^H I^h(i,j)}</math>
That is, the entry <math>(i,j)</math> in the consensus matrix is the number of times points <math>i</math> and <math>j</math> were clustered together divided by the total number of times they were selected together. The matrix is symmetric, and each element lies in the range <math>[0,1]</math>. A consensus matrix is calculated for each <math>K</math> to be tested, and the stability of each matrix, that is, how close it is to a matrix of perfect stability (containing only zeros and ones), is used to determine the optimal <math>K</math>. One way of quantifying the stability of the <math>K</math>th consensus matrix is examining
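The consensus matrix defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the published implementation: the function name <code>consensus_matrix</code>, the representation of each resampled run as a dictionary from point index to cluster label, and the conventions for the diagonal and for pairs that were never sampled together are all assumptions made for the example.

```python
from itertools import combinations


def consensus_matrix(n_points, runs):
    """Build C(i, j) from resampled clustering runs.

    runs: list of dicts, one per resampled dataset D^h, mapping the index of
    each sampled point to its cluster label (hypothetical input format).
    """
    together = [[0] * n_points for _ in range(n_points)]  # sum over h of M^h(i, j)
    sampled = [[0] * n_points for _ in range(n_points)]   # sum over h of I^h(i, j)

    for labels in runs:
        # Only pairs that were both drawn into this resample contribute.
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1

    consensus = [[0.0] * n_points for _ in range(n_points)]
    for i in range(n_points):
        for j in range(n_points):
            if i == j:
                # Assumption: a point always co-clusters with itself.
                consensus[i][j] = 1.0
            elif sampled[i][j] > 0:
                consensus[i][j] = together[i][j] / sampled[i][j]
            # Assumption: pairs never sampled together keep the value 0.0.
    return consensus
```

In practice one such matrix would be built for each candidate <math>K</math>, with the labels in each run produced by the chosen clustering algorithm on the corresponding resample.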
==Over-interpretation potential of the Monti consensus clustering algorithm==
[[File:PACexplained.png|400px|thumb|PAC measure (proportion of ambiguous clustering) explained. Optimal K is the K with lowest PAC value.]]
Monti consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution, as shown by Şenbabaoğlu ''et al.''<ref name="SenbabaogluSREP" /> The algorithm can report apparent stability for chance partitionings of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study.<ref name=SenbabaogluSREP>{{cite journal|last=Şenbabaoğlu|first=Y.|author2=Michailidis, G. |author3=Li, J. Z. |title=Critical limitations of consensus clustering in class discovery|journal=Scientific Reports|date=2014|doi=10.1038/srep06207|volume=4|pages=6207|pmid=25158761|pmc=4145288|bibcode=2014NatSR...
Şenbabaoğlu ''et al.''<ref name="SenbabaogluSREP" /> demonstrated that the original delta K metric for deciding <math>K</math> in the Monti algorithm performed poorly, and proposed a new, superior metric for measuring the stability of consensus matrices using their CDF curves. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, and the middle segment represents those with ambiguous assignments in different clustering runs. The proportion of ambiguous clustering (PAC) score quantifies this middle segment; it is defined as the fraction of sample pairs with consensus indices falling in the interval (u<sub>1</sub>, u<sub>2</sub>) ⊂ [0, 1], where u<sub>1</sub> is a value close to 0 and u<sub>2</sub> is a value close to 1 (for instance u<sub>1</sub>=0.1 and u<sub>2</sub>=0.9). A low value of PAC indicates a flat middle segment and a low rate of discordant assignments across permuted clustering runs. One can therefore infer the optimal number of clusters as the <math>K</math> value with the lowest PAC.<ref name="SenbabaogluSREP" /><ref name="SenbabaogluRXV" />
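As a rough sketch of the PAC criterion described above, the following Python code counts the fraction of distinct sample pairs whose consensus index lies strictly between u<sub>1</sub> and u<sub>2</sub>, and then picks the <math>K</math> with the lowest score. The function names <code>pac_score</code> and <code>best_k</code>, the nested-list matrix format, and the use of a strictly open interval are assumptions of this example, not details taken from the cited papers.

```python
def pac_score(consensus, u1=0.1, u2=0.9):
    """Fraction of distinct pairs (i < j) with u1 < C(i, j) < u2."""
    n = len(consensus)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        raise ValueError("need at least two samples to compute PAC")
    ambiguous = sum(1 for i, j in pairs if u1 < consensus[i][j] < u2)
    return ambiguous / len(pairs)


def best_k(consensus_by_k, u1=0.1, u2=0.9):
    """Return the K with the lowest PAC, plus all scores.

    consensus_by_k: dict mapping each tested K to its consensus matrix
    (hypothetical input format).
    """
    scores = {k: pac_score(c, u1, u2) for k, c in consensus_by_k.items()}
    return min(scores, key=scores.get), scores
```

A matrix containing only zeros and ones has a PAC of 0, which matches the notion of perfect stability used for the consensus matrix.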
==Related work==
#'''Fred and Jain''': They proposed to use a single linkage algorithm to combine multiple runs of the ''k''-means algorithm.<ref name="Fred Jain 2005 pp. 835–850">{{cite journal | last1=Fred | first1=Ana L.N. | last2=Jain | first2=Anil K. | title=Combining multiple clusterings using evidence accumulation | journal=IEEE Transactions on Pattern Analysis and Machine Intelligence | publisher=Institute of Electrical and Electronics Engineers (IEEE) | volume=27 | issue=6 | year=2005 | issn=0162-8828 | doi=10.1109/tpami.2005.113 | pages=835–850|pmid= 15943417| s2cid=10316033 |url=http://dataclustering.cse.msu.edu/papers/TPAMI-0239-0504.R1.pdf}}</ref>
#'''Clustering aggregation (Fern and Brodley)''': They applied the clustering aggregation idea to a collection of [[soft clustering]]s they obtained by random projections. They used an agglomerative algorithm and did not penalize for merging dissimilar nodes.{{citation needed|date=July 2020}}
#'''Dana Cristofor and Dan Simovici''': They observed the connection between clustering aggregation and clustering of categorical data. They proposed information-theoretic distance measures, and they proposed [[genetic algorithm]]s for finding the best aggregation solution.{{citation needed|date=July 2020}}
== Hard ensemble clustering ==
=== Efficient consensus functions ===
#'''Cluster-based similarity partitioning algorithm (CSPA)''': In CSPA, the similarity between two data points is defined to be directly proportional to the number of constituent clusterings of the ensemble in which they are clustered together. The intuition is that the more similar two data points are, the higher the chance that constituent clusterings will place them in the same cluster. CSPA is the simplest heuristic, but its computational and storage complexity are both quadratic in ''n''. [http://bioconductor.org/packages/release/bioc/html/SC3.html SC3] is an example of a CSPA-type algorithm.<ref>{{Cite journal|last1=Kiselev|first1=Vladimir Yu|last2=Kirschner|first2=Kristina|last3=Schaub|first3=Michael T|last4=Andrews|first4=Tallulah|last5=Yiu|first5=Andrew|last6=Chandra|first6=Tamir|last7=Natarajan|first7=Kedar N|last8=Reik|first8=Wolf|last9=Barahona|first9=Mauricio|last10=Green|first10=Anthony R|last11=Hemberg|first11=Martin|date=May 2017|title=SC3: consensus clustering of single-cell RNA-seq data|journal=Nature Methods|language=en|volume=14|issue=5|pages=483–486|doi=10.1038/nmeth.4236|issn=1548-7091|pmc=5410170|pmid=28346451}}</ref> The following two methods are computationally less expensive:
#'''Hyper-graph partitioning algorithm (HGPA)''': The HGPA algorithm takes a very different approach to finding the consensus clustering than the previous method. The cluster ensemble problem is formulated as partitioning the hypergraph by cutting a minimal number of hyperedges. It makes use of [http://glaros.dtc.umn.edu/gkhome/metis/hmetis/overview hMETIS], which is a hypergraph partitioning package.
#'''Meta-clustering algorithm (MCLA)''': The meta-clustering algorithm (MCLA) is based on clustering clusters. First, it tries to solve the cluster correspondence problem and then uses voting to place data points into the final consensus clusters. The cluster correspondence problem is solved by grouping the clusters identified in the individual clusterings of the ensemble. The clustering is performed using [http://glaros.dtc.umn.edu/gkhome/views/metis METIS] and spectral clustering.
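The CSPA similarity described above can be illustrated with a short Python sketch that counts, for every pair of data points, the fraction of constituent clusterings placing them in the same cluster. The function name <code>cspa_similarity</code> and the input format (one label list per clustering, all of length ''n'') are assumptions of this example; the subsequent step of partitioning the resulting similarity graph is not shown.

```python
def cspa_similarity(labelings):
    """Pairwise similarity as the fraction of clusterings that co-cluster a pair.

    labelings: list of r label lists, each assigning one cluster label to
    every one of the n data points (hypothetical input format).
    """
    r = len(labelings)
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    # Both the counting loop and the stored matrix are quadratic in n,
    # which is the cost noted for CSPA above.
    return [[counts[i][j] / r for j in range(n)] for i in range(n)]
```

Label values only need to be consistent within a single clustering, since the comparison never mixes labels from different runs.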
== Soft clustering ensembles ==
''Punera'' and ''Ghosh'' extended the idea of hard clustering ensembles to the soft clustering scenario. Each instance in a soft ensemble is represented by a concatenation of ''r'' posterior membership probability distributions obtained from the constituent clustering algorithms. We can define a distance measure between two instances using the [[Kullback–Leibler divergence|Kullback–Leibler (KL) divergence]], which calculates the "distance" between two probability distributions.<ref>Kunal Punera, Joydeep Ghosh. [https://web.archive.org/web/20081201150950/http://www.ideal.ece.utexas.edu/papers/2007/punera07softconsensus.pdf Consensus Based Ensembles of Soft Clusterings]</ref>
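The following Python sketch illustrates the soft-ensemble representation described above, where each instance is a list of ''r'' posterior membership distributions. It shows a KL-based distance and, for comparison, a dot-product similarity on the same concatenated vectors. How the ''r'' per-clustering divergences are combined (here, a plain sum) and the small constant used to avoid division by zero are assumptions of this example, not details taken from the cited paper; all function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps is an assumed smoothing floor that guards against zeros in q.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def soft_distance(x, y):
    """Sum of per-clustering KL divergences between two instances.

    x, y: lists of r posterior distributions, one per constituent clustering.
    Summing over clusterings is an assumption made for this sketch.
    """
    return sum(kl_divergence(p, q) for p, q in zip(x, y))


def dot_similarity(x, y):
    """Dot product of the concatenated membership vectors of two instances."""
    return sum(pi * qi for p, q in zip(x, y) for pi, qi in zip(p, q))
```

Note that the KL divergence is not symmetric, so <code>soft_distance(x, y)</code> and <code>soft_distance(y, x)</code> generally differ; a symmetrized variant could be obtained by averaging the two.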
#'''{{Proper name|sCSPA}}''': extends CSPA by calculating a similarity matrix. Each object is visualized as a point in a multidimensional space, with each dimension corresponding to the probability of its belonging to a cluster. This technique first transforms the objects into a label-space and then interprets the [[dot product]] between the vectors representing the objects as their similarity.
#'''{{Proper name|sMCLA}}''': extends MCLA by accepting soft clusterings as input. sMCLA's working can be divided into the following steps:
#* Construct Soft Meta-Graph of Clusters
#* Group the Clusters into Meta-Clusters
#* Collapse Meta-Clusters using Weighting
#* Compete for Objects
#'''{{Proper name|sHBGF}}''': HBGF represents the ensemble as a bipartite graph with clusters and instances as nodes, and edges between the instances and the clusters they belong to.<ref>Solving cluster ensemble problems by bipartite graph partitioning, Xiaoli Zhang Fern and [[Carla Brodley]], Proceedings of the twenty-first international conference on Machine learning</ref> This approach can be trivially adapted to consider soft ensembles since the graph partitioning algorithm METIS accepts weights on the edges of the graph to be partitioned. In sHBGF, the graph has ''n'' + ''t'' vertices, where ''t'' is the total number of underlying clusters.
#'''Bayesian consensus clustering (BCC)''': defines a fully [[Bayesian probability|Bayesian]] model for soft consensus clustering in which multiple source clusterings, defined by different input data or different probability models, are assumed to adhere loosely to a consensus clustering.<ref name=LockBCC>{{cite journal|last=Lock|first=E.F.|author2=Dunson, D.B. |title=Bayesian consensus clustering|journal=Bioinformatics|date=2013|doi=10.1093/bioinformatics/btt425|pmid=23990412|pmc=3789539|volume=29|number=20|pages=2610–2616|arxiv=1302.7280|bibcode=2013arXiv1302.7280L}}</ref> The full posteriors for the separate clusterings and the consensus clustering are inferred simultaneously via Gibbs sampling.
#'''Ensemble Clustering Fuzzification Means (ECF-Means)''': ECF-means is a clustering algorithm that combines different clustering results in an ensemble, obtained from different runs of a chosen algorithm ([[k-means]]), into a single final clustering configuration.<ref name=ZazzECF>{{cite journal|last=Zazzaro|first=Gaetano|author2=Martone, Angelo |title=ECF-means - Ensemble Clustering Fuzzification Means. A novel algorithm for clustering aggregation, fuzzification, and optimization |journal=IMM 2018: The Eighth International Conference on Advances in Information Mining and Management|date=2018}} [https://www.thinkmind.org/articles/immm_2018_2_10_50010.pdf]</ref>
== References ==
<references />
* Aristides Gionis, [[Heikki Mannila]], Panayiotis Tsaparas. [https://web.archive.org/web/20060828084525/http://www.cs.helsinki.fi/u/tsaparas/publications/aggregated-journal.pdf Clustering Aggregation]. 21st International Conference on Data Engineering (ICDE 2005)
* Hongjun Wang, Hanhuai Shan, Arindam Banerjee. [http://www.siam.org/proceedings/datamining/2009/SDM09_022_wangh.pdf Bayesian Cluster Ensembles]{{Dead link|date=November 2019 |bot=InternetArchiveBot |fix-attempted=yes }}, SIAM International Conference on Data Mining, SDM 09
*{{cite conference | last1=Nguyen | first1=Nam | last2=Caruana | first2=Rich | title=Seventh IEEE International Conference on Data Mining (ICDM 2007) | chapter=Consensus Clusterings | publisher=IEEE | year=2007 | pages=607–612 | doi=10.1109/icdm.2007.73 | isbn=978-0-7695-3018-5 |quote=...we address the problem of combining multiple clusterings without access to the underlying features of the data. This process is known in the literature as clustering ensembles, clustering aggregation, or consensus clustering. Consensus clustering yields a stable and robust final clustering that is in agreement with multiple clusterings. We find that an iterative EM-like method is remarkably effective for this problem. We present an iterative algorithm and its variations for finding clustering consensus. An extensive empirical study compares our proposed algorithms with eleven other consensus clustering methods on four data sets using three different clustering performance metrics. The experimental results show that the new ensemble clustering methods produce clusterings that are as good as, and often better than, these other methods.}}
* Alexander Topchy, Anil K. Jain, William Punch. [http://dataclustering.cse.msu.edu/papers/TPAMI-ClusteringEnsembles.pdf Clustering Ensembles: Models of Consensus and Weak Partitions]. IEEE International Conference on Data Mining, ICDM 03 & SIAM International Conference on Data Mining, SDM 04
[[Category:Cluster analysis]]
[[Category:NP-complete problems]]