Spectral clustering

[[Image:6n-graf.svg|thumb|150px|An example connected graph, with 6 vertices.]]
[[File:6n-graf2.svg|thumb|150px|Partitioning into two connected graphs]]
In [[multivariate statistics]], '''spectral clustering''' techniques make use of the [[Spectrum of a matrix|spectrum]] ([[eigenvalues]]) of the [[similarity matrix]] of the data to perform [[dimensionality reduction]] before [[Cluster analysis|clustering]] in fewer dimensions. The similarity [[Matrix (mathematics)|matrix]] is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.
 
In application to image segmentation, spectral clustering is known as [[segmentation-based object categorization]].
== Definitions ==
 
Given an enumerated set of data points, the [[similarity matrix]] may be defined as a symmetric matrix <math>A</math>, where <math>A_{ij}\geq 0</math> represents a measure of the similarity between data points with indices <math>i</math> and <math>j</math>. The general approach to spectral clustering is to use a standard [[Cluster analysis|clustering]] method (there are many such methods, ''k''-means is discussed [[#Relationship with k-means|below]]) on relevant [[eigenvector]]s of a [[Laplacian matrix]] of <math>A</math>. There are many different ways to define a Laplacian which have different mathematical interpretations, and so the clustering will also have different interpretations. The relevant eigenvectors are those corresponding to the several smallest eigenvalues of the Laplacian, except for the smallest eigenvalue, which will have a value of 0. For computational efficiency, these eigenvectors are often computed as the eigenvectors corresponding to the largest several eigenvalues of a function of the Laplacian.
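
As a concrete illustration of this pipeline (not part of the formal definition), the following Python sketch assumes a small data matrix and an RBF similarity with illustrative parameter choices; it builds the similarity matrix, forms the unnormalized Laplacian, and runs ''k''-means on the eigenvectors of the smallest eigenvalues, discarding the trivial eigenvector of the eigenvalue 0.
<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

# Illustrative data: two well-separated Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 0.3, size=(50, 2))])

A = rbf_kernel(X, gamma=1.0)       # symmetric similarity matrix, A_ij >= 0
np.fill_diagonal(A, 0.0)           # no self-similarity (self-loops)
D = np.diag(A.sum(axis=1))         # degree matrix
L = D - A                          # unnormalized graph Laplacian

k = 2                              # number of clusters (illustrative choice)
# Eigenvectors of the k smallest eigenvalues; the eigenvector of the
# eigenvalue 0 is constant and is discarded before clustering.
vals, vecs = eigh(L, subset_by_index=[0, k - 1])
embedding = vecs[:, 1:]            # drop the trivial constant eigenvector

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
</syntaxhighlight>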
 
===[[Laplacian matrix]]===
If the similarity matrix <math>A</math> has not already been explicitly constructed, the efficiency of spectral clustering may be improved if the solution to the corresponding eigenvalue problem is performed in a [[Matrix-free methods|matrix-free fashion]] (without explicitly manipulating or even computing the similarity matrix), as in the [[Lanczos algorithm]].
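
A minimal sketch of the matrix-free idea, under the assumption of an RBF similarity whose rows can be regenerated on demand: only the action <math>v \mapsto Lv</math> of the Laplacian is supplied to SciPy's iterative (Lanczos-type) eigensolver, so the <math>n \times n</math> similarity matrix is never stored. The data, kernel width, and block size below are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))   # illustrative data
gamma = 1.0
n = X.shape[0]

def laplacian_matvec(v, block=200):
    """Apply L v = D v - A v, regenerating rows of the RBF similarity
    A_ij = exp(-gamma * ||x_i - x_j||^2) in blocks so that the full
    n-by-n matrix A is never stored."""
    v = np.asarray(v, dtype=float).ravel()
    out = np.empty(n, dtype=float)
    for start in range(0, n, block):
        rows = X[start:start + block]
        sq = ((rows[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        A_block = np.exp(-gamma * sq)
        local = np.arange(A_block.shape[0])
        A_block[local, start + local] = 0.0   # zero the diagonal entries of this block
        out[start:start + block] = A_block.sum(axis=1) * v[start:start + block] - A_block @ v
    return out

L = LinearOperator((n, n), matvec=laplacian_matvec, dtype=float)

# Lanczos iteration (ARPACK) for eigenvectors of the smallest eigenvalues,
# using only the matrix-vector product defined above.
vals, vecs = eigsh(L, k=3, which='SA')
</syntaxhighlight>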
 
For large-sized graphs, the second eigenvalue of the (normalized) graph [[Laplacian matrix]] is often [[ill-conditioned]], leading to slow convergence of iterative eigenvalue solvers. [[Preconditioner#Preconditioning for eigenvalue problems|Preconditioning]] is a key technology accelerating the convergence, e.g., in the matrix-free [[LOBPCG]] method. Spectral clustering has been successfully applied on large graphs by first identifying their [[community structure]], and then clustering communities.<ref>{{cite journal|last1=Zare|first1=Habil |first2=P. |last2=Shooshtari |first3=A. |last3=Gupta |first4=R. |last4=Brinkman|title=Data reduction for spectral clustering to analyze high throughput flow cytometry data|journal=BMC Bioinformatics|date=2010|doi=10.1186/1471-2105-11-403|volume=11|article-number=403 |pmid=20667133 |pmc=2923634 |doi-access=free }}</ref>
 
Spectral clustering is closely related to [[nonlinear dimensionality reduction]], and dimension reduction techniques such as locally-linear embedding can be used to reduce errors from noise or outliers.

Moreover, a normalized Laplacian has exactly the same eigenvectors as the normalized adjacency matrix, but with the order of the eigenvalues reversed. Thus, instead of computing the eigenvectors corresponding to the smallest eigenvalues of the normalized Laplacian, one can equivalently compute the eigenvectors corresponding to the largest eigenvalues of the normalized adjacency matrix, without reference to the Laplacian matrix.
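
For the symmetrically normalized Laplacian <math>L^{\text{sym}} = I - D^{-1/2} A D^{-1/2}</math>, where <math>D</math> is the diagonal degree matrix, this relationship follows directly from the definition:
<math display="block">L^{\text{sym}} v = \lambda v \iff \left(D^{-1/2} A D^{-1/2}\right) v = (1 - \lambda)\, v,</math>
so the two matrices share their eigenvectors, and the eigenvectors for the smallest eigenvalues of <math>L^{\text{sym}}</math> are exactly the eigenvectors for the largest eigenvalues of the normalized adjacency matrix <math>D^{-1/2} A D^{-1/2}</math>.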
 
Naive constructions of the graph [[adjacency matrix]], e.g., using the RBF kernel, make it dense, thus requiring <math>n^2</math> memory and <math>n^2</math> arithmetic operations to compute all <math>n^2</math> entries of the matrix. The Nyström method<ref>{{Cite journal|last=Fowlkes|first=C|date=2004|title=Spectral grouping using the Nystrom method.|url=https://escholarship.org/uc/item/29z29233|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|volume=26|issue=2|pages=214–25|doi=10.1109/TPAMI.2004.1262185|pmid=15376896|bibcode=2004ITPAM..26..214F|s2cid=2384316}}</ref> can be used to approximate the similarity matrix, but the approximate matrix is not elementwise positive,<ref>{{Cite journal|first1=S. |last1=Wang |first2=A. |last2=Gittens |first3=M.W. |last3=Mahoney|year=2019|title=Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds|journal=Journal of Machine Learning Research|volume=20|pages=1–49|arxiv=1706.02803}}</ref> i.e. it cannot be interpreted as a distance-based similarity.
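
As a rough sketch of the Nyström idea (the kernel width and the number of landmark components below are illustrative assumptions, and scikit-learn's <code>Nystroem</code> transformer is only one possible implementation), the <math>n \times n</math> kernel matrix is replaced by a low-rank factor so that kernel-matrix-vector products cost <math>O(nm)</math> instead of <math>O(n^2)</math>:
<syntaxhighlight lang="python">
import numpy as np
from sklearn.kernel_approximation import Nystroem

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))            # illustrative data

# Nystrom approximation of the RBF kernel: K ~= Z @ Z.T, where only the
# n-by-m factor Z (m = 200 landmark components) is ever stored.
nystroem = Nystroem(kernel='rbf', gamma=0.5, n_components=200, random_state=0)
Z = nystroem.fit_transform(X)

# A kernel-matrix-vector product K @ v is applied as Z @ (Z.T @ v) in O(n*m).
v = rng.normal(size=X.shape[0])
Kv_approx = Z @ (Z.T @ v)

# Note: Z @ Z.T is positive semidefinite, but its entries need not be
# nonnegative, so it cannot always be read as a distance-based similarity.
</syntaxhighlight>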
 
Algorithms to construct the graph adjacency matrix as a [[sparse matrix]] are typically based on a [[nearest neighbor search]], which estimates or samples a neighborhood of a given data point and computes non-zero entries of the adjacency matrix by comparing only pairs of the neighbors. The number of the selected nearest neighbors thus determines the number of non-zero entries, and is often fixed so that the memory footprint of the <math>n</math>-by-<math>n</math> graph adjacency matrix is only <math>O(n)</math>, only <math>O(n)</math> sequential arithmetic operations are needed to compute the <math>O(n)</math> non-zero entries, and the calculations can be trivially run in parallel.
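
A minimal sketch of such a construction with scikit-learn's nearest-neighbor graph builder (the neighbor count is an illustrative choice): each point is connected only to a fixed number of neighbors, so the adjacency matrix has <math>O(n)</math> non-zero entries.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))            # illustrative data

# Sparse adjacency: each point is connected to its 10 nearest neighbors only,
# giving roughly 10 * n non-zero entries instead of n^2.
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity', include_self=False)
A = 0.5 * (A + A.T)                        # symmetrize; k-NN relations are not symmetric

print(A.nnz, "non-zero entries for", X.shape[0], "points")
</syntaxhighlight>
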
The cost of computing the <math>n</math>-by-<math>k</math> (with <math>k\ll n</math>) matrix of selected eigenvectors of the graph Laplacian is normally proportional to the cost of multiplication of the <math>n</math>-by-<math>n</math> graph Laplacian matrix by a vector, which varies greatly depending on whether the graph Laplacian matrix is dense or sparse. For the dense case the cost is thus <math>O(n^2)</math>. The cost of <math>O(n^3)</math> commonly cited in the literature comes from choosing <math>k=n</math> and is clearly misleading, since, e.g., in hierarchical spectral clustering <math>k=1</math>, as determined by the [[Fiedler vector]].
 
In the sparse case of the <math>n</math>-by-<math>n</math> graph Laplacian matrix with <math>O(n)</math> non-zero entries, the cost of the matrix-vector product and thus of computing the <math>n</math>-by-<math>k</math> (with <math>k\ll n</math>) matrix of selected eigenvectors is <math>O(n)</math>, with the memory footprint also only <math>O(n)</math> — both are the optimal lower bounds of complexity of clustering <math>n</math> data points. Moreover, matrix-free eigenvalue solvers such as [[LOBPCG]] can efficiently run in parallel, e.g., on multiple [[GPUs]] with distributed memory, resulting not only in high quality clusters, which spectral clustering is famous for, but also in top performance.<ref name="msw2014">{{Cite journal|last1=Acer|first1=Seher|last2=Boman|first2=Erik G.|last3=Glusa|first3=Christian A.|last4=Rajamanickam|first4=Sivasankaran|year=2021|title=Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems|journal=Parallel Computing|volume=106|article-number=102769 |doi=10.1016/j.parco.2021.102769|s2cid=233481603 |arxiv=2105.00578}}</ref>
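
A sketch of this sparse pipeline with SciPy's LOBPCG solver (the neighbor count, cluster count, tolerance and random initial block are illustrative assumptions; a preconditioner can additionally be passed via the <code>M</code> argument to accelerate convergence):
<syntaxhighlight lang="python">
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import lobpcg
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 3))                      # illustrative data

A = kneighbors_graph(X, n_neighbors=10, mode='connectivity', include_self=False)
A = 0.5 * (A + A.T)                                  # sparse symmetric adjacency, O(n) non-zeros
L = csgraph.laplacian(A, normed=True)                # sparse normalized graph Laplacian

k = 4                                                # number of clusters (illustrative choice)
X0 = rng.normal(size=(L.shape[0], k))                # random initial block of k vectors
# LOBPCG uses only sparse matrix-vector products, so each iteration costs O(n).
eigenvalues, eigenvectors = lobpcg(L, X0, largest=False, tol=1e-4, maxiter=200)

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(eigenvectors)
</syntaxhighlight>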
 
==Software==
Free software implementing spectral clustering is available in large open source projects like [[scikit-learn]]<ref>{{Cite web|url=http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering|title = 2.3. Clustering}}</ref> using [[LOBPCG]]<ref>{{Cite conference | url = https://www.researchgate.net/publication/343531874 | title = Modern preconditioned eigensolvers for spectral image segmentation and graph bisection | conference = Clustering Large Data Sets; Third IEEE International Conference on Data Mining (ICDM 2003) Melbourne, Florida: IEEE Computer Society| editor = Boley| editor2 = Dhillon| editor3 = Ghosh| editor4 = Kogan | pages = 59–62| year = 2003| last1 = Knyazev| first1 = Andrew V.}}</ref> with [[multigrid]] [[preconditioning]]<ref name="spectralmultigrid2006">{{Cite conference | url = https://www.researchgate.net/publication/354448354 | title = Multiscale Spectral Image Segmentation Multiscale preconditioning for computing eigenvalues of graph Laplacians in image segmentation | conference = Fast Manifold Learning Workshop, WM, Williamsburg, VA| year = 2006| last1 = Knyazev| first1 = Andrew V. | doi=10.13140/RG.2.2.35280.02565}}</ref><ref>{{Cite conference | url = https://www.researchgate.net/publication/343531874 | title = Multiscale Spectral Graph Partitioning and Image Segmentation | conference = Workshop on Algorithms for Modern Massive Datasets, Stanford University and Yahoo! Research| year = 2006| last1 = Knyazev| first1 = Andrew V.}}</ref> or [[ARPACK]], [[Apache Spark#MLlib Machine Learning Library|MLlib]] for pseudo-eigenvector clustering using the [[power iteration]] method,<ref>{{Cite web|url=http://spark.apache.org/docs/latest/mllib-clustering.html#power-iteration-clustering-pic|title = Clustering - RDD-based API - Spark 3.2.0 Documentation}}</ref> and [[R (programming language)|R]].<ref>{{Cite web|url=https://cran.r-project.org/web/packages/kernlab|title = Kernlab: Kernel-Based Machine Learning Lab|date = 12 November 2019}}</ref>
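
For example, in scikit-learn the whole pipeline is available through the <code>SpectralClustering</code> estimator (the parameter values below are illustrative; <code>eigen_solver='amg'</code> requires the optional pyamg package):
<syntaxhighlight lang="python">
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaved half-circles: a standard example where spectral clustering
# succeeds and plain k-means on the raw coordinates fails.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

model = SpectralClustering(
    n_clusters=2,
    affinity='nearest_neighbors',   # sparse k-NN affinity graph
    n_neighbors=10,
    eigen_solver='lobpcg',          # alternatives: 'arpack' or 'amg' (LOBPCG with multigrid)
    assign_labels='kmeans',         # final clustering step on the spectral embedding
    random_state=0,
)
labels = model.fit_predict(X)
</syntaxhighlight>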
 
== Relationship with other clustering methods ==
 
=== Relationship with ''k''-means ===
Spectral clustering is closely related to the '''k-means''' algorithm, especially in how cluster assignments are ultimately made. Although the two methods differ fundamentally in their initial formulations—spectral clustering being graph-based and k-means being centroid-based—the connection becomes clear when spectral clustering is viewed through the lens of '''kernel methods'''.
 
In particular, '''weighted kernel k-means''' provides a key theoretical bridge between the two. Kernel k-means is a generalization of the standard k-means algorithm, where data is implicitly mapped into a high-dimensional feature space through a kernel function, and clustering is performed in that space. Spectral clustering, especially the normalized versions, performs a similar operation by mapping the input data (or graph nodes) to a lower-dimensional space defined by the '''eigenvectors of the graph Laplacian'''. These eigenvectors correspond to the solution of a '''relaxation''' of the '''normalized cut''' or other graph partitioning objectives.
 
Mathematically, the objective function minimized by spectral clustering can be shown to be equivalent to the objective function of weighted kernel k-means in this transformed space. This was formally established by Dhillon, Guan and Kulis,<ref name="dhillon2004kernel">{{cite conference |last1=Dhillon |first1=I.S. |last2=Guan |first2=Y. |last3=Kulis |first3=B. |year=2004 |title=Kernel ''k''-means: spectral clustering and normalized cuts |url=https://www.cs.utexas.edu/users/inderjit/public_papers/kdd_spectral_kernelkmeans.pdf |pages=551–6 |book-title=Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining}}</ref> who demonstrated that normalized cuts are equivalent to a weighted version of kernel k-means applied to the rows of the normalized Laplacian’s eigenvector matrix.
 
Because of this equivalence, '''spectral clustering can be viewed as performing kernel k-means in the eigenspace defined by the graph Laplacian'''. This theoretical insight has practical implications: the final clustering step in spectral clustering typically involves running the '''standard k-means algorithm''' on the rows of the matrix formed by the first k eigenvectors of the Laplacian. These rows can be thought of as embedding each data point or node in a low-dimensional space where the clusters are better separated and hence easier for k-means to detect.
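
A minimal sketch of this final step for a normalized-Laplacian variant, assuming a dense affinity matrix <code>A</code> with no isolated vertices has already been constructed (the row normalization follows the common Ng–Jordan–Weiss recipe and is an illustrative choice):
<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_kmeans(A, k):
    """Cluster from a dense affinity matrix A by running standard k-means
    on the rows of the matrix of the first k eigenvectors of the
    symmetrically normalized Laplacian."""
    d = A.sum(axis=1)                                  # degrees (assumed strictly positive)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

    # n-by-k matrix of eigenvectors of the k smallest eigenvalues.
    _, U = eigh(L_sym, subset_by_index=[0, k - 1])
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # normalize each row (one row per data point)

    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
</syntaxhighlight>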
 
Additionally, '''multi-level methods''' have been developed to directly optimize this shared objective function. These methods work by iteratively '''coarsening''' the graph to reduce problem size, solving the problem on a coarse graph, and then '''refining''' the solution on successively finer graphs. This leads to more efficient optimization for large-scale problems, while still capturing the global structure preserved by the spectral embedding.<ref>{{cite journal |last1=Dhillon |first1=Inderjit |last2=Guan |first2=Yuqiang |last3=Kulis |first3=Brian |date=November 2007 |title=Weighted Graph Cuts without Eigenvectors: A Multilevel Approach |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=29 |issue=11 |pages=1944–1957 |citeseerx=10.1.1.131.2635 |doi=10.1109/tpami.2007.1115 |pmid=17848776 |bibcode=2007ITPAM..29.1944D |s2cid=9402790}}</ref>
 
=== Relationship to DBSCAN ===
Spectral clustering is also conceptually related to '''DBSCAN''' (Density-Based Spatial Clustering of Applications with Noise), particularly in the special case where the spectral method is used to identify [[Connected component (graph theory)|'''connected graph components''']] of a graph. In this trivial case, where the goal is to identify subsets of nodes with '''no interconnecting edges''' between them, the spectral method effectively reduces to a connectivity-based clustering approach, much like [[DBSCAN]].<ref>{{Cite conference |last1=Schubert |first1=Erich |last2=Hess |first2=Sibylle |last3=Morik |first3=Katharina |date=2018 |title=The Relationship of DBSCAN to Matrix Factorization and Spectral Clustering |url=http://ceur-ws.org/Vol-2191/paper38.pdf |conference=LWDA |pages=330–334}}</ref>
 
DBSCAN operates by identifying '''density-connected regions''' in the input space: points that are reachable from one another via a sequence of neighboring points within a specified radius (ε), where each such neighborhood contains a minimum number of points (minPts). The algorithm excels at discovering clusters of arbitrary shape and separating out noise without needing to specify the number of clusters in advance.
 
In spectral clustering, when the similarity graph is constructed using a '''hard connectivity criterion''' (i.e., binary adjacency based on whether two nodes are within a threshold distance), and no normalization is applied to the Laplacian, the resulting eigenstructure of the graph Laplacian directly reveals '''disconnected components''' of the graph. This mirrors DBSCAN's ability to isolate '''density-connected components'''. The eigenvectors of the unnormalized Laplacian associated with the eigenvalue zero correspond to these components, with one such eigenvector per connected component.
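
This can be verified directly on a small example: for the unnormalized Laplacian of a graph with two disconnected components (the toy adjacency matrix below is an illustrative assumption), the multiplicity of the eigenvalue 0 equals the number of connected components.
<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import eigh
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy graph with two disconnected components: {0, 1, 2} and {3, 4}.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

L = np.diag(A.sum(axis=1)) - A                    # unnormalized graph Laplacian
eigenvalues = eigh(L, eigvals_only=True)

n_zero_eigenvalues = int(np.sum(np.isclose(eigenvalues, 0.0)))
n_components, _ = connected_components(csr_matrix(A), directed=False)
print(n_zero_eigenvalues, n_components)           # prints: 2 2
</syntaxhighlight>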
 
This connection is most apparent when spectral clustering is used not to optimize a soft partition (like minimizing the normalized cut), but to '''identify exact connected components'''—which corresponds to the most extreme form of “density-based” clustering, where only directly or transitively connected nodes are grouped together. Therefore, spectral clustering in this regime behaves like a '''spectral version of DBSCAN''', especially in sparse graphs or when constructing ε-neighborhood graphs.
 
While DBSCAN operates directly in the data space using density estimates, spectral clustering transforms the data into an eigenspace where '''global structure and connectivity''' are emphasized. Both methods are non-parametric in spirit, and neither assumes convex cluster shapes, which further supports their conceptual alignment.
 
== Measures to compare clusterings ==