Spectral clustering

 
==Costs==
Denoting the number of data points by <math>n</math>, it is important to estimate the memory footprint and compute time, i.e., the number of arithmetic operations (AO) performed, as a function of <math>n</math>. Regardless of the specific spectral clustering algorithm, the two main costly items are the construction of the graph Laplacian and the computation of its <math>k</math> eigenvectors for the spectral embedding. The last step, determining the labels from the <math>n</math>-by-<math>k</math> matrix of eigenvectors, is typically the least expensive, requiring only <math>kn</math> AO and creating just an <math>n</math>-by-<math>1</math> vector of the labels in memory.
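A minimal sketch of this final labeling step, assuming the rows of the <math>n</math>-by-<math>k</math> eigenvector matrix are assigned to the nearest of <math>k</math> candidate centroids (one assignment sweep of a k-means-style procedure; the random data here is purely illustrative):

```python
import numpy as np

# Hypothetical sketch of the final labeling step of spectral clustering:
# rows of the n-by-k eigenvector matrix are treated as points in R^k and
# each is assigned to the nearest of k centroids, costing on the order of
# kn arithmetic operations per sweep.
rng = np.random.default_rng(0)
n, k = 1000, 3
embedding = rng.standard_normal((n, k))  # stand-in for the spectral embedding
centroids = embedding[rng.choice(n, k, replace=False)]

# one assignment sweep over all n rows against the k centroids
dist = np.linalg.norm(embedding[:, None, :] - centroids[None, :, :], axis=2)
labels = dist.argmin(axis=1)             # the n-by-1 vector of labels
print(labels.shape)  # (1000,)
```

Only the length-<math>n</math> label vector survives this step; the embedding itself can be discarded afterwards.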
 
The need to construct the graph Laplacian is common to all distance- or correlation-based clustering methods. Computing the eigenvectors is specific to spectral clustering.
 
===Constructing graph Laplacian===
The graph Laplacian can be, and commonly is, constructed from the adjacency matrix. The construction can be performed matrix-free, i.e., without explicitly forming the matrix of the graph Laplacian and with no AO. It can also be performed in place of the adjacency matrix, without increasing the memory footprint. Either way, the cost of constructing the graph Laplacian is essentially determined by the cost of constructing the <math>n</math>-by-<math>n</math> graph adjacency matrix.
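As a sketch of the in-place construction, the unnormalized Laplacian <math>L = D - W</math> can overwrite a symmetric adjacency matrix <math>W</math> directly, so no second <math>n</math>-by-<math>n</math> array is ever allocated (the small example matrix below is purely illustrative):

```python
import numpy as np

# Assumed symmetric adjacency matrix with zero diagonal (no self-loops).
W = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])

degrees = W.sum(axis=1)                 # diagonal of the degree matrix D
W *= -1.0                               # off-diagonal entries of L are -w_ij
W[np.diag_indices_from(W)] = degrees    # diagonal entries of L are the degrees
L = W                                   # same memory as the original adjacency
```

Each row of the resulting Laplacian sums to zero, which is a standard sanity check for this construction.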
 
Naive constructions of the graph adjacency matrix make it dense, thus requiring <math>n^2</math> memory and <math>n^2</math> AO, since each of the <math>n^2</math> entries of the matrix must be computed.
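A sketch of such a naive dense construction, assuming a Gaussian (RBF) similarity between points (the kernel width <code>sigma</code> and the random data are illustrative choices, not prescribed by any particular algorithm):

```python
import numpy as np

# Naive dense adjacency: a Gaussian similarity is evaluated for every
# pair of points, so both memory and arithmetic scale as n^2.
rng = np.random.default_rng(1)
n, sigma = 500, 1.0
X = rng.standard_normal((n, 2))         # n data points in the plane

# all n^2 pairwise squared distances at once -> an n-by-n dense array
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
A = np.exp(-sq / (2.0 * sigma ** 2))
np.fill_diagonal(A, 0.0)                # no self-loops
print(A.shape)  # (500, 500)
```

Sparsifying this matrix, e.g., by keeping only each point's nearest neighbors, is the usual way to avoid the quadratic memory footprint.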
 
===Computing eigenvectors===
The cost of computing the <math>n</math>-by-<math>k</math> (with <math>k \ll n</math>) matrix of selected eigenvectors of the graph Laplacian is normally proportional to the cost of multiplying the graph Laplacian matrix by a vector, which varies greatly depending on whether the graph Laplacian matrix is dense or sparse. In the dense case the cost is thus <math>O(n^2)</math>. The cost <math>O(n^3)</math>, very commonly cited in the literature, comes from choosing <math>k=n</math> and is clearly misleading, since, e.g., in hierarchical spectral clustering <math>k=1</math>, as determined by the [[Fiedler vector]].
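A sketch of the matrix-vector-product view, assuming SciPy's Lanczos-type solver <code>eigsh</code> applied matrix-free through a <code>LinearOperator</code>: the solver only ever requests the action <math>L x</math>, so each iteration costs one Laplacian-vector product rather than <math>O(n^2)</math> dense work (the path graph below is an illustrative structured example):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

# Path graph on n vertices: interior degrees are 2, endpoint degrees are 1.
n = 100

def laplacian_matvec(x):
    # apply L = D - W matrix-free: y = D x - W x, with W x gathering
    # the values of the two path neighbors of each vertex
    y = 2.0 * x
    y[0] -= x[0]
    y[-1] -= x[-1]
    y[:-1] -= x[1:]
    y[1:] -= x[:-1]
    return y

L = LinearOperator((n, n), matvec=laplacian_matvec)

# the k = 2 smallest eigenpairs; the eigenvector for the second-smallest
# eigenvalue is the Fiedler vector used in hierarchical bipartitioning
vals, vecs = eigsh(L, k=2, which='SM', tol=1e-6, maxiter=20000)
order = np.argsort(vals)
vals, vecs = vals[order], vecs[:, order]
```

For a graph with <math>m</math> edges, each such product costs <math>O(m)</math> AO, which for sparse graphs is far below the dense <math>O(n^2)</math> per iteration.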
 
==Software==