A lot of my ideas about Machine Learning come from Quantum Mechanical Perturbation Theory. To provide some context, we need to step back and understand that the familiar techniques of Machine Learning, like Spectral Clustering, are, in fact, nearly identical to Quantum Mechanical Spectroscopy. As usual, this will take several blogs.
Here, I give a brief tutorial on the theory of Spectral Clustering and how it is implemented in open source packaages
At some point I will rewrite some of this and add a review of this recent paper Robust and Scalable Graph-Based Semisupervised Learning
Spectral (or Subspace) Clustering
The goal of spectral clustering is to cluster data that is connected but not lnecessarily compact or clustered within convex boundaries
The basic idea:
- project your data into
- define an Affinity matrix , using a Gaussian Kernel or say just an Adjacency matrix (i.e.
- construct the Graph Laplacian from (i.e. decide on a normalization)
- solve an Eigenvalue problem , such as (or a Generalized Eigenvalue problem )
- select k eigenvectors corresponding to the k lowest (or highest) eigenvalues , to define a k-dimensional subspace
- form clusters in this subspace using, say, k-means
Sounds simple enough. Lets dig into the details a bit to see what works.
Affinities and Similarities
What is an Affinity? It is a metric that determines how close, or Similar, two points our in our space. We have to guess the metric, or we have to learn the metric. For now, we will guess the metric. Notice it is not the standard Euclidean metric. The next simplest thing is, of course, a Gaussian Kernel.
Given 2 data points (projected in ), we define an Affinity that is positive, symmetric, and depends on the Euclidian distance between the data points
We might provide a hard cut off , so that
and in some applications, we pre-multiply with a feature Affinity Kernel (see Shi and Malik’s approach to image segmentation).
( We may also learn the Affinity Matrix from the data. For example, see Learning Spectral Clustering by Bach and Jordan http://www.di.ens.fr/~fbach/nips03_cluster.pdf )
Of course, when the points are close in , and if the points are far apart. Close data points are in the same cluster. Data points in different clusters are far away. But data points in the same cluster may also be far away–even farther away than points in different clusters. Our goal then is to transform the space so that when 2 points are close, they are always in same cluster, and when they are far apart, they are in different clusters.
Generally we use the Gaussian Kernel directly, or we form the Graph Laplacian . Usually the Graph Laplacian is described as a discrete approximation to the Laplacian from Physics, and or the Laplace-Beltrami operator.
Here, I want to point out the obvious relation between the Gaussian Kernel and a simple Adjacency Matrix. Taking the lowest order Taylor expansion for the Gaussian Kernel, we get
The unnormalized Graph Laplacian is defined as the difference of 2 matrices
where is the diagonal Degree matrix, defined below, assigned to the graph vertices, and is a matrix of positive Weights assigned to the graph edges. Let us evaluate the expectation value of with vector (Proposition 1, von Luxburg):
When using an Adjacency Matrix, the weights are all 1:
So, apart from the diagonal and some normalization, we see that the Graph Laplacian for an Adjacency Matrix s a low order approximation to the Gaussian Kernel, which is a good approximation when are in the same cluster.
In the real world, one usually applies a Gaussian or Heat kernel in order to account for irregularities in the data, as in the SciKit learn package.
There are lots of Laplacians in the literature. Normalized, Generalized, Relaxed, etc.
They all share the Degree matrix . The Degree matrix is a diagonal matrix that measures the degree at each node
being an old physical chemist, I think of this as a local (nearest neighbor) mean-field average of the Affinity; is a normalization factor for so that the cluster Affinities are balanced across different clusters. This is easier to see in a Graph Cut formulation (later)
We can now define the Laplacian and outline some variations
The main differences are how one normalizes the data and sets the diagonal of
- Simple Laplacian
- Normalized Laplacian
- Generalized Laplacian
- Relaxed Laplacian
- Ng, Jordan, & Weiss Laplacian , and where
- and we note the related, smoothed Kernel for Kmeans Kernel Clustering
What are the main differences? Left multiplying by a diagonal matrix is akin to scaling the rows, but right multiplying by a diagonal matrix is akin to scaling the columns. The generalized Laplacian results in a right-stochastic Markov matrix ; the normalized Laplacian does not.
I might have missed some variations since clearly there is a lot of leeway and creativity in defining a Laplacian for clustering. In a future post I will explain how these Graph Laplacians are related to the classical Laplacian from physics. The Von Luxburg review attempts to clean this up, and I may re-write this section based on his review.
The Cluster Eigenspace Problem
If good clusters can be identified, then the Laplacian is approximately block-diagonal, with each block defining a cluster. So, if we have 3 major clusters , we would expect
Where corresponds to subblock for cluster , etc. These blocks let us identify clusters with non-convex boundaries, as shown below
We also expect that the 3 lowest eigenvalues & eigenvectors of each correspond to a different cluster. This occurs when each eigenvector corresponds to the lowest eigenvector of some subblock of . That is, if
are the lowest eigenvalue, eigenvector pairs in the full space, and
is the lowest eigenvalue , eigenvector pair for block C1,
then is a good approximation to one of the lowest 3 . Likewise for subblocks C2 and C3.
More importantly, this also restricts the eigenvalue spectrum of , so that the set lowest 3 full space eigenvalues consists of the lowest subblock eigenvalues
Also, to identify k clusters, the eigenvalue spectrum of must have a gap, as shown below (with 4, not 3 eigenvalues, sorry)
Frequently this gap is hard to find, and choosing the optimal k is called “rounding”
I am being a bit sloppy to get the general ideas across. Technically, running Kmeans in the subspace is not exactly the same as identifying each eigenvector with a specific cluster. Indeed, one might imagine using a slightly larger subspace than necessary, and only extracting the k clusters desired. The subtly here is getting choosing the right Affinity ( matrix cutoff R and all), the right size of the subspace, and the right normalization (both of the Laplacian and the eigenvectors themselves, before or after diagonalization) You should refer to the original papers, the reviews, and whatever open source code you are using for very specific details.
Recently (well, kinda, 2008), these ideas have been formalized in L. ROSASCO, M. BELKIN, E. DE VITOA NOTE ON PERTURBATION RESULTS FOR LEARNING, EMPIRICAL OPERATORS
using some perturbation theory and the Resolvent Operators introduced earlier…which gives you an idea of what I like to read just for fun.
[ In the language of Quantum Mechanical Perturbation Theory, we would say that the eigenvectors of the subblocks form a Model Space that approximates the true low lying (or high lying) Eigenspace of . This will come up later ]
Constraints on Real World Data
Below we depict the block structure of the Affinity matrix for good and poor cases of the Similarity metric
Notice that the good Similarity / Affinity matrix is itself block diagonal when 2 clear clusters can be identified
Open Source Implementations
The Python toolkit Scikit Learn has an implementation of spectral clustering.
Rather than review this, I just want to comment on the 2 examples because neither actually demonstrate where the method is most useful.
The first example is simply to identify 4 overlapping circular clusters. Here, even simple Kmeans would probably be fine because the clusters are compact. They argue the point is to separate the clusters from the background, but, again, this is a simple problem.
Here, the method does not do much because simple Spectral Clustering can not ‘learn’ from other examples. Here really one wants a supervised or semi-supervised method such as SemiSupervised Learning using Laplacian / Manifold Regularization or even Deep Learning that can identify faces based on a large training sample of many images of many different faces.
The power of Spectral Clustering is to identify non-compact clusters in a single data set (see images above)
The constraint on the eigenvalue spectrum also suggests, at least to this blogger, Spectral Clustering will only work on fairly uniform datasets–that is, data sets with N uniformly sized clusters. (This is also briefly mentioned by Von Luxburg , where he suggests always using the Generalized Laplacian.) Otherwise, there is no guarantee that the full space and subblock eigenspectrums will line up so nicely.
Some time later, I will take a look at some real world data, look at Andrew Ng’s old argument about matrix perturbation theory, and then explain how to apply matrix perturbation theory to (hopefully) deal the cases of a Poor similarity measure (right side above) .
But first, we take a look at some more modern work by Shi, Belkin, and Yu, called Data Spectroscopy. We will see that Spectral Clustering with an RBF Kernel is, in essence, just like Quantum Mechanical Spectroscopy where the data points are Quantum Mechanical Harmonic Oscillators!
Then we will look at generalizations of Spectral Clustering such as using the Graph Laplacian as a Regularizer for SemiSupervised Manifold Learning.
Von Luxburg, A Tutorial on Spectral Clustering, http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/Luxburg07_tutorial_4488%5B0%5D.pdf
Filippone, Camastra, Masulli, & Rovetta ,A survey of kernel and spectral methods for clustering, http://lagis-vi.univ-lille1.fr/~lm/classpec/publi_classif/A_survey_of_kernel_and_spectral_methods_for_clustering_PR_2008.pdf
Smola and Kondor, Kernels and Regularization on Graphs http://enpub.fulton.asu.edu/cseml/07spring/kenel_graphs.pdf
and one of the more “popular” papers
Ng, Jordan, and Weiss, On Spectral Clustering: Analysis and an Algorithm, http://ai.stanford.edu/~ang/papers/nips01-spectral.pdf