Metric Learning: Some Quantum Statistical Mechanics

I wrote this for my buddy Sebass; just a quick review of quantum stat mech:

In our last post, we presented a formalism for music recommendations called Logistic Markov Embedding (LME). This technique uses some mathematics that arises in other modern machine learning methods, such as Maximum Entropy (MaxEnt) methods and the Restricted Boltzmann Machines (RBMs) of Deep Learning.

In this post, we are going to take a step back and understand where the idea for a metric embedding comes from, and the mathematical formalism employed by Joachims. The Metric Embedding in LME is a kind of Energy model, an idea from Quantum Statistical Mechanics. So we will review the pedagogic formalism of Statistical Mechanics for Thermodynamics.

We will follow the classic text by T. Hill, one of the first books I ever read in college. We will not look at more abstract versions of Quantum Statistical Mechanics, because we don't need the phase space formulation until we look at time-dependent processes like Causality (but eventually we will get there). For a deeper view of this subject, we defer to the classic text by Jaynes, Probability Theory: The Logic of Science.

Probabilities and the Partition Functions

Frequently we see expressions for probabilities, as in the LME approach,  written as:

$Pr(p)=\prod_{i}\dfrac{\exp(-\Delta(p^{i},p^{i-1})^{2})}{Z(p^{i-1})}$

where $Z(p^{i})$ is the partition function, given by

$Z(p^{i})=\sum_{j=1}^{|S|}\exp(-\Delta(p^{i},s_{j})^{2})$

Naively, we can just think that we are defining the probability in an abstract way, and that $Z(p^{i})$ is just the normalization. But there is more to it. In fact, it is deeply connected to the Boltzmann / Gibbs Distribution.
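As a concrete sketch, the LME transition probability above is just a softmax over negative squared distances in the embedding space. The 2D points and the helper name here are made up for illustration; this is not Joachims' actual code:

```python
import numpy as np

def transition_probs(X, i):
    """P(j | i) = exp(-||X[j] - X[i]||^2) / Z(i),
    where Z(i) sums the weights over every song j."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)   # squared distances Delta^2
    w = np.exp(-d2)                        # unnormalized weights
    return w / w.sum()                     # divide by Z(i)

# three songs embedded in the plane; song 1 sits closer to song 0 than song 2 does
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
p = transition_probs(X, 0)
```

Nearby songs get most of the probability mass, and $Z(p^{i})$ is exactly the sum that turns the weights into a distribution.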

So why in the world do we do this?

Convex and non-Convex Optimization

In simpler machine learning algorithms like SVMs and Logistic Regression, we can readily solve the optimization problems using Stochastic Gradient Descent (SGD), because these are convex optimization problems.

Below we see two optimization surfaces: our goal is to find the global minimum.   The top problem is easy to solve, and just about any decent descent method will do (unless we need sparse solutions, which makes this a little trickier).  The bottom problem, however, is filled with hills and bumps, and even looking at it, we are not sure where the bottom is.

So how do we find it?  Typical solvers for non-convex problems include Genetic Algorithms, Convex-Concave/DC programming,  BackProp (for ANNs), and Simulated Annealing.

In Simulated Annealing, we run the solver until it gets stuck. We then heat the system, letting it jump over local humps. We then cool it down and run the solver again until it stops. We repeat this over and over until we find the global minimum. A common application is the simulation of liquids and crystals. To do this, we need to define a temperature.
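Here is a minimal sketch of the idea. The bumpy objective function and all the parameter values (initial temperature, cooling rate, step count) are invented for illustration:

```python
import math
import random

def simulated_annealing(f, x0, T0=2.0, cooling=0.999, steps=2000, seed=0):
    """Minimize f by proposing random jumps; uphill moves are accepted
    with probability exp(-dE / T), and T is gradually lowered."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    T = T0
    for _ in range(steps):
        x_new = x + rng.gauss(0, 1)          # propose a random jump
        dE = f(x_new) - fx
        if dE < 0 or rng.random() < math.exp(-dE / T):
            x, fx = x_new, fx + dE           # accept the move
        if fx < fbest:
            best, fbest = x, fx
        T *= cooling                          # cool the system
    return best, fbest

# a bumpy, non-convex objective with its global minimum f(0) = 0
def bumpy(x):
    return x ** 2 + 3 * math.sin(5 * x) ** 2

x_star, f_star = simulated_annealing(bumpy, x0=4.0)
```

At high temperature the `exp(-dE / T)` test accepts almost any jump, so the solver hops over local humps; as `T` falls it behaves more and more like plain descent.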

In traditional probability theory, we require the individual probabilities sum to 1:

$\sum_{i} P_{i}=1$

Of course this is always true, but we don't have to enforce it all the time; we can relax this constraint and simply enforce it at the end of the problem. This is exactly what we do in Simulated Annealing. Instead of creating probability models, we create Energy Models. The energies are essentially negative logs of unnormalized probabilities

$E_{j}\sim-\log(P_j)$

This is what we see in LME. We only require the total energy $E_t$ to be constant, not 1. This means that when we compute $P_j$

$P_{j}=\dfrac{\exp(-\beta E_j)}{Z}$

we need the normalization factor $Z$

$Z=\sum_{i} \exp(-\beta E_i)$

Modeling with Energies

Let's say we have a bunch of objects that we would like to distribute (or cluster) into boxes, sets, or sequences in some optimal way. We might be placing balls into boxes, assigning documents to clusters, or finding personalized playlists. We want to find the optimal way of doing this through some constrained optimization problem. Imagine we have a container partitioned into small, disjoint boxes or sets $j$. We have

1.  the total number of objects $\mathcal{N}$ is fixed. For us, this is the total number of songs.
2.  the total 'Volume' $V$ is fixed. This will come up later when we create the metric embedding.

We wanted to find the optimal probability of a song appearing in a playlist. Here, more generally, we seek the probability $P_{j}$ of putting an object into box $j$, subject to our prior knowledge of the problem. In other approaches, we might define a conditional probability for each box. Here, we say each box has an energy $E_j$. We require that the total energy $E_{t}$ of the system is always fixed. We can place $n_{j}$ objects into each box $j$. We only require

$\sum_{j}n_{j}=\mathcal{N}$

Each combination, or set of numbers, $\{n_1,n_2,\cdots\}$ constitutes a distribution. There are many ways of realizing a given distribution; the total number of combinations, $\Omega_{t}(n)$, is given by the multinomial coefficient

$\Omega_{t}(n)=\dfrac{\left(n_{1}+n_{2}\cdots\right)!}{n_{1}!n_{2}!\cdots}$

$\Omega_{t}(n)=\dfrac{\mathcal{N}!}{\prod_{j}n_{j}!}$
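This count is easy to compute directly. A small sketch (the occupation numbers are just examples; in particular, the A, B, C, D example that follows corresponds to boxes 1, 2, 3 holding $n = (1, 2, 1)$ objects):

```python
from math import factorial

def omega(counts):
    """Omega_t(n) = N! / (n_1! n_2! ...), with N = n_1 + n_2 + ..."""
    N = sum(counts)
    total = factorial(N)
    for n in counts:
        total //= factorial(n)   # exact integer division: n_j! always divides
    return total

# 4 distinct objects with occupation numbers (1, 2, 1): 4!/(1! 2! 1!) = 12 ways
ways = omega([1, 2, 1])
```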

Each combination has a specific configuration of energies $E_{j}$. For example, if we place 4 objects A, B, C, and D into boxes 2, 3, 2, and 1, respectively, the total energy of this configuration (or micro-state) is given by

$E_{1}+2E_{2}+E_{3}=E_{t}$

Of course, this configuration is just like a playlist, except that we have not accounted for the ordering explicitly. We can consider other configurations as well, but they all have the same total energy $E_t$. This leads to the constraint

$\sum_{j}n_{j}E_{j}=E_{t}$

The probability $P_j$ of being in each box is easily computed for small $\mathcal{N}$

$P_{j}=\dfrac{\bar{n}_{j}}{\mathcal{N}}$

where $\bar{n_{j}}$ is the average number of objects in each box.   And we can compute the average energy $\bar{E}$

$\bar{E}=\sum_{j}P_{j}E_{j}$

We want to find the most likely distribution $\Omega(n^{*})$ of objects, subject to the constraints on the distribution. But as soon as $\mathcal{N}$ gets large, we cannot just evaluate these sums directly, because there are too many terms. We note, however, that the most likely distribution is very highly peaked around $n^{*}$, as is the most likely average energy.

This means that we can approximate the most likely distribution by finding the maximum value of $\Omega(n)$. Since the distribution is so peaked, we can approximate this very well by maximizing the log of the distribution

$\max\log\Omega(n^{*})$ subject to our constraints

$\sum_{j}n_{j}=\mathcal{N}$ and $\sum_{j}n_{j}E_{j}=E_{t}$.

We can solve this optimization using the method of Lagrange multipliers, as we do in many convex optimization problems in ML. This yields

$\dfrac{\partial}{\partial n_{j}}\left[\ln\Omega_{t}(n)-\alpha\sum_{i}n_{i}-\beta\sum_{i}n_{i}E_{i}\right]=0$

where $\alpha,\beta$ are the Lagrange multipliers (which usually appear as adjustable parameters in other ML problems). We want to solve this problem in the limit $\mathcal{N}\rightarrow\infty$. We take advantage of

Stirling’s Approximation

$\ln\mathcal{N}!\cong\mathcal{N}\ln\mathcal{N}-\mathcal{N}$

to write

$\ln\Omega_{t}(n)=\left(\sum_{i}n_{i}\right)\ln\left(\sum_{i}n_{i}\right)-\sum_{i}n_{i}\ln n_{i}$
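Stirling's approximation is remarkably accurate for large $\mathcal{N}$, which is why we can trust it in the $\mathcal{N}\rightarrow\infty$ limit. A quick numerical check, using `math.lgamma` to get the exact $\ln \mathcal{N}!$:

```python
import math

def stirling(n):
    """Stirling's approximation: ln n! ~ n ln n - n."""
    return n * math.log(n) - n

n = 1_000_000
exact = math.lgamma(n + 1)                 # exact ln(n!) via the log-gamma function
rel_err = abs(exact - stirling(n)) / exact
```

For $n = 10^6$ the relative error is already tiny; the leading correction is the $\tfrac{1}{2}\ln(2\pi n)$ term, which grows far more slowly than $n\ln n$.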

Now we can evaluate the derivatives explicitly

$\ln\left(\sum_{i}n_{i}\right)-\ln\left(n_{j}^{*}\right)-\alpha-\beta E_{j}=0$

With this simple expression, we can find the optimized, average number of objects in each box

$n_{j}^{*}=\mathcal{N}e^{-\alpha}e^{-\beta E_{j}}$

where

$e^{\alpha}=\sum_{j}e^{-\beta E_{j}}$

and the optimal probability of being in each box

$P_{j}=\dfrac{n_{j}^{*}}{\mathcal{N}}=\dfrac{e^{-\beta E_{j}}}{\sum_{j}e^{-\beta E_{j}}}$

which is known as the Boltzmann Distribution, or, more generally, the Gibbs Measure.

The average energy of the system, which is a kind of optimal normalization for the problem, is given by

$\bar{E}=\dfrac{\sum_{j}E_{j}e^{-\beta E_{j}}}{\sum_{j}e^{-\beta E_{j}}}$
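Numerically, $\bar{E}$ is just a weighted average under the Boltzmann weights. A standard identity worth knowing (not used elsewhere in this post) is $\bar{E} = -\partial \ln Z / \partial \beta$, which a quick finite-difference check confirms; the energy levels below are arbitrary:

```python
import numpy as np

E = np.array([0.0, 1.0, 2.0, 3.0])         # arbitrary energy levels
beta = 1.0

def lnZ(b):
    """Log of the partition function Z = sum_j exp(-b * E_j)."""
    return np.log(np.sum(np.exp(-b * E)))

P = np.exp(-beta * E) / np.exp(lnZ(beta))  # Boltzmann probabilities
E_bar = np.sum(P * E)                      # average energy

# E_bar should equal -d lnZ / d beta (central finite difference)
h = 1e-6
E_bar_fd = -(lnZ(beta + h) - lnZ(beta - h)) / (2 * h)
```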

We now recognize that the energies are negative log probabilities (up to the normalization)

$E_{j}\sim-\log(P_j)$

and we call the normalization factor the Partition Function $Z$

$Z=\sum_{j}e^{-\beta E_{j}}$

We call $\beta$ the inverse temperature, because $\beta\sim\dfrac{1}{T}$ in traditional thermodynamics.
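To see what the temperature actually does, here is a small illustration (the energy values are arbitrary): at low $T$ (large $\beta$) the distribution concentrates on the lowest-energy state, while at high $T$ (small $\beta$) it flattens toward uniform. This is exactly the knob Simulated Annealing turns.

```python
import numpy as np

def boltzmann(E, beta):
    """P_j = exp(-beta * E_j) / Z, with Z = sum_j exp(-beta * E_j)."""
    w = np.exp(-beta * np.asarray(E, dtype=float))
    return w / w.sum()

E = [0.0, 1.0, 2.0]
p_cold = boltzmann(E, beta=5.0)   # low temperature: the ground state dominates
p_hot = boltzmann(E, beta=0.1)    # high temperature: nearly uniform
```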

We have completed our mission; we have introduced the notion of a Temperature into our probabilistic models.

Using the Temperature:  Simulated Annealing

We should not use a formalism unless we intend to take advantage of it. Joachims uses the LME formalism to create a complicated model, but he only uses SGD to solve it. We can do better than this. Indeed, for Deep Learning, we will have to.

Of course, it is a bit more technical than this, and this is not the only way to solve the problem. Plus, we still need Regularization to avoid overfitting and to obtain sparse solutions. We will leave this for another post.

Still, we see that the language of Statistical Physics offers many advantages over traditional probabilistic models, and this is why it is so popular in machine learning. In our next post we show how to combine the Entropy with an SVM to provide a very powerful, hybrid, semi-supervised learning method called a Transductive SVM.

11 thoughts on “Metric Learning: Some Quantum Statistical Mechanics”

1. It’s a very interesting post; it is always nice to see the fundamental principles of the methods we often use in ML (and sometimes forget what stands behind them). However, I find a few points difficult to digest. Could you shed some light on them?

1. I see you are looking for the most likely arrangement of the ‘balls’ in the boxes. But what allows you to switch from searching for the maximum likelihood to the minimum one? I refer to the $\min \log \Omega(n)$.

2. I don’t see the argument connecting the logarithmic function with the ‘peakiness’ of the distribution. I have always thought that the reason for the log-mapping stems from the monotonicity preservation of log for a certain family of logarithmic bases.

3. I would be interested in understanding why by increasing T (or decreasing the Lagrange multiplier) we enable the optimizer to ‘jump’ over the local minima. What is the algebraic argument? Is it somehow connected with the Stochastic Gradient Methods?

Thanks!


• charlesmartin14 says:

1. by the most likely arrangement i mean the one that is most probable. what is ‘the minimum one’?

2. because the distribution is highly peaked, when we take the log, the distribution does not really change that much. this was the classical argument; these days people kinda ignore this point and just assume that the log of the distribution is representative of the distribution, and that’s not the case in the tails of the distribution for small data sets

3. Yeah we don’t really adjust T in the SGD itself. What we do is take a random jump, and then accept or reject the jump based on temperature threshold. I need to explain in code so i will try to write some up.

what language is best? ruby? python? java? lua?


• charlesmartin14 says:

on 1. I might have switched a sign let me check and get back to you
on 2. yes this is subtle but it is the intent. this is sometimes missed in ML
on 3. SGD only works for convex methods…


2. Hi, I was thinking of aggregating your posts into a toolkit (python perhaps?), do you think it would be useful?


• charlesmartin14 says:

we could certainly do something like this with some of the newer posts
I have a lot of ideas and not enough time to write up everything
usually I make a github account and post code there, as i did with the ruby work

I had some issues displaying the IPython notebook on WordPress, and it was unclear to me whether I need to upgrade to a paid account to do this
do you know?

