I wrote this for my buddy Sebass; just a quick review of quantum stat mech:

In our last post, we presented a formalism for music recommendations called Logistic Markov Embedding (LME). This technique uses some mathematics that arises in other modern machine learning methods, such as Maximum Entropy (MaxEnt) methods, and the Restricted Boltzmann Machines (RMBs) in Deep Learning.

In this post, we are going to take a step back and understand where the idea for a metric embedding comes from and the mathematical formalism employed by Joachims. The Metric Embedding in MLE is a kind of Energy model — an idea from Quantum Statistical Mechanics.So we will review the pedagogic formalism of Statistical Mechanics for Thermodynamics.

We will follow the classic text by T. Hill, and one of the first books I ever read in college. Also, we will not look at more abstract version of Quantum Statistical Mechanics because we don’t need to deal with the phase space formulation until we look at time-dependent processes like Causality (but eventually we will get there). For a deeper view of this subject, we defer to the classic text by Jaynes, Probability Theory: The Logic of Science.

# Probabilities and the Partition Functions

Frequently we see expressions for probabilities, as in the LME approach, written as:

where is partition function, given by

Naively, we can just think that we are defining the probability in an abstract way, and that is just the normalization. But there’s more to it. In fact, it is deeply connected the Boltzman / Gibbs Distribution

*So why in the world do we do this?*

#### Convex and non-Convex Optimization

In simpler machine learning algos like SVMs and Logistic Regression, we can readily solve the optimization problems using Stochastic Gradient Descent (SGD)– because these are convex optimization problems.

Below we see two optimization surfaces: our goal is to find the global minimum. The top problem is easy to solve, and just about any decent descent method will do (unless we need sparse solutions, which makes this a little trickier). The bottom problem, however, is filled with hills and bumps, and even looking at it, we are not sure where the bottom is.

So how do we find it? Typical solvers for non-convex problems include Genetic Algorithms, Convex-Concave/DC programming, BackProp (for ANNs), and Simulated Annealing.

In Simulated Annealing, we run the solver until it gets stuck. We then heat the system, making it jump over local humps. We then cool it down–we run the solver again until it stops. We repeat this over and over until we find the global minimum . A common application is in the simulation of liquids and crystals. To do this, we need to define a temperature.

#### Adding Temperature to Probabilistic Modeling

In traditional probability theory, we require the individual probabilities sum to 1:

Of course this is always true, but we don’t have to enforce this all the time; we can relax this constraint and simply enforce it at the end of the problem. This is exactly what we do in Simulate Annealing. Instead of creating probability models, we create Energy Models. The energies are essentially the log of unnormalized probabilities

Which is what we see in LME. We only require the total energy to be constant–not 1. This means when we compute

we need the normalization factor

#### Modeling with Energies

Lets say we have a bunch of objects that we would like to distribute (or cluster) into boxes, sets, or sequences in some optimal way. We might be placing balls into boxes, or we could be assigning documents into clusters, or we are finding personalized playlists. We want to find the optimal way of doing this through some constrained optimization problem. Imagine we have a container partitioned into small, disjoint boxes or sets . We have

- the total number of objects $latex \mathcal{N} $ is fixed. For us, the total number songs.
- the total ‘Volume’ V is fixed. This will come up later when we create the metric embedding

We wanted to find the optimal probability of a song to appear in a playlist. Here, more generally, we seek the probability , of putting an object in a box , subject to our prior knowledge of the problem. In other approaches, we might define a conditional probability for each box. Here, we say each box has an energy . We require total Energy of the system is always fixed. We can place objects into each box . We only require

Each combination, or set of numbers, constitutes a *distribution* . There are many ways of doing this; total number of combinations is given by the multinomial coefficients

Each combination has a specific configuration of energies . For example, if we place 4 objects A, B, C, and D into boxes 2, 3, 2, and 1, respectively the total energy of this configuration (or micro-state) is given by

Of course, this configuration is just like a playlist, except we have not accounted for the ordering explicitly here. We can consider other configurations as well

but they all have the same total energy . This leads to the constraints

The probabilities of being in each box is easy computed for small $latex \mathcal{N} $

where is the average number of objects in each box. And we can compute the average energy

We want to find the *most likely distribution* of objects, subject to the constraints on the distribution. But as soon as N gets large, we can not just evaluate these sums because they are too large. But we note that the *most likely distribution* is very highly peaked around

as is the most likely average Energy

Which means that we can approximate the most likely distribution by trying to find the minimum value of . Since this distribution is so peaked, we can approximate this very well by minimizing the log of the distribution

subject to our constraints

and .

We can solve this optimization using the method of Lagrange multiplies, as we do in many convex optimization problems in ML. This yields

where are the Lagrange multiplies (and which usually appear as adjustable parameters in other ML problems). We want to solve this problem in the limit . We take advantage of

##### Stirling’s Approximation

to write

Now we can evaluate the derivatives explicitly

With this simple expression, we can find the optimized, average number of objects in each box

where

and the optimal probability of being in each box

which is known as as the Boltzman Distribution, or , more generally, the Gibbs Measure.

The average energy of the system, which is a kind of optimal normalization for the problem, is given by

We now recognize that the energies are log probabilities

and we call the normalization factor the Partition Function

We call the inverse temperature, because in traditional thermodynamics.

We have completed out mission; we have introduced the notion of a Temperature into our probabilistic models.

#### Using the Temperature: Simulated Annealing

We should not use a formalism unless we intent to take advantage of it. Joachim’s uses the LME formalism to create a complicated model, but he only uses SGD to solve it. We can do better than this. Indeed, for Deep Learning, we will have to.

Of course, it is a bit more technical than this. And this is not the only way to solve the problem. Plus* we still need Regularization to avoid overfitting, and obtain sparse solutions. *We will live this to another post.

Still, we see that the language of Statistical Physics offers many advantages over traditional probabilistic models–and this why it is so popular in machine learning. In our next post we show how to combine the Entropy with an SVM to provide a very powerful, hybrid, semi-supervised learning method called a Transductive SVM.

It’s a very interesting post, it is always nice to see the fundamental principles of the methods we often use in ML (and sometimes forget what stands behind them). However, I find a few points difficult to digest. Could you shed some light on them?

1. I see you are looking for the most likely arrangement of the ‘balls’ in the boxes. But what allows you to switch from searching for the maximum likelihood to the minimum one? I refer to the $\min \log \Omega(n)$.

2. I don’t see the argument connecting logarithmic function with the ‘peakiness’ of the distribution. I have always been thinking that the the reason of the log-mapping stem from the monotonicity preservation of log for certain family of the logarithmic basis.

3. I would be interested in understanding why by increasing T (or decreasing the Lagrange multiplier) we enable the optimizer to ‘jump’ over the local minima. What is the algebraic argument? Is it somehow connected with the Stochastic Gradient Methods?

Thanks!

1. by the most likely arrangement i mean the one that is most probable. what is ‘the minimum one’?

2. because the distribution is highly peaked, when we take the log, the distribution does not really change that much. this was the classical argument; these days people kinda ignore this point and just assume that the log of the distribution is representative of the distribution, and that’s not the case in the tails of the distribution for small data sets

3. Yeah we don’t really adjust T in the SGD itself. What we do is take a random jump, and then accept or reject the jump based on temperature threshold. I need to explain in code so i will try to write some up.

what language is best? ruby? python? java? lua?

Thanks for the comments…its very helpful to have some feedback

on 1. I might have switched a sign let me check and get back to you

on 2. yes this is subtle but it is the intent. this is sometimes missed in ML

on 3. SGD only works for convex methods…

Hi, I was thinking of aggregating your posts into a toolkit (python perhaps?), do you think it would be useful?

we could certainly do something like this with some of the newer posts

I have a lot of ideas and not enough time to write up everything

usually I make a github account and post code there, as i did with the ruby work

I had some issues displaying the ipython notbook on wordpress and it is was unclear to me if i need to upgrade to a paid account to do this

do you know?

Hi, have you seen this post: http://www.josephhardinee.com/blog/?p=46 ?

yes this is for self-hosted wordpress where you can modify the css directly

it was unclear to me if histed wordpress would allow this

[…] discussed basic Statistical Mechanics and Deterministic Annealing in our previous […]

[…] Safe Semi-Supervised SVM methods. The S4VM tries to improve upon the SvmLin DA-TSM algo using Simulated Annealing, and then selects the best solutions by evaluating the final cluster […]