# Machine Learning with Missing Labels Part 2: The UniverSVM

Ever wonder what Google DeepMind is up to?  They just released a paper on   What is Semi Supervised Learning (SSL)? In this series of posts, we go back to basics and take a look.

Most machine learning algos require huge amounts of labeled data to achieve high accuracy. SVMs, Random Forests, and especially Deep Learning, can take advantage of massive labeled data sets.

How can we learn when we only have a few labeled examples?

In SSL, we try to improve our classifiers by taking advantage of unlabeled data.

In our last post we described an old-school approach to this problem, the Transductive SVM (TSVM), circa 2006. Here, we examine a different approach to learn from unlabeled data–Infernece with the Universum–ala Vapnik 2006.

Plus, there is code!  The UniverSVM code has a TSVM and the Universum (USVM) approach.

#### Wait, doesn’t Deep Learning use unlabeled data in pretraining?

This is different.   Deep Learning uses huge amounts of labelled data.    We want to use as little labeled data as possible.  To me, this is the real test of a learning theory.

In this and the next few blogs, we will examine at several variants of SVMs– USVM, WSVM, SVM+, and Semi Supervised Deep Learning methods–all that achieve high accuracy with only a small amount of labeled data.   Also, we only consider text classification–images and other data sets require different methods.

### Learning with Unlabeled Data

Let’s say with have a small number $N_L$ of labeled documents, and a large number  $N_U$ of unlabeled ones.   Can we build document classifier (i.e. a binary SVM) that generalizes well with just a little labeled data?

According to the Vapnik and Chervonenkis Statistical Learning Theory (VC-SLT), our generalization accuracy is very losely bounded by the model complexity and the number of training documents $N_L$:

$R(\alpha)-R_{emp}(\alpha)\leq2\mathcal{R}_{m}(\mathcal{H})+\kappa\sqrt{\dfrac{ln(\frac{1}{\delta})}{N_L}}$

which, in plain English, is just

Generalization Error $\leq$ Model Capacity ($2R_{m}$) $+$ Size Effects ($\ \sim\sqrt{\dfrac{1}{N_{L}}}$)

The VC-SLT  inspired the SVM maximum margin approach (although they are not equivalent).

SLT recognizes that for any data set $\mathcal{L}$ (of size $N_{L}$), we may obtain multiple, equivalent models (i.e. all having the same training accuracy):

$\Gamma_{L}=\left[\mathcal{H}_{1}\sim\mathcal{H}_{2}\sim\mathcal{H}_{3}\sim\cdots\right]$

We should select the model $\mathcal{H}$ from $\Gamma_{L}$ with the smallest capacity $2R_{m}(\mathcal{H})$; in an SVM, this means select the largest margin.

This is easiest to visualize when the slack or admissible error = 1

Each model $\Gamma_{L}$ is a set of hyperplanes that give same labeling. The left model is optimal because it has the largest ‘SVM Capacity’, which is essentially the volume carved out by the admissible hyperplanes.  Turns out, the best model also has the hyperplane with the largest margin, so

The maximun margin is a measure of the VC capacity…

but is it the best?

While the VC-SLT bound is very loose, it does suggest that we usually need a large number of labeled documents $N_L$ to obtain any reasonable production accuracy.

The SLT is, however, inherently a theory about Transductive Learning; the proof of the VC bounds requires first proving the Transduction bound (i.e. via Symmetrization). It has always been the dream of the VC-SLT program to develop a Transductive or SemiSupervised) method that can learn much faster than $\dfrac{1}{\sqrt{N_{L}}}$.

(By Semi-Supervised, we mean that the resulting model can be applied to out-of-sample-data directly, where as Transductive learning only applies to the known, unlabeled test set.  We will not distinguish between them here.)

We would hope to achieve $\dfrac{1}{\sqrt{N_{L}+N_{U}}}$ by adding in unlabeled data. When $N_U$ is very large, we win.

(Indeed, recently, Vapnik has shown it is actually possible to reduce this bound from $\dfrac{1}{\sqrt{N_{L}}}$ to $\dfrac{1}{N_{L}}$–if we can Learn Using Privileged Information (LUPI).   This means we can learn when $N_{L}=320$ whereas previously we needed $N_{L}=100,000$!  But this actually is akin to assigning additional data or even weights to each labelled example–and this is not the same as using just unlabeled data.  Perhaps we can learn the weights from the unlabeled data–but that’s for another post.)    So we are left with the un-nerving statement

If the max margin is the right measure, then the TSVM should work very well…

and yet, this has proved elusive.

What is the alternative ? Suppose instead of measuring the volume traced out by the hyperplanes, our models measured the volume between the convex hulls of the 2 classes:

Notice that now, the labeling on the right is better.

This is a broader measure of the diversity of the equivalence class. In SLT, this is related to the VC Entropy (another measure of VC capacity).

The Universum approximates this volume–or rather the VC Entropy– via the Principle of Maximum Contradictions.  (In fact, the Universum is not the only way to do this, as we shall see.)  A clever idea–but something hard to achieve in practice.  Let us compare and contrast the TSVM and the USVM to understand how to select the data sets they need and their limitations.

#### Transductive SVM (TSVM)

Transductive Inference requires a statistical replica $\mathcal{R}$ of the labeled data $\mathcal{L}$:

The replica $\mathcal{R}$ is not only the same size as $\mathcal{L}$, it has the same statistical qualities.  In particular, the label means converge:  $\rightarrow$ as $\lim N\rightarrow\infty$. (i.e. in a well balanced binary classifier,  this is zero)

In theory, we can always create a replica (or phantom sample) because we assume the labeled data itself is drawn from some common empirical process.  In modern VC-SLT, we think of the symmetrization process that creates $\mathcal{R}$ as a Rademacher process — meaning that we have replicated the training data but randomized the labels.

In practice, we need to select the replica from unlabeled data–and this is hard!

We hope that by adding unlabeled data, we can find the best solution by guessing the labels of the unlabeled data and then maximizing the margin on all the data.

We can apply Transduction SVMs  if we can create a large, unlabeled data set $\mathcal{U}$ that behaves like a statistical replica of $\mathcal{U}$, albeit much, much larger.   TSVMs allow us to increase the accuracy of a binary text classifier by adding in large amounts of unlabeled data.

Also, TSVMs extend the feature space, or hypothesis space $\mathcal{H}$.  This is critical for text classification since , frequently, we need to classify documents we have never seen before, and we encounter new words.  This is irrelevant for image classification.

If we have a collection of consumer blogs (about finance, beauty, sports, politics, …), with some labelled documents, and large amount of unlabeled ones.  We can create (1-vs-1) binary TSVM classifiers, such as finance vs beauty if we have a good way to select the unlabeled data, as described in our previous blog:

Essentially, the unlabeled documents must consist only of finance and beauty blogs, and in the same ratio as the training data.

TSVMs only work well for simple binary ( 1 vs 1 ) classification, and only when the document classes form simple clusters.  They don’t work for multi-class classifier because they can not handle ( 1 vs all ) data sets.

So while TSVMs do work, not just any unlabeled data set will do.  Indeed, I personally believe the key to a good TSVM is creating a well balanced unlabeled data.  Or, equivalently, estimating the fraction $r$ of (+/-) examples very well.      If the replica set is poor, or poorly understood, the TSVM results can be worse than just training an SVM on only the labeled data.

#### UniverSVM (USVM)

In 1998, and , later in 2006, Vapnik introduced a different kind of SVM–which also the allows learning from unlabeled data–but replaces the Principle of Maximum Margin with a more robust method called

the UniverSVM: the Principle of Maximum Contradictions

The idea is to add in data from classes that are significantly different from the 2 classes being separated:

To create a finance vs beauty classifier, we would add labelled and/or unlabeled documents from other categories, such as parenting, politics, sports, etc.   We then want a binary classifier that not only separates the 2 labeled classes, but is also as different  from the other class–the Universum.

We create 2 replicas of the Universum–one with all (+) labels, and one all (-).    We then add unlabeled documents (blue circles)  that lie in the margin between classes–or really in the space between the (+/-) classes:

The best Universum examples will lie in the convex hull between the finance and beauty documents; those inside the convex hulls will most likely be ignored.  Since all the u-labels are wrong, every equivalence class of hyperplanes will produce numerous contradictions on the Universum points (u):

The best model has the largest diversity on the Universum; in other words, the largest VC Entropy.   Vapnik has pointed out the following interpretation of the Universum: “[When trying to classify the labeled data, try to avoid extra generalizations.]“

The best model has the Maximum # of Contradictions on the Universum.

The UniverSVM (USVM)  represents a kind of prior information that we add to the problem; the difference is that instead of specifying a prior distribution, which is hard, we replace this with a more practical approach–to specify a set of specific examples.

Notice we could have just created a 3-class SVM (and theUniverSVM code provides this for direct comparison).  Or we could build 2-class classifier, augmented with some clustering metric to avoid the other class--in the same spirit the S4VM method.   See, for example, the recent paper on EMBLEM.  But in these simple approaches, the other class must only contain other–and that’s hard.

##### USVMs compared to TSVMs:

In the TSVM we have to be sure we are adding documents that are in the same classes as the training data.  In the USVM, we must be sure we are adding documents that do not belong to either class.

Also, with a TSVM, we need to know the fraction of (+/-) documents, whereas with the USVM we don’t. The USVM would seem to admit some data from all classes–in principle. In practice, we will do much better if it does not.

Most importantly, unlike the TSVM, the USVM optimization is convex. So adding in bad data may will not cause convergence problems as in the TSVM and thus degrade the model. At least we hope.

Also, as with the TSVM, we suspect the USVM will only work for small $N_{L}$ ; once $N_{L}$ grows above say 100 documents, we may not see much improvement.

### The UniverSVM Algorithm

The SVM approach,  inspired by SLT, implements a data dependent, but distribution independent, regularizer that lets us select the best model $\mathcal{H}$ from a class of equivalent hypotheses $\Gamma_{L}$.  What other methods select the optimal equivalence class $\Gamma_{r}$ ?

#### the Principe of Maximum Power

A model $H$ consists of an equivalence class of hyperplanes, say $h\in H_{\Gamma}$.  They all have the same accuracy on the labeled data $\mathcal{L}$.

Suppose we know a prior distribution $P(\omega)$ on the set of all possible hyperplanes.  Then we can define the power p* of the optimal model as

$p*=\int_{H_{\Gamma}*}d\mathbf{h}P(\mathbf{h})$

We rarely can define $P(\mathbf{h})$ mathematically..but we can approximate it.

Let us sample $P(\mathbf{h})$ by selecting $N_{U}$ unlabeled example documents; we call this set the Universum ($\mathcal{U}$).

We call the practical method the UniverSVM.  The key insight of the UniverSVM is that while we can not compute this integral, we can estimate it by measuring the number of contradictions the class of hyperplanes generates on points in $\mathcal{U}$.

##### The Principle of Maximum Contradictions:

We need a way to select the equivalence class $\Gamma$ with the maximum number of contradictions on $\mathcal{U}$.  As usual, we create a regularizer.

We augment the standard SVM optimization problem with a Universum regularizer $\Vert U(f(x_u))\Vert$

$\frac{1}{2}\Vert\omega\Vert_{2}^{2}+C_{l}\sum_{l\in\mathcal{L}}H(y_{l},f(x_{l},b))+C_u\Vert U(f(x_u))\Vert$

where H is the standard SVM Hinge Loss.

The regularizer $\Vert U(f(x_u))\Vert$ can also be defined through a symmetric Hinge loss, making the problem convex.  The UniverSVM code also contains a non-convex variant using a ramp-loss approach.  We leave the details to the academic papers.

#### Laplacian SVMs and the The Maximum Volume Principle

About the same time Vapnik introduced the UniverSVM, Nyogi (at the University of Chicago) introduced both the TSVM SvmLin code, as well as a new approach to Semi-Supervised learning, the Laplacian SVM (or LapSVM).

The TSVM maximizes the margin for the labelled and unlabeled data.  The USVM maximizes the contradictions on the unlabeled (UniverSVM) set.   In fact, both approximate a more general form of capacity–the maximum volume between the convex hulls of the data sets.  And this can be approximation using Laplacian regularization!  Let’s see how this all ties together.

The LapSVM optimization is

$\frac{1}{2}\Vert\omega\Vert_{2}^{2}+C_{l}\sum_{l\in\mathcal{L}}H(y_{l},f(x_{l},b))+C_{u}\Vert\mathcal{L}\Vert$

We see it is very similar to the USVM, but with a different regularizer –norm of the Graph Laplacian $\Vert\mathcal{L}\Vert$. There are several $\mathcal{L}$ to choose from but LapSVM uses the one corresponding to the Laplace-Beltrami operator on the data manifold.

The LapSVM has recently been applied to text classification, although we dont have a C or python version of the code to test like with do with SvmLin and the UniverSVM.

LapSVMs and related Manifold Learning approaches have motivated some very recent advances in Semi-Supervised Deep Learning methods, such as the Manifold Tangent Classifier.  This classifier learns the underlying manifold using contractive auto-encoder, rather than using a simple Laplacian–and this seems to work very well for images.

(We note that the popular SciKit Learn package includes Manifold Learning, but these are only the Unsupervised variants.)

We can relate the LapSVM approach to the VC-SLT, through the Principle of Maximum Volume:

the Norm of the Graph Laplacian is measure of the VC Entropy

Let us again assume we have to select the best equivalence class $\Gamma^{*}$ on our labeled data $\mathcal{L}$ from a set of hyperplanes $\mathbf{h}\in H_{\Gamma}*$.    Lacking a specific prior , we can assume a uniform distribution $P(\mathbf{h})=1$.

We then simply need to approximate the volume

$V_{r}(\mathbf{h})=\int_{\Gamma_{r}}d\mathbf{h}$

We need a way to compute V, so we assume there exists an operator $\mathbf{Q}$ such that

$V_{r}(\mathbf{h})\equiv\dfrac{\mathbf{h^{\dagger}Qh}}{\mathbf{h^{\dagger}h}}$

To this end, Vapnik et. al. introduce “a family of transductive algorithms which implement the maximum volume principle“, called Approximate Volume Regularization (AVR) algorithms(s).  They take the form

$\min Loss_{l}(\mathbf{h})+C_{u}\dfrac{\mathbf{h^{\dagger}Qh}}{\mathbf{h^{\dagger}h}}$

For many problems, $\mathbf{Q}$ can just be the Graph Laplacian

$\mathbf{Q}=\mathcal{L}$,

the specific form specified by the specific AVR.

If we write the general form of $\mathcal{L}$ as

$\mathcal{L}=I-D^{-\frac{1}{2}}WD^{\frac{1}{2}}$

W can be defined using the Gaussian Similarity

$W_{i,j}=exp(\dfrac{-\Vert x_{i}-x_{j}\Vert^{2})}{2\sigma^{2}}$

which has a single adjustable width parameter $\sigma$.

A more robust Laplacian uses the Local-Scaling Similarity

$W_{i,j}=exp(\dfrac{-\Vert x_{i}-x_{j}\Vert^{2})}{2\sigma_{i}\sigma_{j}}$

which has N adjustable parameters $\sigma_{i}$.

the Maximum Volume Principle may be a better measure of VC capacity

This Principle of Maximum Volume has been applied in a number of ways, such as in recent papers on , and

Most importantly, a very recent paper proposes a MultiClass Approximate volume Regularization Algorithm (MAVR).

We have also connected to seemingly different approaches to machine learning–traditional VC theory and operator theoretic manifold learning.  In a future post, we will look at some practical examples and compares how well the available open source TSVM and USVM codes on text data.

# Machine Learning with Missing Labels: Transductive SVMs

SVMs are great for building text classifiers–if you have a set of very high quality, labeled documents.

Frequently, we just don’t have enough labelled data!  What can we do?

• We can crawl the web for more labelled data. For this purpose, here at Calculation Consulting we built a production quality crawler
• Pay mechanical Turks to label the documents.  This is actually harder than it sounds.
• We can use what labels we have, and try to guess the labels of the unlabeled documents.

Guessing labels is called Transductive Learning and is the topic of this post.

### SVMs and Self-Training

We have a huge collection of documents, perhaps with just a few labels.  The obvious, or perhaps naive, thing to do is build a classifier using what we have, and then try to scale it out incrementally by self-training.  That is, we select documents labeled that are labelled high confidence by the smaller model and use them to build a training data set for a larger SVM model:

So this seems pretty simple — what could go wrong? Since the initial model has very little labeled data, it is going to make lots of mistakes, even in the high confidence regime.   We are adding in some data that is mis-labeled.    As we incrementally self- train,  we amplify these errors, thereby reducing overall accuracy.  So what can be done?

### Transductive Learning

Suppose that, as we add the new data in, instead of just naively using our smaller model to label the new data, we could automatically somehow switch the labels on these new documents, and then pick the best model from all these possibilities.    That is, we run an SVM where we retain the labels on the labeled (blue) documents, but we let the labels on the test (purple) documents float.  We constrain the problem so that at least $r$ unlabeled documents are labeled positive.  We then pick the model with the largest margin. This leads to the following non-convex optimization problem for the Transductive SVM (TSVM)

$\min\dfrac{\lambda}{2}\|w\|^{2}+\dfrac{1}{2L}\sum\limits_{i=1}^{L}l(y_{i},w^{T}x_{i})+\dfrac{\lambda'}{2U}\sum\limits_{j=1}^{U}l(y_{j},w^{T}x_{j})$

subject to the constraint $\dfrac{1}{U}\sum\limits_{j=1}^{U}\max[0,sign(w^{T}x_{j}')]=r$

where the $l$ is the loss function (such the SVM hinge loss or, as here, the L2_SVM loss).  There are $L$ labelled documents $(x_{i})$ and $U$ unlabeled documents $(x_{j})$, and $\lambda,\lambda'$ are adjustable parameters.  The constraint $r$ fixes the number of positively labeled documents; this constraint I worry about the most because it is the one I use to tune the TSVM parameters — but let’s just assume we can estimate this and move forward.

#### Why does Adding Test Data to the Training Set Improve the Accuracy?

When we don’t have enough training data, the SVM optimization problem is still convex, but the optimal solution may be spurious.  Imagine that we are trying to train a binary text classifier with only say 100 documents in each class.  The classifier will learn how to separate the documents, but it will not be able to generalize to new documents because it simply has not seen enough examples or really enough words to build a usable feature space.    The solutions may be spurios.

Indeed, there may exists many margins that separate the data, and the largest may not be the one that generalizes best.  That is, without enough training data, the SVM may mis-learn and think that noise words are driving the  classification.

We want is a classifier that generalizes to unseen test documents;

if we don’t have test labels, perhaps we can guess them.

When we add unlabeled documents , or test data (circles) , to the classifier training data.  If we can guess the labels of the circles correctly, we can find the optimal solution because it is less ambiguous:

Notice that since we are seeking a maximum margin solution, we implicitly are assuming that the documents form clusters in feature space because the margin lives in the low density regime between document clusters.  We are also are increasing the size of feature space, thereby greatly improving the generalizabilty of the model.

### TSVM Algorithm: Deterministic Annealing

Unlike a standard (inductive) SVM, the TSVM optimization problem is highly non-convex and quite difficult to solve exactly.  Clearly we can not switch the labels on all the new documents; if we add in N documents, then we have $2^{N}$ combinations to evaluate.  So how do we solve it?

There are several known approaches to solving the TSVM–both of which are readily available in the open source package SvmLin:

• Label Switching Heuristics  (-A 2) , and
• Deterministic Annealing (-A 3)

We will discuss the Deterministic Annealing (DA) algorithm because it is very nice example of how to combine the ideas of SVMs with the techniques of Statistical Mechanics discussed in our last post.   Other, more modern Transductive learning algos may use alternate methods such as convex-concave programming  (CCCP, available in Universvm  [11]) ,  and even quasi-Newton gradient descent (QN-S3VM) [10].

In the original paper [3], TSVM  refers to option  -A 2, and DA to -A 3.  Here, we use the term TSVM to mean any implementation (TSVM, DA, CCCP, QN-S3VM).   Because we are applying the TSVM to text, however, we do not advocate using label propagation or other graph based methods.

#### Combining SVMs with Statistical Mechanics

We previously, discussed basic Statistical Mechanics and Simulated Annealing, where we derive the concept of Entropy.

To solve the  TSVM / DA optimization problem, we need relax the constraint on the labels that they be integers, and, instead let them represent the probability of being in the (+) or (-) class.  We then introduce a temperature $T$.   First, lets get some labels that range from 0 to 1:

$\mu_{j}=\dfrac{1+y_{j}}{2}$ so that

$\mu_{j}=1$ when $y_{j}=1$,  and

$\mu_{j}=0$ when $y_{j}=-1$

This lets us re-write the TSVM optimization problem as

$\min\dfrac{\lambda}{2}\|w\|^{2}+\dfrac{1}{2L}\sum\limits_{i=1}^{L}l(y_{i},w^{T}x_{i})+\dfrac{\lambda'}{2U}\sum\limits_{j=1}^{U}(\mu_{j}l_{2}(y_{j},w^{T}x_{j})+(1-\mu_{j})l_{2}(y_{j},-w^{T}x_{j}))$

where we write $l_{2}$ to specifically indicate that the loss function is the L2-SVM loss.

We now take a trick from the Ising model in Statistical Mechanics.  We want to transform the labels

$\mu_{j}\rightarrow p_{j}$

into probabilities that range from 0 to 1:  $p_{j}\in[1,0]$

and are normalized to the total number of expected positively labeled instances: $\dfrac{1}{U}\sum_{j=1}^{u}p_{j}=r$

How do we incorporate probabilities into our SVM optimization problem?  We are already selecting the maximum margin of the training data; we now need to also select the optimal probabilities — the maximum Entropy — of the labels for the unlabeled data.

We extend the TSVM max-margin problem, adding a term that maximizes the total Entropy  $S$ of the possible configurations of the unlabeled test data

$S=-\sum_{j=1}^{u}p_{j}\ln(p_{j})+(1-p_{j})\ln(1-p_{j})$

Actually, since we minimize $\|w\|^{2}$, we actually minimize the negative Entropy $-S$.  This leads to a new TSVM / DA optimization problem

$\min\dfrac{\lambda}{2}\|w\|^{2}+\dfrac{1}{2L}\sum\limits_{i=1}^{L}l(y_{i},w^{T}x_{i})+\dfrac{\lambda'}{2U}\sum\limits_{j=1}^{U}(\mu_{j}l_{2}(y_{j},w^{T}x_{j})+(1-\mu_{j})l_{2}(y_{j},-w^{T}x_{j}))+\dfrac{T}{2U}\sum_{j=1}^{u}p_{j}\ln(p_{j})+(1-p_{j})\ln(1-p_{j})$

where the additional parameter  $T$ is a Lagrange Multiplier for Entropy $S$ .  We also have to add in the normalization (and class balancing) constraint

$\dfrac{1}{U}\sum_{j=1}^{u}p_{j}=r$

Handing this constraint is a bit more complicated, and we refer to [3];  we do note that the constraint can be re-written into what looks like a Fermi energy, which is quite interesting.

We maximize the Margin on the training (and test) sets,

and maximize the Entropy on the test set.

Solving this optimization problem requires alternating between solving the labeled SVM optimization problem (maximizing the margin $w$) and finding the optimal probabilities for the unlabeled instances (optimizing $p_{j}$).  The result are labels (with confidences) for our previously unlabeled data, and, if we want to use it, a new , inductive SVM model.

We also need to be careful not to overtrain.  So I will select out the newly labeled data with the highest confidence labels and then use this as training data for the production SVM (i.e LibLinear).  This would allow the data scientist to then take a label set of say 100 documents per class, and automatically extend it to say 10,000 labeled documents per class without the use of mechanical Turks or crawling the web.

Despite the wonderful advantages promised by Transductive (and SemiSupervised) Learning, it is quite hard to apply in practice and has not achieved even close to the popularity of its Supervised and Unsupervised counterparts.  TSVMs need both proper tuning [8] and well designed data sets.  We need to understand it’s practical implementation much better to use it.

#### Practical Issues:  Selecting the best labelled data

An initial training data set may be naively separable if it is small and has a large number of features :

naively separable:

$N_{instances}\ll N_{features}$

As a data set grows, we may have far more instances than features:

potentially unseparable:

$N_{instances}\gg N_{features}$

which makes it far more probable that some instances will lie in the slack region between classes.

We usually run a tool like LibLinear , which implements either a soft-margin SVM or L2-regularized Logistic Regression.  Indeed, in practice is usually impossible to tell the difference between these 2 models.  The additional slack terms can be though of as an extended regularizer or as modifying the SVM loss function.  Indeed, even Vapnik himself has argued that it can be very beneficial to run an SVM with a large number of adjustable slack variables (i.e in master learning and/or his universum model [5]). The SvmLin implementation of the TSVM is a hard margin (binary, no slack variables) SVM, but it does use the L2-SVM loss function (and not the harder hinge loss).

It is suggested to run the TSVM slowly and repeatedly to relabel documents, and, on each iteration, to only take new labels that have moderately high confidence (and therefore, hopefully, outside the slack region).

Of course, there is always the risk of biasing the data set, and it may be better to augment the selection process by selecting documents that are high confidence and close. For example, in the S4VM approach, they  high confidence documents that are also close to the labeled documents, as determined through a hierarchal clustering method.

In between each stage, it would be useful to apply a mechanical turk to check the results to ensure the newly labelled data is not bananas.

#### Practical Issues:  designing multiple 1-vs-1  data sets

Additionally, SvmLin only implements a binary SVM, whereas many practical applications require a multi-class SVM.  More generally, one might try to extend the TSVM to a DA type algorithm using say a Potts model–a classic model in physics that generalizes an Ising model. MultiClass Transductive Learning is a topic of current research.  For now we have to use SvmLin or a related package (Universvm, QN-S3VM) to get practical solutions.

In a normal SVM, balancing the class data is important; in a Transducitve-SVM, it is absolutely critical.  For example, this recent study on TSVMs discussed the issue.  This can be quite challenging in real world, production environments, and if you want to apply this powerful technique, extra care should be taken to design the TSVM (training + test) sets.

To apply SvmLin, it is necessary to carefully construct a collection of 1-vs-1 binary SVMs;  current TSVMs don’t work for 1-vs-all data sets [4]

For example, suppose we have a  4 class model, but we only have a few labeled instances for each class.  If we try to extend the set of instances for class 1, we need to carefully construct four 1-vs-1 binary TSVMs, and not one large data set for a 1-vs-all binary TSVM.  Each individual binary TSVM classifier should be designed carefully so that each class (+/-) consists of a single kind of document;.  Otherwise the solver will get trapped in a local solution and the TSVM will be worse.

Furthermore, the fraction $r$  is a normalization constraint, and keeping it fixed means the negative set must be carefully constructed to ensure that final result has $r=1/2$ positively labeled instances that represent the actual positive class well.  This is not easy to ensure and it may be necessary to sample the unlabeled data using mechanical turk or advanced techniques to estimate the true fraction  $r$ in the unlabeled data.

The TSVM can help add labeled data to a well designed model; it can not be used to repair a poorly designed classification model .

Consider a binary SVM trained on a small training set of cat and dog videos.   We then apply the model to some unseen test videos, and select a large number of videos that were labelled either cats or dogs.  The model will certainly make mistakes, but  we can hope that the errors are balanced across both classes.  So when we apply the TSVM to the entire training + test  set, half of the data will still actually be dog videos.

On the other hand, let’s consider the poorly designed classifier shown below.  Here, we have 3 classes:  dogs, cats, and animals.  The third class is trained for animal videos, and may contain dogs, cats, horses, mice, etc.  This is a bad design, and we would see this, say, in the confusion matrix for the entire model.  Even if we create three 1-vs-1 TSVMs, we are not going to get anywhere.  When we create the increment set, to train the TSVM, we have no idea what fraction of the training + test set are actually dog videos.  So we can not really fix the normalization constraint $r$, and resulting TSVM model may behave screwy.  We would need a way to estimate $r$.

Furthermore, the animals category will likely consist of a number a document clusters (dog, cat, horse, mouse, ferret, etc) and the TSVM solver will not converge properly for say the dog/animals binary TSVM.  The multiple clusters will confuse the solver.

Each  (+/-) class must effectively capture a single, simple text cluster.

#### Epilogue

Transductive Learning was first proposed by more than thirty years ago by Vapnik and Chervonenkis [6] , and it is a critical idea in the development of the famous VC Theory of Statistical Learning [1].  In a future post, we will look in detail at the role of Transductive Inference in the VC theory and some other formulations such as the UniverSVM and Instance Weighted SVMs [6].

Practically,  there are few useful Transductive and Semi-Supervised ML algos to use :

also note

• The famous scikit-learn package has 1 method, a semi-supervised Label Propagation algorithm. but this is not suitable for text
• It appears the enterprise tool RapidMiner does implement a commercial version of a TSVM, although I personally have not tried it.

In our next post,  we will run some experiments with the SvmLin code and see how well it works on some real world datasets.

At some point in the futurem we wlll discuss [12] and what (we think) is going on at Google Deep Mind.

To get involved, goto    https://github.com/CalculatedContent/tsvm

#### References

[1]  Vladimir Vapnik. Statistical learning theory. Wiley, 1998. ISBN 978-0-471-03003-4.

[2] T. Joachims, Transductive Inference for Text Classiﬁcation using Support Vector Machines, ICML 1998

[3] V. Sindhwani, S. Sathiya Keerthi, Large Scale Semi-supervised Linear SVMs, 2006

[4] O. Chapelle, B. Scholkopf, A. Zien, Semi-Supervised Learning, 2006

[5]  http://www.cs.princeton.edu/courses/archive/spring13/cos511/handouts/vapnik-slides.pdf

[6] Vapnik, V., & Chervonenkis, A.  The Theory of Pattern Recognition, Moscow 1974

[7]  Ran El-Yaniv, Dmitry Pechyony, Transductive Rademacher Complexity and its Applications

[8] J. Wang,  X. Shen,  W. Pan On Transductive Support Vector Machines

[10] Fabian Gieseke, Antti Airola, Tapio Pahikkala, and Oliver Kramer. Fast and Simple Gradient-Based Optimization for Semi-Supervised Support Vector Machines, Neurocomputing (ICPRAM 2012 Special Issue), Elsevier, accepted;  Fabian Gieseke, Antti Airola, Tapio Pahikkala, and Oliver Kramer. Sparse Quasi-Newton Optimization for Semi-Supervised Support Vector Machines, Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods (ICPRAM 2012), pages 45-54, 2012. [pdf]

[11]  Ronan Collobert ,Fabian Sinz  Jason Weston  Leon Bottou,  Large Scale Transductive SVMs   Journal of Machine Learning Research 7 (2006)

[12] Semi-supervised Learning with  Deep Generative Models  at Google Deep Mind

# Metric Learning: Some Quantum Statistical Mechanics

I wrote this for my buddy Sebass; just a quick review of quantum stat mech:

In our last post, we presented a formalism for music recommendations called Logistic Markov Embedding (LME).  This  technique uses some mathematics that arises in other modern machine learning methods, such as Maximum Entropy  (MaxEnt) methods, and the Restricted Boltzmann Machines (RMBs) in Deep Learning.

In this post, we are going to take a step back and  understand where the idea for a metric embedding comes from and the mathematical formalism employed by Joachims. The Metric Embedding in MLE is a kind of Energy model  — an idea from Quantum Statistical Mechanics.So we will review the pedagogic formalism of  Statistical  Mechanics for Thermodynamics.

We will follow the classic text by T. Hill, and one of the first books I ever read in college.  Also, we will not look at more abstract version of Quantum Statistical Mechanics because we don’t need to deal with the phase space formulation until we look at time-dependent processes like Causality (but eventually we will get there).  For a deeper view of this subject, we defer to the classic text by Jaynes, Probability Theory: The Logic of Science.

# Probabilities and the Partition Functions

Frequently we see expressions for probabilities, as in the LME approach,  written as:

$Pr(p)=\underset{i}\Pi\dfrac{\exp(-\Delta(p^{i},p^{i-1})^{2})}{Z(p^{i-1})}$

where $Z(p^{i})$ is partition function, given by

$Z(p^{i})=\sum_{j=1}^{|S|}\exp(-\Delta(p^{i},s_{j})^{2})$

Naively, we can just think that we are defining the probability in an abstract way, and that $Z(p^{i})$ is just the normalization.  But there’s more to it.  In fact, it is deeply connected  the Boltzman / Gibbs Distribution

So why in the world do we do this?

#### Convex and non-Convex Optimization

In simpler machine learning algos like SVMs and Logistic Regression, we can readily solve the optimization problems  using Stochastic Gradient Descent (SGD)– because these are convex optimization problems.

Below we see two optimization surfaces: our goal is to find the global minimum.   The top problem is easy to solve, and just about any decent descent method will do (unless we need sparse solutions, which makes this a little trickier).  The bottom problem, however, is filled with hills and bumps, and even looking at it, we are not sure where the bottom is.

So how do we find it?  Typical solvers for non-convex problems include Genetic Algorithms, Convex-Concave/DC programming,  BackProp (for ANNs), and Simulated Annealing.

In Simulated Annealing, we run the solver until it gets stuck.  We then  heat the system, making it jump over local humps. We then cool it down–we run the solver again until it stops.  We repeat this over and over until we find the global minimum .  A common application is in the simulation of liquids and crystals. To do this, we need to define a temperature.

#### Adding Temperature to Probabilistic Modeling

In traditional probability theory, we require the individual probabilities sum to 1:

$\sum P_{1}=1$

Of course this is always true, but we don’t have to enforce this all the time; we can relax this constraint and simply enforce it at the end of the problem.  This is exactly what we do in Simulate Annealing. Instead of creating probability models, we create Energy Models.  The energies are essentially the log of unnormalized probabilities

$E_{j}\sim\log(P_j)$

Which is what we see in LME.  We only require  the total energy $E_t$ to be constant–not 1.  This means when we compute $P_j$

$P_{j}=\dfrac{\exp(-\beta E_j)}{Z}$

we need the normalization factor $Z$

$Z=\sum \exp(-\beta E_i)$

#### Modeling with Energies

Lets say we have a bunch of objects that we would like to distribute (or cluster) into  boxes, sets, or sequences in some optimal way.  We might be placing balls into boxes, or we could be assigning documents into clusters, or we are finding personalized playlists.  We want to find the optimal way of doing this through some constrained optimization problem.   Imagine we have a container partitioned into small, disjoint boxes or sets $j$.  We have

1.  the total number of objects $latex \mathcal{N}$ is fixed. For us, the total number songs.
2.  the total ‘Volume’  V is fixed.  This will come up later when we create the metric embedding

We wanted to find the optimal probability of a song to appear in a playlist.  Here, more generally, we seek the probability $P_{j}$, of putting an object in a box $j$, subject to our prior knowledge of the problem.  In other approaches, we might define a  conditional probability for each box.  Here, we say each box has an energy $E_j$ . We require total Energy $E_{t}$ of the system is always fixed. We can place  $n_{j}$ objects into each box $j$. We only require

$\sum_{j}n_{j}=\mathcal{N}$

Each combination, or set of numbers,  ${n_1,n_2\cdots}$ constitutes a distribution $\Omega_{t}(n)$.  There are many ways of doing this;  total number of combinations is given by the multinomial coefficients

$\Omega_{t}(n)=\dfrac{\left(n_{1}+n_{2}\cdots\right)!}{n_{1}!n_{2}!\cdots}$

$\Omega_{t}(n)=\dfrac{\mathcal{N}!}{\prod_{j}n_{j}!}$

Each combination has a specific configuration of energies $E_{j}$.  For example, if we place 4 objects A, B, C, and D into boxes 2, 3, 2, and 1, respectively the total energy of this configuration (or micro-state) is given by

$E_{1}+2E_{2}+E_{3}=E_{t}$

Of course, this configuration is just like a playlist, except we have not accounted for the ordering explicitly here.  We can consider other configurations as well

but they all have the same total energy $E_t$.  This leads to the constraints

$\sum_{j}n_{j}E_{j}=E_{t}$

The probabilities of being in each box $P_j$ is easy computed for small $latex \mathcal{N}$

$P_{j}=\dfrac{\bar{n_{j}}}{\mathcal{N}!}$

where $\bar{n_{j}}$ is the average number of objects in each box.   And we can compute the average energy $\bar{E}$

$\bar{E}=\sum_{j}P_{j}E_{j}$

We want to find the most likely distribution $\Omega(n*)$ of objects, subject to the constraints on the distribution. But as soon as N gets large, we can not just evaluate these sums because they are too large.  But we note that the most likely distribution $\Omega(n*)$ is very highly peaked around   $n*$

as is the most likely average Energy

Which means that we can approximate the most likely distribution by trying to find the minimum value of $\Omega(n*)$.  Since this distribution is so peaked, we can approximate this very well by minimizing the log of the distribution

$\min\log(\Omega(n*))$ subject to our constraints

$\sum_{j}n_{j}=\mathcal{N}$ and $\sum_{j}n_{j}E_{j}=E_{t}$.

We can solve this optimization using the method of Lagrange multiplies, as we do in many convex optimization problems in ML.  This yields

$\dfrac{\partial}{\partial n_{j}}\left[\ln\Omega_{t}(n)-\alpha\sum_{i}n_{i}-\beta\sum_{i}n_{i}E_{i}\right]=0$

where $\alpha,\beta$ are the Lagrange multiplies (and which usually appear as adjustable parameters in other ML problems). We want to solve this problem in the limit $\lim\mathcal{N}\rightarrow\infty$.  We take advantage of

##### Stirling’s Approximation

$\ln\mathcal{N}!\cong\mathcal{N}\ln\mathcal{N}-\mathcal{N}$

to write

$\ln\Omega_{t}(n)=\left(\sum_{i}n_{i}\right)\left(\sum_{i}n_{i}\ln n_{i}\right)-\left(\sum_{i}n_{i}\ln n_{i}\right)$

Now we can evaluate the derivatives explicitly

$\ln\left(\sum_{i}n_{i}\right)-\ln\left(n_{i}^{*}\right)-\alpha-\beta E_{j}=0$

With this simple expression, we can find the optimized, average number of objects in each box

$n_{j}^{*}=\mathcal{N}e^{-\alpha}e^{-\beta E_{j}}$

where

$e^{\alpha}=\sum_{j}e^{-\beta E_{j}}$

and the optimal probability of being in each box

$P_{j}=\dfrac{n_{j}^{*}}{\mathcal{N}!}=\dfrac{e^{-\beta E_{j}}}{\sum_{j}e^{-\beta E_{j}}}$

which is known as as the Boltzman Distribution, or , more generally, the Gibbs Measure.

The average energy of the system, which is a kind of optimal normalization for the problem, is given by

$\bar{E}=\dfrac{\sum_{j}E_{j}e^{-\beta E_{j}}}{\sum_{j}e^{-\beta E_{j}}}$

We now recognize that the energies are log probabilities

$E_{j}\sim\log(P_j)$

and we call the normalization factor the Partition Function $Z$

$Z=\sum_{j}e^{-\beta E_{j}}$

We call $\beta$ the inverse temperature, because   $\beta\sim\dfrac{1}{T}$ in traditional thermodynamics.

We have completed out mission; we have introduced the notion of a Temperature into our probabilistic models.

#### Using the Temperature:  Simulated Annealing

We should not use a formalism unless we intent to take advantage of it.  Joachim’s uses the LME formalism to create a complicated model, but he only uses SGD to solve it.  We can do better than this.  Indeed, for Deep Learning, we will have to.

Of course, it is a bit more technical than this.  And this is not the only way to solve the problem. Plus  we still need Regularization to avoid overfitting, and obtain sparse solutions.   We will live this to another post.

Still, we see that the language of Statistical Physics offers many advantages over traditional probabilistic models–and this why it is so popular in machine learning.  In our next post we show how to combine the Entropy with an SVM to provide a very powerful, hybrid, semi-supervised learning method called a Transductive SVM.

# Music Recommendations and the Logistic Metric Embedding

In this post, we are going to see  how to build our own music recommender, using the Logistic Metric Embedding (LME) model developed by Joachims (of SVMLight fame). The core idea of this recommender is that people listen to songs in a specific sequence, and that certain songs sound better when they follow other songs.

We will break this post into 2 parts:

• First, we will summarize the key ideas of the mathematical formulation of the LME.
• Second, we will actually run the LME software and see how to build our own music playlists.

But first, a little history

Introduction:  Some History of Sequence Prediction

I first began working in sequence prediction in about 2001, where I developed the Effective Operator approach to Personalized Recommenders.  This was a recommender system that used a kind of Bayesian / Regularized Linear Regression.  Some years later, working with eBay, we researched the effectiveness of search relevance (a kind of sequential recommendation) using techniques including Joachims’ earlier work on Structural SVMs.  Recent work has focussed on built upon the  Structural SVMs for Ranking, leading to  a new, simpler  approach, Metric Learning to Rank.   Metric Embedding is, in general, a good thing to know about, and you can learn about it more generally from the University of Chicago course: CMCS 39600: Theory of Metric Embeddings.  Joachims has subsequently also considered metric learning, and here, we examine some his recent research in metric learning for sequence prediction.  Specifically, we look in detail at the Logistic Metric Embedding (LME) model for predicting music playlists.

Sequence Prediction with Local Metric Embeddings

We would  like to recommend playlist of songs.  More generally, we seek to estimate the probability of observing a new, directed sequence $D$, given a training set of a very large set of directed sequences  $\mathcal{S}^{d}_{train}$.  To do this, we define a conditional, probabilistic, vector space model  that retains the local sequence structure of the sample data, and we try to learn the optimal model via regularized maximum-likelihood estimation (MLE) .

Developed by Joachims (and following his notation), we seek to predict a (directed) sequence of states $p=\{p^{1},p^{2}\cdots p^{d}\}$ (i.e. an ngram sequence, a playlist of songs, etc. ) from a large collection of states $S=\{s_{1},s_{2}\cdots\}$, where $p^{i}\in S$.   We model each sequence $p$ as a Markov Chain, so that

$Pr(p)=\underset{i}{\Pi}Pr(p^{i}|p^{i-1})$

We embed the sequence into the Latent space, or the d-fold Tensor space  (which is isomorphic to a $(|S|\times d)$ dimensional Euclidean space )

$\mathcal{S}^{d}=S\times S\times\cdots S\cong\mathbb{R}^{|S|\times d}$, such that  $p\in\mathcal{S}^{d}$

The inference problem is to find $Pr(p)$ ; it is  solved using a straightforward Stochastic Gradient Descent (SGD) algorithm, restricted to a kind of nearest-neighbor sampling called a Landmark Heuristic

Because we have a Markov Model, we will see that all we need is to write down the local log-Likelihood, defined through the conditional transition probabilities $Pr(p^{i}|p^{i-1})$  , and, as usual, some regularization.  Joachims ignores long range interactions and only regularize locally, and I will note some opportunities to address this.

Also, we note that assuming a Markov Model is actually restrictive (Recall our post on Brownian motion and memory effects in chemical physics) .  First, it assumes that to predict the next song, the only songs that matter are the ones we listened to before this song.  Additionally, Markov Models can ignore long-range interdependencies.   A NMF based recommender does not assume this, and they include long-range effects.

Recent work (2013) shows how to parallelize the method on distributed memory architectures, although we will do all of our work on a shared memory architecture. they even provide sample code!

#### Visualizing the Embedding

For example, a typical embedding is visualized as:

The Online LME Demo provides an interactive view of the embedding.  It also includes simple modifications for adding popularity, user bias, and even semantic tags to the model (not discussed here)

#### Single and Dual Point Logistic Metric Embeddings

Joachims introduces 2 different models, called the Single Point and Dual Point models.  We  should only care about the Dual Point model, although, curiously, in the published paper (2012), the Dual Point model does not work better than the Single Point model.  So, for now, we describe both. Also, the code contains both models.

By a Metric Embedding, we mean that the conditional probability is $Pr(p^{i}|p^{i-1})$ is defined with the metric, or distance $\Delta(p^{i},p^{i-1})$, in the Latent space $\mathcal{S}^{d}\cong\mathbb{R}^{|S|\times d}$ where we have embedded our states (songs).

$Pr(p^{i}|p^{i-1})\sim\exp(-\Delta(p^{i},p^{i-1})^{2})$

What is the Embedding?

We associate each state $s_{i}$  with a vector $\vec{X}(i)\in\mathbb{R}^{|S|}$

$s_{i}\rightarrow\vec{X}(i)$

Notice Joachims writes $\vec{X}(p^{i})$ and $\vec{X}(s_{i})$ to distinguish between vectors associated with elements of a playlist and arbitrary elements of $S$.

In a model for , say, playlists, each vector  it is , initially, just a point $\vec{X_{0}}(i)$ on a $|S|$-dimensional hypercube.  We then run an inference algorithm which learns the optimal vectors $\vec{X_{opt}}(i)$ that minimize the total regularized log-likelihood $L$, as computed over the entire training data, and subject to the specific model chosen.

What is $\Delta$ ?

In the Single Point model, $\Delta_{1}$ is just the Euclidean distance between  $X(i)$ and $X( i-1)$

$\Delta_{1}(p^{i},p^{i-1})=\parallel\vec{X}(i)-\vec{X}(i-1)\parallel_{2}$

This is just a standard Vector Space model.

In the Dual Point model, however, we associate 2 points (vectors)  $\left(\vec{U}(i),\vec{V}(i)\right)\in\left(\mathbb{R}^{|S|},\mathbb{R}^{|S|}\right)$, with each state $s^{i}$ within a  sequence $s$.  We call these the entry $\vec{U}(i)$ and exit $\vec{V}(i)$ points.

$s_{i}\rightarrow\left(\vec{U}(i),\vec{V}(i)\right)$

The distance between 2 states the asymmetric divergence

$\Delta_{2}(p^{i},p^{i-1})=\parallel\vec{U}(i)-\vec{V}(i-1)\parallel_{2}$

This is like a Vector Field Model, where each state is mapped to a directed vector

$\vec{\Delta}(i)=\vec{V}(i)-\vec{U}(i)\in S$.

(In mathematical physics, the embedding is a called a Vector Field or, more generally, a Fiber Bundle ).

The Latent metric is the Euclidean distance between the end point of $p^{i-1}$ and the starting point of $p^{i}$ within a directed sequence $p$.

We can also construct sequence-independent ($\bar{U}(i),\bar{V}(i)$)  vectors by simple averaging over a large sample of known sequences (i.e. song playlists).

$i\rightarrow\left(\bar{U}(i),\bar{V}(i)\right)$

#### Model Inference

We want to find

$Pr(p)=\underset{i}\Pi\dfrac{\exp(-\Delta(p^{i},p^{i-1})^{2})}{Z(p^{i-1})}$

where  $Z(p^{i})$ is the point partition function, given by

$Z(p^{i})=\sum_{j=1}^{|S|}\exp(-\Delta(p^{i},s_{j})^{2})$

This is just the normalization of the conditional probability (above).   Note that  the  sum is over all possible states $s_{j}\in S$, and for numerical performance, we will restrict the sum to nearby states.

Actually, the final model used is somewhat more complicated because it includes other features to adjust for popularity, user bias, diversity, etc.

#### Final LME Model

$Pr(p)=\underset{i}{\Pi}Pr(p^{i}|p^{i-1})=\underset{i}\Pi\dfrac{\exp(-\alpha\Delta(p^{i},p^{i-1})^{2})+\beta b+\gamma\Delta(p^{i},p^{0})^{2}}{Z(p^{i-1},p^{0},\alpha,\beta,\gamma)}$

The optimal solution is found by maximizing the  log-Likelihood $\log(Pr(p))$, subject to regularization to avoid over-training.

(in a later post, I will explain the partition function ad where it comes from;  seriously, I have most of this written up already)

For the Single Point Model, the objective function is the log-Likelihood plus a standard 2-norm regularizer

$f_{1}(\bf{x})=\underset{U,V}{\sum}L_{1}(D|X)-\lambda\parallel\bf{X}\parallel^{2}_{F}$

For the Dual Point Model, the objective function includes 2 kinds of regularizers

$f_{2}(\bf{x})=\underset{U,V}{\sum}L_{2}(D|U,V)-\lambda\Omega(U,V)-\nu\underset{s\in S}{\sum}\Delta_{2}(s,s)_{2}$

The first regularizer is

$\bf{\Omega(U,V)}=\parallel\bf{U}\parallel^{2}_{F}-\parallel\bf{V}\parallel^{2}_{F}$

and ensures that the entry and exit vectors don’t get too large.  It is similar to the kind seen in traditional matrix factorization techniques.

The second regularizer ensures the small length sequences, and is similar in spirit to the kind of path regularizers seen  in Minimum Path Basis Pursuit methods (where, ideally, we would seek use an L1-norm for this).  This part is critical because it allows to decompose what looks like a large, messy, non-convex, global optimization problem into a convex sum of local, albeit non-convex , optimization problems:

A Local Manifold Optimization Problem

Here is the real trick of the method

By including  the dual point, local path regularizers $\Delta_{2}(s_{a},s_{b})^{2}$,  we can represent the global log-Liklihood as a sum over local log-Likelihoods, describing regularized transitions from $s_{a}\rightarrow s_{b}$

We re-write the  objective function $f_2(\bf{x})$ (here, Joachims  confuses the notation and includes the regularizer in the log-Likehood, so be careful)

$\sum_{a=1}^{|S|}\sum_{b\in C_{a}}T_{ab}l(s_{a},s_{b})-\Omega(U,V)$

where

• $C_{a}$ is the collection of states close to $s_{a}$ (i.e  all nearest neighbors,  observed bigams, etc.),
• $T_{ab}$ is the number of transitions from $s_{a}\rightarrow s_{b}$, and is precomputed from the training data, and
• $l(s_{a},s_{b})$ is the dual point, local log-likelihood term for the transition $s_{a}\rightarrow s_{b}$

$l(s_{a},s_{b})=-\Delta_{2}(s_{a},s_{b})^{2}-\log(Z_{2}(s_{a}))$

• $\Omega(U,V)$ is the global regularizer, but which is trivial to evaluate locally (using SGD)

While the optimization problem is clearly non-convex, this decomposition allows us to ‘glue together’ a convex combination  of locally non-convex but hopefully tractable optimizations, leading to what I call a manifold optimization problem.   ( Indeed, from the point of view of mathematical physics, the dual point embedding model has the flavor of a non-trivial fiber bundle,  and the path constraints act to minimize the curvature between the bundles; in other words, a gauge field theory. )   Usually I don’t trust non-convex methods.  Still,  it is claimed good solutions can be found with 100-200 iterations using standard SGD.  We shall see…

#### SGD Update Equations

Joachims suggests solving the SGD by iterating through all states (songs) $s_p$ and updating the exit vectors $U(s_p)$ with

$U(s_p)\leftarrow U(s_p)+\dfrac{\tau}{N}\left[\underset{b\in C_p}{\sum}T_{pb}\dfrac{\partial l(s_p,s_b)}{\partial U(s_p)}+\dfrac{\partial\Omega(U,V)}{\partial U(s_p)}\right]$

and for each $s_p$, updating the entry vector $V(s_q)$ for each possible transition $s_{p}\rightarrow s_{q}$

$V(s_q)\leftarrow V(s_q)+\dfrac{\tau}{N}\left[\underset{b\in C_p}{\sum}T_{pb}\dfrac{\partial l(s_p,s_b)}{\partial V(s_q)}+\dfrac{\partial\Omega(U,V)}{\partial V(s_p)}\right]$

where

• $\tau$ is the SGD learning rate, and
• $N$ is the number of transitions sampled

The partials over the regularizers are simple

$\dfrac{\partial\Omega(U,V)}{\partial U(s_p)}=-2\lambda U(s_{p})-2\nu\vec{\Delta}_{2}(s_{p},s_{p})$

$\dfrac{\partial\Omega(U,V)}{\partial V(s_q)}=2\lambda V(s_{q})-2\nu\vec{\Delta}_{2}(s_{q},s_{q})$

The partial over $U(s_p)$  is written to  samples all possible exit vectors $s_l$

$\dfrac{\partial l(s_p,s_b)}{\partial U(s_p)}=-2\vec{\Delta}_{2}(s_p,s_b)-\dfrac{\sum_{S_l}\exp(-2\Delta_{2}(s_p,s_l)^{2})\vec{\Delta}_{2}(s_p,s_l)}{Z^{r}(s_p)}$

while the partial over $V(s_q)$ is considerably simpler since we only sample the individual transitions (from exit to entry) $s_{p}\rightarrow s_{q}$ for each state, yielding

$\dfrac{\partial l(s_p,s_q)}{\partial V(s_q)}=2\vec{\Delta}_{2}(s_p,s_q)-\dfrac{\exp(-2\Delta_{2}(s_p,s_q)^{2})\vec{\Delta}_{2}(s_p,s_q)}{Z^{r}(s_p)}$

Notice $Z^{r}$ refers to a restricted partition fucntion that only samples states in $C_p$ (i.e nearest neighbors, or what Joachims called the Landmark Heuristic)

Possible Extensions

In a simple Netlfix-style item recommender, we would simply apply some form of matrix factorization (i.e NMF) to $T_{ab}$  directly, and ignore the sequence structure.  Here, it is noted that we have some flexibility in computing  $T_{ab}$ , and we could both Kernalize and Regularize it before running the inference algorithm.

Next Steps

Now that we have a basic review of the model, next we will start making our own personalized playlists.  In our next post, we will look in detail at the LME Software and build some actuals playlists.

Stay tuned

References

See Joachims Website on Playlist Prediction and the references list

# A Ruby DSL Design Pattern for Distributed Computing

Frequently in my work in big data and machine learning, I need to run large calculations in parallel.  There are several great tools for this, including Hadoop, StarCluster,  gnu-parallel, etc.   The ruby world lacks comparable tools although ruby has had distributed computing for a long time:

http://blog.new-bamboo.co.uk/2012/04/11/the-druby-book-distributed-and-parallel-computing-with-ruby-is-finally-out

Having learned ruby while at Aardvark, but not being able to read Japanese , I decided to write my own simple tool, a ruby DSL for distributed computing.   Here I discuss the ruby DSL design pattern and it’s first implementation , for a distributed web crawler and scraper.

As usual, this post is motivated by a question on Quora, “How can we define our own query language?” .

#### DSL Example

Here we show an example of the DSL pattern in action. Imagine we have a master node and several worker nodes, communicating by a queue:

We want to load a block of work onto the master, have it distribute the block to the workers, execute the block in parallel, and cache the results back to the master.

In other languages, like Java or Python, we would need a compiler-compiler like ANTLR.  In Ruby, we can use the magical meta-programming, and the sourcify gem, to quickly build a DSL and execute it remotely.  Lets take a look at a simple example of what our DSL should look like:

#!/usr/bin/env ruby
require 'dsl'

include Dsl

worker.doit do
p "I am doing some work in parallel on several workers"
end
end

The doit &block is the basic method of our DSL.  We want to invoke doit on the master node, but have it executed remotely on the worker node(s).   We do this using the  sourcify gem , which lets us convert the block to source (i.e. Proc#to_source), and ship the source to the worker (i.e via a queue in redis) to be executed by instance_eval(source), or as a singleton method. Note: sourcify is a workaround until ruby-core officially supports Proc#to_source. It is a little buggy…so be careful.

Here is a complete example.  The DSL is broken into 3 parts: Driver, DSL::FrontEnd and DSL::CoreDSL::FrontEnd plays the role of a traditional parser, and DSL::Core is the remote interpreter. Driver and DSL::FrontEnd are executed on the master, and DSL:Core on the workers.

require 'sourcify'

module Dsl

module FrontEnd
def self.included(base)
base.send :include, InstanceMethods
end

module InstanceMethods
def init(opts={}, &block)
@doit_block = nil
yield self if block_given?
end

def doit(&block)
@doit_block = block
self
end

def block_sources
blocks = {}
blocks[:doit_block] = @doit_block.to_source
return blocks
end

end
end

end

DSL::FrontEnd uses the ruby module mixin meta-programming pattern; when Driver includes it, the doit method is added to the Driver instances.

DSL::load creates an instance of a worker  in the do block , which in an instance Driver and therefore now implements the doit method.

Using this pattern, we can create a super simple, extensible DSL that can support a wide range of methods to run on the worker nodes.

#!/usr/bin/env ruby
#
# Copyright (c) 2013 Charles H Martin, PhD
#
#  Calculated Content
#  http://calculatedcontent.com
#  charles@calculatedcontent.com
#
require 'dsl/front_end'
require 'json'

module Dsl

end

class Driver
include FrontEnd

def initialize(opts = {}, &block)
init(opts) #_dsl_front_end
yield self if block_given?
end

# store block_sources.to_json in redis
end

self.new(opts) do |core|
yield core if block_given?
end
end

end

end

When doit is invoked, the &block is converted to source, and then stored in cache (redis). When the worker is run, it grabs the block source (out of redis) and calls perform(source_text), which then evaluates the code block in context of the worker instance using instance_eval.  (Note that in a production system, we may want to compress the source to save memory)

require 'json'
module Dsl

module Core
def self.included(base)
base.send :extend, ClassMethods
# base.send :extend, InstanceMethods
end

module ClassMethods
def perform(source_text)
@doit_block = JSON.parse(source_text)['doit_block']
doit
end

def doit
instance_eval(@doit_block).call
end

end
end

end

Also, to be a bit more efficient, we could also run the worker blocks in a batch mode and evaluate the DSL as a singleton method:

module Dsl
module FastCore
def self.included(base)
base.send :extend, ClassMethods
end

def init(source_text)
block = JSON.parse(source_text)['doit_block']
define_singleton_method(:doit, block)
end

def perform(source_text)
doit
end

end
end
end

Also, notice that because the DSL block is being executed on the worker, it does not carry it’s local context from the master. Variables defined in the do block are not visible inside the doit worker method:

Dsl::load do |worker|
mvar = "I am a variable on the master"
worker.doit do
p "the worker can not see mvar"
end
end

And there we have it; a simple and lightweight way to create a simple ruby DSL for distributed computing.

If you would like to see the design pattern in action, you can can clone and run the open source  Cloud-Crawler .

Related:

http://charlesmartin14.wordpress.com/2013/04/22/cloud-crawler-an-open-source-ruby-dsl-for-crawling-the-web/

# Causality, Correlation, and Brownian Motion

A recent question on Quora asked if machine learning could learn something from the Black Scholes model of Finance

http://www.quora.com/Machine-Learning/Can-we-learn-any-lessons-from-the-Black-Scholes-solution-to-pricing-risk-to-machine-learning-algorithms-for-personalization-recommendation-algorithms

I have been curious about this myself, but from a slightly different perspective, which I share here:

#### Introduction

There is a deep relation between statistical mechanics (stat mech) and machine learning (ML). From Exponential Families to Maximum Entropy to Deep Learning, stat mech is all over ML. Here, we are interested in distinguishing between causality and correlation, and the hope is we can learn from stat mech how to do this. We are motivated by a recent neuroscience paper [3] that suggests a new approach to Granger Causality via non-equilibirum statistical mechanics, specifically the Mori-Zwanzig Projection Operator formalism (see [6], my favorite reference, or just Google it).  Also, very recent results [10] demonstrate that one can apply  the Mori-Zwanzig formalism to a set of coupled stochastic variables like we see in Granger Causality  and other time series models.

This is, however, pretty esoteric stuff, even if you have a PhD in Theoretical Chemistry like me. So before we dive in, in this post, we first look at a something ubiquitous in science: Brownian Motion

#### Causality in Statistical Mechanics

What is causality? Let’s pose it as a problem in time series analysis. Say we want to describe the dynamics of a system where one set of variables $X(t)$ dominate, but are driven, seemingly stochastically, by some other set of variables $Y(t)$.

For example, $X$ might represent a particle floating in a viscous fluid $Y$. As the particle moves, the fluid relaxes around it, leaving a slowly decaying trail.

So even though the particle $X$ appears to move randomly, its position at time $t+\delta t$ depends on both it’s entire history $X(t)$ and the history of the local surroundings $Y(t)$. The more viscous, or stiff, the fluid is, the more the surroundings matter.

When can we say that $Y(t)$ causes $X(t)$ ?

The challenge in statistical mechanics is to decide when the particle is just random (i.e. a Brownian motion) and when the fluid is strongly affecting (i.e causing) a part of the dynamics.

#### Stochastic Problems in Astronomy and Physics

To get started, we need a some basic physics.

There is a classic review of Stochastic Problems in Astronomy and Physics [1] by the late Subrahmanyan ChandrasekharNobel Laureate (Physics, 1983) and former head of the department of physics when I was a student at the University of Chicago.

I am a firm believer that every scientist should study and know the great works of the Nobel Laureates, so I strongly encourage the reader to dig deep [1] on their own–or to at least continue reading.

Chandrasekhar introduces the Langevin equation, and uses it to describe stochastic processes such as Random Walks and Mean-Reverting Walks (which are important for cointegration, but not discussed here)

#### Random Walks

When our particle is behaving randomly, and does not remember its history, we say it is undergoing a Random Walk. This is also known as Brownian Motion. Below we plot the path of several 1-D random walks over time. Notice that they all start at (0,0), but, over time, eventually diverge from each other.

Brownian motion is ubiquitous; it appears everywhere in Science. In Finance and Economics, it is described as a Wiener Process or with the Ito Stochastic Calculus. However, if we go old school, we can also describe the random walk using something familiar from high school calculus/physics. The model is

#### the Langevin equation:

$m\dfrac{d^{2}}{d^{2}t}x(t)=-m\gamma\dfrac{d}{dt}x(t)-R(t)$

where $m$ is the mass, $m\gamma$ is a friction coefficient, and $R(t)$ represents the noise, interpreted as a random force.

What is this? We all know–or have least heard of–Newton’s famous equation

$F=ma$

Force equals mass times acceleration

Here, we flip this around, and have

$ma=F=F_f-F_R$

mass ($m$) times acceleration $\dfrac{d^{2}}{d^{2}t}x(t)$ = Force

= Frictional Force ($F_f$ minus Random Force  ($F_R$)

where

the Frictional Force, $F_f=m\gamma\dfrac{d}{dt}x(t)$,  is a constant times velocity

and

the Random Force $F_R=R(t)$ is defined below

.
The random force represents the interactions of the particle with the solvent, which undergoes the thermal, random fluctuations (shown here)

We can not describe the random force with a dynamical equation, so, instead, we define $R(t)$ through it’s time-time auto-correlation functions

$\langle R(t)R(t)\rangle=0$

$\langle R(t_1)R(t_2)\rangle=2\pi_{R}G\delta(t_1-t_2)$

This correlation function is dot product in a Hilbert space; in fact it is a Kernel.

Here it is a constant; below we define a ‘memory Kernel’ that describes what is causing the random behavior (and also deals with the failures of the theory at short times)

#### Diffusion and the Einstein relation:

The particle $X$ moves around randomly; we say it diffuses. We identify a Diffusion constant, which is just the limit of average mean squared position, taken to very long times

$D=\underset{t\rightarrow\infty}{\lim}\dfrac{1}{2t}\langle (X(t)-X(t_{0}))^{2}\rangle$

It turns out the random force is not just any random force; it is intimately related to the other parameters in the equation. Einstein noted that the friction constant $m\gamma$ is related to the diffusion $D$ constant by

$D= \dfrac{k_{B}T}{m\gamma}$

where $k_{B}$ is Boltzman’s constant, and $T$ is the temperature.

With a little work (not shown here), we can also use the Einstein relation and our definition of the random force to express the Diffusion constant in terms of the time correlation of the velocity $u(t)$

$D=\int_{0}^{\infty}\langle u(t_{0})u(t_{0}+t)\rangle dt$, where

This means that if the velocities are weakly correlated (over time), the surroundings drag on the particle, and it diffuses slower. On the other hand, when the velocities are strongly correlated, the particle does not feel the surroundings, and it diffuses faster.

Because the velocity platys such an important role, frequently we express the Langevin equation in terms of the velocity

$m\dfrac{d}{dt}u(t)=-m\gamma u(t)-R(t)$

We can also express the velocity-velocity auto-correlation function as a simple exponential decay

$\langle u(t_{0})u(t_{0}+t)\rangle=\langle u^2\rangle\exp(-\gamma|t|)$

where

$\langle u^2\rangle=\dfrac{D}{\gamma}$

Not that our theory breaks down at short time scales because we expect the the velocity process to be stationary process. That is, we expect

$\langle u(t)\dfrac{d}{dt}u(t) \rangle=0$

which is clearly not satisified.

On long timescales, the Langevin equation describes a mathematical Brownian motion, but on small timescales, the Langevin equation includes inertial effects that are not present in the Brownian description. These inertial affects can be very real, as shown above in the image of the pollen in the viscous fluid, and are corrected for below.

Also, we are considering systems that actually have velocities–this is important since we don’t usually think of pure stochastic processes, such as economic or financial time series, as having a well defined or instantaneous velocity.

Still, we are on the right track. We can relate the random forces directly to the macroscopic diffusion $D_{u}$ (in velocity space) through the correlation function of the random forces

$D_{u}\sim\int_{0}^{\infty}\langle R(t_{0})R(t_{0}+t)\rangle dt$

So we can determine how strongly the environment ’causes’ the dynamics by measuring an auto-correlation function of the random forces

We begin to see the link between correlation functions and causality. This is a direct example of

the Fluctuation-Dissipation theorem

The fluctuation-dissipation theorem is a general result of statistical mechanics that quantifies the relation between the fluctuations in a system at equilibrium and its response to external perturbations. Basic examples include Brownian motion and Johnson–Nyqvist noise, but this phenomena also arise in non-equilibrium systems, and, perhaps, even in the neocortex.

Here, the same random forces that cause the erratic motion of a particle also causes drag. If something pulls the particle through the fluid, it feels this drag. The particle’s random motions and dissipative frictional forces have the same origin, or cause.

We can see this in 2 different ways:

1. We can apply an external force to the system and monitor the (linear) response
2. We can construct equilibrium velocity distribution and relate this to the self-diffusion

More generally, we can observe any external action $Y$  that “causes” $X$ to move by inferring the ‘equilibrium’ distribution, and  computing the appropriate correlation function of the random forces; this is our link to causality.

What does this have to do with Machine Learning? Notice I said infer. For a general system, we will infer the so-called ‘equilibrium’ , or most likely, (Boltzman) distribution, using a variational method. Then we can directly evaluate the correlation functions (which are the partial derivatives of our calculated partition function). But first, we need a more robust Langevin equation to work with.

#### the Generalized Langevin Equation (GLE)

We can generalize the Langevin equation by redefining the Random Force through a different Kernel. Let

$\langle R(t_1)R(t_2)\rangle=\xi(t_1-t_2)$

where $\xi(t)$ is called the memory Kernel.

This gives rise to a Generalized Langevin Equation (GLE)

$\dfrac{d}{dt}u(t)=-\int_{t_0}^{t}\xi(t-t')u(t')dt+R(t)$

A typical application of the GLE in chemical physics to extend Brownian to describe the dynamics of a particle in a viscoelastic fluid (a fluid that is both thick and deformable); the memory Kernel gives rise to a generalized frictional force.

Some examples:

When memory effects matter, we have causality.  Memory effects an appear in the velocity auto-correlation function (VACF) either through the presence of an oscillatory behavior or by means of slowly decreasing correlations [11].   In liquids, oscillations appear, for example, when particles are trapped, or caged, by their surroundings.  And correlations slowly decrease in cases of anomalous diffusion, such as liquid-like defects [12].  Indeed, these effects arise all over physics, and, we suspect, in other systems as well.   Only very recently, however, has the Mori-Zwanzig formalism been recognized as useful as a general mathematical tool for both optimal prediction [13],  Markov Models [14], and Neuroscience [3]

#### Towards a Generalized Granger Causality

In our next post, we will apply the GLE to problems in general machine learning. First, we will introduce a more abstract form of the GLE. We will then derive this GLE using the Mori-Zwanzig Projection Operator formalism from statistical mechanics.

Then, we will show how to infer the model parameters using variational techniques. Finally, we will demonstrate the relationship between Granger causality and a new measure that, we hope, will be more suitable in noisy, non-linear regimes. Please stay tuned.

#### References

[1] Engle, Robert F., Granger, Clive W. J. (1987) “Co-integration and error correction: Representation, estimation and testing“, Econometrica, 55(2), 251–276.

[3] D.Hsu and M. Hsu (2009) Zwanzig-Mori projection operators and EEG dynamics: deriving a simple equation of motion

[4] Peter Hänggi, Fabio Marchesonia! Introduction: 100 years of Brownian motion

[6] J.P. Hansen and I. R. McDonald (1986) , Theory of Simple Liquids

[7] http://www.dam.brown.edu/documents/Stinis_talk1.pdf

[8] Kubo, the Fluctuation-Dissipation Theorem

[10] Jianhua Xing and K. S. Kim , Application of the projection operator formalism to non-Hamiltonian dynamics J. Chem. Phys. 134, 044132 (2011)

[12] C. H. Martin and S. J. Singer, “The behavior of point defects in a model crystal near melting”,Phys. Rev. B44: 477 (1991)

[13] Alexandre J. Chorin*, Ole H. Hald*, and Raz Kupferman, Optimal prediction and the MoriZwanzig representation of irreversible processes, PNAS (1999)

[14a] C. L. Beck, S. Lall§, T. Liang and M. West, Model reduction, optimal prediction, and the Mori-Zwanzig representation of Markov chains
[
14b] Madhusudana Shashanka, A FAST ALGORITHM FOR DISCRETE HMM TRAINING USING OBSERVED TRANSITIONS

Bring the Noise