# Machine Learning with Missing Labels: Transductive SVMs

SVMs are great for building text classifiers–if you have a set of very high quality, labeled documents.

In many real world problems, we just don’t have enough labelled data. What can we do?

• We can crawl the web for more labelled data. For this purpose, here at Calculation Consulting we built a production quality crawler
• Pay mechanical Turks to label the documents.  This is actually harder than it sounds.
• We can use what labels we have, and try to guess the labels of the unlabeled documents.

Guessing labels is called Transductive Learning and is the topic of this post.

### SVMs and Incremental Training

We have a huge collection of documents, perhaps with just a few labels.  The obvious, or perhaps naive, thing to do is build a classifier using what we have, and then try to scale it out incrementally.  That is, we select documents labeled that are labelled high confidence by the smaller model and use them to build a training data set for a larger SVM model:

So this seems pretty simple — what could go wrong? Since the initial model has very little labeled data, it is going to make lots of mistakes, even in the high confidence regime.   We are adding in some data that is mis-labeled.    As we incrementally retrain,  we amplify these errors, thereby reducing overall accuracy.  So what can be done?

### Transductive Learning

Suppose that, we we add the new data in, instead of just naively using our smaller model to label the new data, we could automatically somehow switch the labels on these new documents, and then pick the best model from all these possibilities.    That is, we run an SVM where we retain the labels on the labeled (blue) documents, but we let the labels on the test (purple) documents float.  We constrain the problem so that at least $r$ unlabeled documents are labeled positive.  We then pick the model with the largest margin. This leads to the following non-convex optimization problem for the Transductive SVM (TSVM)

$\min\dfrac{\lambda}{2}\|w\|^{2}+\dfrac{1}{2L}\sum\limits_{i=1}^{L}l(y_{i},w^{T}x_{i})+\dfrac{\lambda'}{2U}\sum\limits_{j=1}^{U}l(y_{j},w^{T}x_{j})$

subject to the constraint $\dfrac{1}{U}\sum\limits_{j=1}^{U}\max[0,sign(w^{T}x_{j}')]=r$

where the $l$ is the loss function (such the SVM hinge loss or, as here, the L2_SVM loss).  There are $L$ labelled documents $(x_{i})$ and $U$ unlabeled documents $(x_{j})$, and $\lambda,\lambda'$ are adjustable parameters.  The constraint $r$ fixes the number of positively labeled documents; this constraint I worry about the most but lets just try it anyway.

#### Why does Adding Test Data to the Training Set Improve the Accuracy?

When we don’t have enough training data, the SVM optimization problem is still convex, but the optimal solution may be spurious.  Imagine that we are trying to train a binary text classifier with only say 100 documents in each class.  The classifier will learn how to separate the documents, but it will not be able to generalize to new documents because it simply has not seen enough examples or really enough words to build a usable feature space.    The solutions may be spurios.

Indeed, there may exists many margins that separate the data, and the largest may not be the one that generalized best.  That is, without enough training data, the SVM may mis-learn and think that noise words are driving the  classification.

We want is a classifier that generalizes to unseen test documents;

if we don’t have test labels, perhaps we can guess them.

When we add unlabeled documents , or test data (circles) , to the classifier training data.  If we can guess the labels of the circles correctly, we can find the optimal solution because it is less ambiguous:

Notice that since we are seeking a maximum margin solution, we implicitly are assuming that the documents form clusters in feature space because the margin lives in the low density regime between document clusters.  We are also are increasing the size of feature space, thereby greatly improving the generalizabilty of the model.

### TSVM Algorithm: Deterministic Annealing

Unlike a standard (inductive) SVM, the TSVM optimization problem is highly non-convex and quite difficult to solve exactly.  Clearly we can not switch the labels on all the new documents; if we add in N documents, then we have $2^{N}$ combinations to evaluate.  So how do we solve it?

There are 2 well known approaches to solving the TSVM–both of which are readily available in the open source package SvmLin:

• Label Switching Heuristics  (-A 2) , and
• Deterministic Annealing (-A 3)

We will discuss the Deterministic Annealing algorithm because it is very nice example of how to combine the ideas of SVMs with the techniques of Statistical Mechanics discussed in our last post.   Other, more modern Transductive learning algos may use alternate methods such as convex-concave / difference-convex programming [8], [9].  Many, many other approaches for non-convex optimization also exist.

We do note, however, that in practice, I have not a great difference between the -A 2 and -A 3 options, and, in fact, sometimes the -A 2 option is more stable.  This is constant with the known research.  Also, in the original paper [3], TSVM  refers to option  -A 2, and DA to -A 3.  Here, we use the term TSVM to mean any implementation.

#### Combining SVMs with Statistical Mechanics

We previously, discussed basic Statistical Mechanics and Simulated Annealing, where we derive the concept of Entropy.

To solve the  TSVM / DA optimization problem, we need relax the constraint on the labels that they be integers, and, instead let them represent the probability of being in the (+) or (-) class.  We then introduce a temperature $T$ that First, lets get some labels that range from 0 to 1:

$\mu_{j}=\dfrac{1+y_{j}}{2}$ so that

$\mu_{j}=1$ when $y_{j}=1$,  and

$\mu_{j}=0$ when $y_{j}=-1$

This lets us re-write the TSVM optimization problem as

$\min\dfrac{\lambda}{2}\|w\|^{2}+\dfrac{1}{2L}\sum\limits_{i=1}^{L}l(y_{i},w^{T}x_{i})+\dfrac{\lambda'}{2U}\sum\limits_{j=1}^{U}(\mu_{j}l_{2}(y_{j},w^{T}x_{j})+(1-\mu_{j})l_{2}(y_{j},-w^{T}x_{j}))$

where we write $l_{2}$ to specifically indicate that the loss function is the L2-SVM loss.

We now take a trick from the Ising model in Statistical Mechanics.  We want to transform the labels

$\mu_{j}\rightarrow p_{j}$

into probabilities that range from 0 to 1:  $p_{j}\in[1,0]$

and are normalized to the total number of expected positively labeled instances: $\dfrac{1}{U}\sum_{j=1}^{u}p_{j}=r$

How do we incorporate probabilities into our SVM optimization problem?  We are already selecting the maximum margin of the training data; we now need to also select the optimal probabilities — the maximum Entropy — of the labels for the unlabeled data.

We extend the TSVM max-margin problem, adding a term that maximizes the total Entropy  $S$ of the possible configurations of the unlabeled test data

$S=-\sum_{j=1}^{u}p_{j}\ln(p_{j})+(1-p_{j})\ln(1-p_{j})$

Actually, since we use the minimize $\|w\|^{2}$, we  actually minimize the negative Entropy $-S$.  This leads to a new TSVM / DA optimization problem

$\min\dfrac{\lambda}{2}\|w\|^{2}+\dfrac{1}{2L}\sum\limits_{i=1}^{L}l(y_{i},w^{T}x_{i})+\dfrac{\lambda'}{2U}\sum\limits_{j=1}^{U}(\mu_{j}l_{2}(y_{j},w^{T}x_{j})+(1-\mu_{j})l_{2}(y_{j},-w^{T}x_{j}))+\dfrac{T}{2U}\sum_{j=1}^{u}p_{j}\ln(p_{j})+(1-p_{j})\ln(1-p_{j})$

where the additional parameter  $T$ is a Lagrange Multiplier for Entropy $S$ .  We also have to add in the normalization (and class balancing) constraint

$\dfrac{1}{U}\sum_{j=1}^{u}p_{j}=r$

Handing this constraint is a bit more complicated, and we refer to [3];  we do note that the constraint can be re-written into what looks like a Fermi energy, which is quite interesting.

We maximize the Margin on the training (and test) sets,

and maximize the Entropy on the test set.

Solving this optimization problem requires alternating between solving the labeled SVM optimization problem (maximizing the margin $w$) and finding the optimal probabilities for the unlabeled instances (optimizing $p_{j}$).  The result are labels (with confidences) for our previously unlabeled data, and, if we want to use it, a new SVM predictive model.

We also need to be careful not to overtrain.  So I will select out the newly labeled data with the highest confidence labels and then use this as training data for the production SVM (i.e LibLinear).  This would allow the data scientist to then take a label set of say 100 documents per class, and automatically extend it to say 10,000 labeled documents per class without the use of mechanical Turks or crawling the web.

Despite the wonderful advantages promised by Transductive (and SemiSupervised) Learning, it is quite hard to apply in practice and has not achieved even close to the popularity of its Supervised and Unsupervised counterparts.  TSVMs need both proper tuning [8] and well designed data sets.  We need to understand it’s practical implementation much better to use it.

#### Practical Issues:  Selecting the best labelled data

An initial training data set may be naively separable if it is small and has a large number of features :

naively separable:

$N_{instances}\ll N_{features}$

As a data set grows, we may have far more instances than features:

potentially unseparable:

$N_{instances}\gg N_{features}$

which makes it far more probable that some instances will lie in the slack region between classes.

We usually run a tool like LibLinear , which implements either a soft-margin SVM or L2-regularized Logistic Regression.  Indeed, in practice is usually impossible to tell the difference between these 2 models.  The additional slack terms can be though of as an extended regularizer or as modifying the SVM loss function.  Indeed, even Vapnik himself has argued that it can be very beneficial to run an SVM with a large number of adjustable slack variables (i.e in master learning and/or his universum model [5]). The SvmLin implementation of the TSVM is a hard margin (binary, no slack variables) SVM, but it does use the L2-SVM loss function (and not the harder hinge loss).

It is suggested to run the TSVM slowly and repeatedly to relabel documents, and, on each iteration, to only take new labels that have moderately high confidence (and therefore, hopefully, outside the slack region)

#### Practical Issues:  designing multiple 1-vs-all  data sets

Additionally, SvmLin only implements a binary SVM, whereas many practical applications require a multi-class SVM.  More generally, one might try to extend the TSVM to a DA type algorithm using say a Potts model–a classic model in physics that generalizes an Ising model…but for now we have to use SvmLin to get practical solutions.

In a normal SVM, balancing the class data is important; in a Transducitve-SVM, it is absolutely critical.  For example, this recent study on TSVMs discussed the issue.  This can be quite challenging in real world, production environments, and if you want to apply this powerful technique, extra care should be taken to design the TSVM (training + test) sets.

This means, in order to apply SvmLin, it is necessary to carefully construct  a collection of 1-vs-1 or 1-vs-all binary SVMs.  Each individual TSVM classifier should be designed to extend and correct the label set of a given set of instances or documents.

For example, suppose we have a  5 class multiclass model, but we only have a few labeled instances for each class.  If we try to extend the set of instances for class 1, we need to select either four data sets,  for four 1-vs-1 binary TSVMs, or one large data set for a 1-vs-all binary TSVM. If we choose the 1-vs-all approach, then we have to make sure the positive and negative sets are balanced in a way that the final  labels are better than the initial–and this is not easy.  The fraction $r$  is a normalization constraint, and keeping it fixed means the negative set must be carefully constructed to ensure that final result has $r=1/2$ positively labeled instances that represent the actual positive class well.

The TSVM can help add labeled data to a well designed model; it can not be used to repair a poorly designed classification model .

Cnsider a binary SVM trained on a small training set of cat and dog videos.   We then apply the model to some unseen test videos, and select a large number of videos that were labelled either cats or dogs.  The model will certainly make mistakes, but  we can hope that the errors are balanced across both classes.  So when we apply the TSVM to the entire training + test  set, half of the data will still actually be dog videos.

On the other hand, let’s consider the poorly designed classifier shown below.  Here, we have 3 classes:  dogs, cats, and animals.  The third class is trained for animal videos, and may contain dogs, cats, horses, mice, etc.  This is a bad design, and we would see this, say, in the confusion matrix for the entire model.  When we create the increment set, to train the TSVM, we have no idea what fraction of the training + test set are actually dog videos.  So we can not really fix the normalization constraint $r$, and resulting TSVM model may behave screwy.

#### Epilogue

Transductive Learning was first proposed by more than thirty years ago by Vapnik and Chervonenkis [6] , and it is a critical idea in the development of the famous VC Theory of Statistical Learning [1].  In a future post, we will look in detail at the role of Transductive Inference in the VC theory and its modern form using the Rademacher Complexity [6].

Practically,  there are few useful Transductive and Semi-Supervised ML algos to use :

In our next post,  we will run some experiments with the SvmLin code and see how well it works on some real world datasets.

To get involved, goto    https://github.com/CalculatedContent/tsvm

#### References

[1]  Vladimir Vapnik. Statistical learning theory. Wiley, 1998. ISBN 978-0-471-03003-4.

[2] T. Joachims, Transductive Inference for Text Classiﬁcation using Support Vector Machines, ICML 1998

[3] V. Sindhwani, S. Sathiya Keerthi, Large Scale Semi-supervised Linear SVMs, 2006

[4] O. Chapelle, B. Scholkopf, A. Zien, Semi-Supervised Learning, 2006

[5]  http://www.cs.princeton.edu/courses/archive/spring13/cos511/handouts/vapnik-slides.pdf

[6] Vapnik, V., & Chervonenkis, A.  The Theory of Pattern Recognition, Moscow 1974

[7]  Ran El-Yaniv, Dmitry Pechyony, Transductive Rademacher Complexity and its Applications

[8] J. Wang,  X. Shen,  W. Pan On Transductive Support Vector Machines

# Metric Learning: Some Quantum Statistical Mechanics

I wrote this for my buddy Sebass; just a quick review of quantum stat mech:

In our last post, we presented a formalism for music recommendations called Logistic Markov Embedding (LME).  This  technique uses some mathematics that arises in other modern machine learning methods, such as Maximum Entropy  (MaxEnt) methods, and the Restricted Boltzmann Machines (RMBs) in Deep Learning.

In this post, we are going to take a step back and  understand where the idea for a metric embedding comes from and the mathematical formalism employed by Joachims. The Metric Embedding in MLE is a kind of Energy model  — an idea from Quantum Statistical Mechanics.So we will review the pedagogic formalism of  Statistical  Mechanics for Thermodynamics.

We will follow the classic text by T. Hill, and one of the first books I ever read in college.  Also, we will not look at more abstract version of Quantum Statistical Mechanics because we don’t need to deal with the phase space formulation until we look at time-dependent processes like Causality (but eventually we will get there).  For a deeper view of this subject, we defer to the classic text by Jaynes, Probability Theory: The Logic of Science.

# Probabilities and the Partition Functions

Frequently we see expressions for probabilities, as in the LME approach,  written as:

$Pr(p)=\underset{i}\Pi\dfrac{\exp(-\Delta(p^{i},p^{i-1})^{2})}{Z(p^{i-1})}$

where $Z(p^{i})$ is partition function, given by

$Z(p^{i})=\sum_{j=1}^{|S|}\exp(-\Delta(p^{i},s_{j})^{2})$

Naively, we can just think that we are defining the probability in an abstract way, and that $Z(p^{i})$ is just the normalization.  But there’s more to it.  In fact, it is deeply connected  the Boltzman / Gibbs Distribution

So why in the world do we do this?

#### Convex and non-Convex Optimization

In simpler machine learning algos like SVMs and Logistic Regression, we can readily solve the optimization problems  using Stochastic Gradient Descent (SGD)– because these are convex optimization problems.

Below we see two optimization surfaces: our goal is to find the global minimum.   The top problem is easy to solve, and just about any decent descent method will do (unless we need sparse solutions, which makes this a little trickier).  The bottom problem, however, is filled with hills and bumps, and even looking at it, we are not sure where the bottom is.

So how do we find it?  Typical solvers for non-convex problems include Genetic Algorithms, Convex-Concave/DC programming,  BackProp (for ANNs), and Simulated Annealing.

In Simulated Annealing, we run the solver until it gets stuck.  We then  heat the system, making it jump over local humps. We then cool it down–we run the solver again until it stops.  We repeat this over and over until we find the global minimum .  A common application is in the simulation of liquids and crystals. To do this, we need to define a temperature.

#### Adding Temperature to Probabilistic Modeling

In traditional probability theory, we require the individual probabilities sum to 1:

$\sum P_{1}=1$

Of course this is always true, but we don’t have to enforce this all the time; we can relax this constraint and simply enforce it at the end of the problem.  This is exactly what we do in Simulate Annealing. Instead of creating probability models, we create Energy Models.  The energies are essentially the log of unnormalized probabilities

$E_{j}\sim\log(P_j)$

Which is what we see in LME.  We only require  the total energy $E_t$ to be constant–not 1.  This means when we compute $P_j$

$P_{j}=\dfrac{\exp(-\beta E_j)}{Z}$

we need the normalization factor $Z$

$Z=\sum \exp(-\beta E_i)$

#### Modeling with Energies

Lets say we have a bunch of objects that we would like to distribute (or cluster) into  boxes, sets, or sequences in some optimal way.  We might be placing balls into boxes, or we could be assigning documents into clusters, or we are finding personalized playlists.  We want to find the optimal way of doing this through some constrained optimization problem.   Imagine we have a container partitioned into small, disjoint boxes or sets $j$.  We have

1.  the total number of objects $latex \mathcal{N}$ is fixed. For us, the total number songs.
2.  the total ‘Volume’  V is fixed.  This will come up later when we create the metric embedding

We wanted to find the optimal probability of a song to appear in a playlist.  Here, more generally, we seek the probability $P_{j}$, of putting an object in a box $j$, subject to our prior knowledge of the problem.  In other approaches, we might define a  conditional probability for each box.  Here, we say each box has an energy $E_j$ . We require total Energy $E_{t}$ of the system is always fixed. We can place  $n_{j}$ objects into each box $j$. We only require

$\sum_{j}n_{j}=\mathcal{N}$

Each combination, or set of numbers,  ${n_1,n_2\cdots}$ constitutes a distribution $\Omega_{t}(n)$.  There are many ways of doing this;  total number of combinations is given by the multinomial coefficients

$\Omega_{t}(n)=\dfrac{\left(n_{1}+n_{2}\cdots\right)!}{n_{1}!n_{2}!\cdots}$

$\Omega_{t}(n)=\dfrac{\mathcal{N}!}{\prod_{j}n_{j}!}$

Each combination has a specific configuration of energies $E_{j}$.  For example, if we place 4 objects A, B, C, and D into boxes 2, 3, 2, and 1, respectively the total energy of this configuration (or micro-state) is given by

$E_{1}+2E_{2}+E_{3}=E_{t}$

Of course, this configuration is just like a playlist, except we have not accounted for the ordering explicitly here.  We can consider other configurations as well

but they all have the same total energy $E_t$.  This leads to the constraints

$\sum_{j}n_{j}E_{j}=E_{t}$

The probabilities of being in each box $P_j$ is easy computed for small $latex \mathcal{N}$

$P_{j}=\dfrac{\bar{n_{j}}}{\mathcal{N}!}$

where $\bar{n_{j}}$ is the average number of objects in each box.   And we can compute the average energy $\bar{E}$

$\bar{E}=\sum_{j}P_{j}E_{j}$

We want to find the most likely distribution $\Omega(n*)$ of objects, subject to the constraints on the distribution. But as soon as N gets large, we can not just evaluate these sums because they are too large.  But we note that the most likely distribution $\Omega(n*)$ is very highly peaked around   $n*$

as is the most likely average Energy

Which means that we can approximate the most likely distribution by trying to find the minimum value of $\Omega(n*)$.  Since this distribution is so peaked, we can approximate this very well by minimizing the log of the distribution

$\min\log(\Omega(n*))$ subject to our constraints

$\sum_{j}n_{j}=\mathcal{N}$ and $\sum_{j}n_{j}E_{j}=E_{t}$.

We can solve this optimization using the method of Lagrange multiplies, as we do in many convex optimization problems in ML.  This yields

$\dfrac{\partial}{\partial n_{j}}\left[\ln\Omega_{t}(n)-\alpha\sum_{i}n_{i}-\beta\sum_{i}n_{i}E_{i}\right]=0$

where $\alpha,\beta$ are the Lagrange multiplies (and which usually appear as adjustable parameters in other ML problems). We want to solve this problem in the limit $\lim\mathcal{N}\rightarrow\infty$.  We take advantage of

##### Stirling’s Approximation

$\ln\mathcal{N}!\cong\mathcal{N}\ln\mathcal{N}-\mathcal{N}$

to write

$\ln\Omega_{t}(n)=\left(\sum_{i}n_{i}\right)\left(\sum_{i}n_{i}\ln n_{i}\right)-\left(\sum_{i}n_{i}\ln n_{i}\right)$

Now we can evaluate the derivatives explicitly

$\ln\left(\sum_{i}n_{i}\right)-\ln\left(n_{i}^{*}\right)-\alpha-\beta E_{j}=0$

With this simple expression, we can find the optimized, average number of objects in each box

$n_{j}^{*}=\mathcal{N}e^{-\alpha}e^{-\beta E_{j}}$

where

$e^{\alpha}=\sum_{j}e^{-\beta E_{j}}$

and the optimal probability of being in each box

$P_{j}=\dfrac{n_{j}^{*}}{\mathcal{N}!}=\dfrac{e^{-\beta E_{j}}}{\sum_{j}e^{-\beta E_{j}}}$

which is known as as the Boltzman Distribution, or , more generally, the Gibbs Measure.

The average energy of the system, which is a kind of optimal normalization for the problem, is given by

$\bar{E}=\dfrac{\sum_{j}E_{j}e^{-\beta E_{j}}}{\sum_{j}e^{-\beta E_{j}}}$

We now recognize that the energies are log probabilities

$E_{j}\sim\log(P_j)$

and we call the normalization factor the Partition Function $Z$

$Z=\sum_{j}e^{-\beta E_{j}}$

We call $\beta$ the inverse temperature, because   $\beta\sim\dfrac{1}{T}$ in traditional thermodynamics.

We have completed out mission; we have introduced the notion of a Temperature into our probabilistic models.

#### Using the Temperature:  Simulated Annealing

We should not use a formalism unless we intent to take advantage of it.  Joachim’s uses the LME formalism to create a complicated model, but he only uses SGD to solve it.  We can do better than this.  Indeed, for Deep Learning, we will have to.

Of course, it is a bit more technical than this.  And this is not the only way to solve the problem. Plus  we still need Regularization to avoid overfitting, and obtain sparse solutions.   We will live this to another post.

Still, we see that the language of Statistical Physics offers many advantages over traditional probabilistic models–and this why it is so popular in machine learning.  In our next post we show how to combine the Entropy with an SVM to provide a very powerful, hybrid, semi-supervised learning method called a Transductive SVM.

# Music Recommendations and the Logistic Metric Embedding

In this post, we are going to see  how to build our own music recommender, using the Logistic Metric Embedding (LME) model developed by Joachims (of SVMLight fame). The core idea of this recommender is that people listen to songs in a specific sequence, and that certain songs sound better when they follow other songs.

We will break this post into 2 parts:

• First, we will summarize the key ideas of the mathematical formulation of the LME.
• Second, we will actually run the LME software and see how to build our own music playlists.

But first, a little history

Introduction:  Some History of Sequence Prediction

I first began working in sequence prediction in about 2001, where I developed the Effective Operator approach to Personalized Recommenders.  This was a recommender system that used a kind of Bayesian / Regularized Linear Regression.  Some years later, working with eBay, we researched the effectiveness of search relevance (a kind of sequential recommendation) using techniques including Joachims’ earlier work on Structural SVMs.  Recent work has focussed on built upon the  Structural SVMs for Ranking, leading to  a new, simpler  approach, Metric Learning to Rank.   Metric Embedding is, in general, a good thing to know about, and you can learn about it more generally from the University of Chicago course: CMCS 39600: Theory of Metric Embeddings.  Joachims has subsequently also considered metric learning, and here, we examine some his recent research in metric learning for sequence prediction.  Specifically, we look in detail at the Logistic Metric Embedding (LME) model for predicting music playlists.

Sequence Prediction with Local Metric Embeddings

We would  like to recommend playlist of songs.  More generally, we seek to estimate the probability of observing a new, directed sequence $D$, given a training set of a very large set of directed sequences  $\mathcal{S}^{d}_{train}$.  To do this, we define a conditional, probabilistic, vector space model  that retains the local sequence structure of the sample data, and we try to learn the optimal model via regularized maximum-likelihood estimation (MLE) .

Developed by Joachims (and following his notation), we seek to predict a (directed) sequence of states $p=\{p^{1},p^{2}\cdots p^{d}\}$ (i.e. an ngram sequence, a playlist of songs, etc. ) from a large collection of states $S=\{s_{1},s_{2}\cdots\}$, where $p^{i}\in S$.   We model each sequence $p$ as a Markov Chain, so that

$Pr(p)=\underset{i}{\Pi}Pr(p^{i}|p^{i-1})$

We embed the sequence into the Latent space, or the d-fold Tensor space  (which is isomorphic to a $(|S|\times d)$ dimensional Euclidean space )

$\mathcal{S}^{d}=S\times S\times\cdots S\cong\mathbb{R}^{|S|\times d}$, such that  $p\in\mathcal{S}^{d}$

The inference problem is to find $Pr(p)$ ; it is  solved using a straightforward Stochastic Gradient Descent (SGD) algorithm, restricted to a kind of nearest-neighbor sampling called a Landmark Heuristic

Because we have a Markov Model, we will see that all we need is to write down the local log-Likelihood, defined through the conditional transition probabilities $Pr(p^{i}|p^{i-1})$  , and, as usual, some regularization.  Joachims ignores long range interactions and only regularize locally, and I will note some opportunities to address this.

Also, we note that assuming a Markov Model is actually restrictive (Recall our post on Brownian motion and memory effects in chemical physics) .  First, it assumes that to predict the next song, the only songs that matter are the ones we listened to before this song.  Additionally, Markov Models can ignore long-range interdependencies.   A NMF based recommender does not assume this, and they include long-range effects.

Recent work (2013) shows how to parallelize the method on distributed memory architectures, although we will do all of our work on a shared memory architecture. they even provide sample code!

#### Visualizing the Embedding

For example, a typical embedding is visualized as:

The Online LME Demo provides an interactive view of the embedding.  It also includes simple modifications for adding popularity, user bias, and even semantic tags to the model (not discussed here)

#### Single and Dual Point Logistic Metric Embeddings

Joachims introduces 2 different models, called the Single Point and Dual Point models.  We  should only care about the Dual Point model, although, curiously, in the published paper (2012), the Dual Point model does not work better than the Single Point model.  So, for now, we describe both. Also, the code contains both models.

By a Metric Embedding, we mean that the conditional probability is $Pr(p^{i}|p^{i-1})$ is defined with the metric, or distance $\Delta(p^{i},p^{i-1})$, in the Latent space $\mathcal{S}^{d}\cong\mathbb{R}^{|S|\times d}$ where we have embedded our states (songs).

$Pr(p^{i}|p^{i-1})\sim\exp(-\Delta(p^{i},p^{i-1})^{2})$

What is the Embedding?

We associate each state $s_{i}$  with a vector $\vec{X}(i)\in\mathbb{R}^{|S|}$

$s_{i}\rightarrow\vec{X}(i)$

Notice Joachims writes $\vec{X}(p^{i})$ and $\vec{X}(s_{i})$ to distinguish between vectors associated with elements of a playlist and arbitrary elements of $S$.

In a model for , say, playlists, each vector  it is , initially, just a point $\vec{X_{0}}(i)$ on a $|S|$-dimensional hypercube.  We then run an inference algorithm which learns the optimal vectors $\vec{X_{opt}}(i)$ that minimize the total regularized log-likelihood $L$, as computed over the entire training data, and subject to the specific model chosen.

What is $\Delta$ ?

In the Single Point model, $\Delta_{1}$ is just the Euclidean distance between  $X(i)$ and $X( i-1)$

$\Delta_{1}(p^{i},p^{i-1})=\parallel\vec{X}(i)-\vec{X}(i-1)\parallel_{2}$

This is just a standard Vector Space model.

In the Dual Point model, however, we associate 2 points (vectors)  $\left(\vec{U}(i),\vec{V}(i)\right)\in\left(\mathbb{R}^{|S|},\mathbb{R}^{|S|}\right)$, with each state $s^{i}$ within a  sequence $s$.  We call these the entry $\vec{U}(i)$ and exit $\vec{V}(i)$ points.

$s_{i}\rightarrow\left(\vec{U}(i),\vec{V}(i)\right)$

The distance between 2 states the asymmetric divergence

$\Delta_{2}(p^{i},p^{i-1})=\parallel\vec{U}(i)-\vec{V}(i-1)\parallel_{2}$

This is like a Vector Field Model, where each state is mapped to a directed vector

$\vec{\Delta}(i)=\vec{V}(i)-\vec{U}(i)\in S$.

(In mathematical physics, the embedding is a called a Vector Field or, more generally, a Fiber Bundle ).

The Latent metric is the Euclidean distance between the end point of $p^{i-1}$ and the starting point of $p^{i}$ within a directed sequence $p$.

We can also construct sequence-independent ($\bar{U}(i),\bar{V}(i)$)  vectors by simple averaging over a large sample of known sequences (i.e. song playlists).

$i\rightarrow\left(\bar{U}(i),\bar{V}(i)\right)$

#### Model Inference

We want to find

$Pr(p)=\underset{i}\Pi\dfrac{\exp(-\Delta(p^{i},p^{i-1})^{2})}{Z(p^{i-1})}$

where  $Z(p^{i})$ is the point partition function, given by

$Z(p^{i})=\sum_{j=1}^{|S|}\exp(-\Delta(p^{i},s_{j})^{2})$

This is just the normalization of the conditional probability (above).   Note that  the  sum is over all possible states $s_{j}\in S$, and for numerical performance, we will restrict the sum to nearby states.

Actually, the final model used is somewhat more complicated because it includes other features to adjust for popularity, user bias, diversity, etc.

#### Final LME Model

$Pr(p)=\underset{i}{\Pi}Pr(p^{i}|p^{i-1})=\underset{i}\Pi\dfrac{\exp(-\alpha\Delta(p^{i},p^{i-1})^{2})+\beta b+\gamma\Delta(p^{i},p^{0})^{2}}{Z(p^{i-1},p^{0},\alpha,\beta,\gamma)}$

The optimal solution is found by maximizing the  log-Likelihood $\log(Pr(p))$, subject to regularization to avoid over-training.

(in a later post, I will explain the partition function ad where it comes from;  seriously, I have most of this written up already)

For the Single Point Model, the objective function is the log-Likelihood plus a standard 2-norm regularizer

$f_{1}(\bf{x})=\underset{U,V}{\sum}L_{1}(D|X)-\lambda\parallel\bf{X}\parallel^{2}_{F}$

For the Dual Point Model, the objective function includes 2 kinds of regularizers

$f_{2}(\bf{x})=\underset{U,V}{\sum}L_{2}(D|U,V)-\lambda\Omega(U,V)-\nu\underset{s\in S}{\sum}\Delta_{2}(s,s)_{2}$

The first regularizer is

$\bf{\Omega(U,V)}=\parallel\bf{U}\parallel^{2}_{F}-\parallel\bf{V}\parallel^{2}_{F}$

and ensures that the entry and exit vectors don’t get too large.  It is similar to the kind seen in traditional matrix factorization techniques.

The second regularizer ensures the small length sequences, and is similar in spirit to the kind of path regularizers seen  in Minimum Path Basis Pursuit methods (where, ideally, we would seek use an L1-norm for this).  This part is critical because it allows to decompose what looks like a large, messy, non-convex, global optimization problem into a convex sum of local, albeit non-convex , optimization problems:

A Local Manifold Optimization Problem

Here is the real trick of the method

By including  the dual point, local path regularizers $\Delta_{2}(s_{a},s_{b})^{2}$,  we can represent the global log-Liklihood as a sum over local log-Likelihoods, describing regularized transitions from $s_{a}\rightarrow s_{b}$

We re-write the  objective function $f_2(\bf{x})$ (here, Joachims  confuses the notation and includes the regularizer in the log-Likehood, so be careful)

$\sum_{a=1}^{|S|}\sum_{b\in C_{a}}T_{ab}l(s_{a},s_{b})-\Omega(U,V)$

where

• $C_{a}$ is the collection of states close to $s_{a}$ (i.e  all nearest neighbors,  observed bigams, etc.),
• $T_{ab}$ is the number of transitions from $s_{a}\rightarrow s_{b}$, and is precomputed from the training data, and
• $l(s_{a},s_{b})$ is the dual point, local log-likelihood term for the transition $s_{a}\rightarrow s_{b}$

$l(s_{a},s_{b})=-\Delta_{2}(s_{a},s_{b})^{2}-\log(Z_{2}(s_{a}))$

• $\Omega(U,V)$ is the global regularizer, but which is trivial to evaluate locally (using SGD)

While the optimization problem is clearly non-convex, this decomposition allows us to ‘glue together’ a convex combination  of locally non-convex but hopefully tractable optimizations, leading to what I call a manifold optimization problem.   ( Indeed, from the point of view of mathematical physics, the dual point embedding model has the flavor of a non-trivial fiber bundle,  and the path constraints act to minimize the curvature between the bundles; in other words, a gauge field theory. )   Usually I don’t trust non-convex methods.  Still,  it is claimed good solutions can be found with 100-200 iterations using standard SGD.  We shall see…

#### SGD Update Equations

Joachims suggests solving the SGD by iterating through all states (songs) $s_p$ and updating the exit vectors $U(s_p)$ with

$U(s_p)\leftarrow U(s_p)+\dfrac{\tau}{N}\left[\underset{b\in C_p}{\sum}T_{pb}\dfrac{\partial l(s_p,s_b)}{\partial U(s_p)}+\dfrac{\partial\Omega(U,V)}{\partial U(s_p)}\right]$

and for each $s_p$, updating the entry vector $V(s_q)$ for each possible transition $s_{p}\rightarrow s_{q}$

$V(s_q)\leftarrow V(s_q)+\dfrac{\tau}{N}\left[\underset{b\in C_p}{\sum}T_{pb}\dfrac{\partial l(s_p,s_b)}{\partial V(s_q)}+\dfrac{\partial\Omega(U,V)}{\partial V(s_p)}\right]$

where

• $\tau$ is the SGD learning rate, and
• $N$ is the number of transitions sampled

The partials over the regularizers are simple

$\dfrac{\partial\Omega(U,V)}{\partial U(s_p)}=-2\lambda U(s_{p})-2\nu\vec{\Delta}_{2}(s_{p},s_{p})$

$\dfrac{\partial\Omega(U,V)}{\partial V(s_q)}=2\lambda V(s_{q})-2\nu\vec{\Delta}_{2}(s_{q},s_{q})$

The partial over $U(s_p)$  is written to  samples all possible exit vectors $s_l$

$\dfrac{\partial l(s_p,s_b)}{\partial U(s_p)}=-2\vec{\Delta}_{2}(s_p,s_b)-\dfrac{\sum_{S_l}\exp(-2\Delta_{2}(s_p,s_l)^{2})\vec{\Delta}_{2}(s_p,s_l)}{Z^{r}(s_p)}$

while the partial over $V(s_q)$ is considerably simpler since we only sample the individual transitions (from exit to entry) $s_{p}\rightarrow s_{q}$ for each state, yielding

$\dfrac{\partial l(s_p,s_q)}{\partial V(s_q)}=2\vec{\Delta}_{2}(s_p,s_q)-\dfrac{\exp(-2\Delta_{2}(s_p,s_q)^{2})\vec{\Delta}_{2}(s_p,s_q)}{Z^{r}(s_p)}$

Notice $Z^{r}$ refers to a restricted partition fucntion that only samples states in $C_p$ (i.e nearest neighbors, or what Joachims called the Landmark Heuristic)

Possible Extensions

In a simple Netlfix-style item recommender, we would simply apply some form of matrix factorization (i.e NMF) to $T_{ab}$  directly, and ignore the sequence structure.  Here, it is noted that we have some flexibility in computing  $T_{ab}$ , and we could both Kernalize and Regularize it before running the inference algorithm.

Next Steps

Now that we have a basic review of the model, next we will start making our own personalized playlists.  In our next post, we will look in detail at the LME Software and build some actuals playlists.

Stay tuned

References

See Joachims Website on Playlist Prediction and the references list

# A Ruby DSL Design Pattern for Distributed Computing

Frequently in my work in big data and machine learning, I need to run large calculations in parallel.  There are several great tools for this, including Hadoop, StarCluster,  gnu-parallel, etc.   The ruby world lacks comparable tools although ruby has had distributed computing for a long time:

http://blog.new-bamboo.co.uk/2012/04/11/the-druby-book-distributed-and-parallel-computing-with-ruby-is-finally-out

Having learned ruby while at Aardvark, but not being able to read Japanese , I decided to write my own simple tool, a ruby DSL for distributed computing.   Here I discuss the ruby DSL design pattern and it’s first implementation , for a distributed web crawler and scraper.

As usual, this post is motivated by a question on Quora, “How can we define our own query language?” .

#### DSL Example

Here we show an example of the DSL pattern in action. Imagine we have a master node and several worker nodes, communicating by a queue:

We want to load a block of work onto the master, have it distribute the block to the workers, execute the block in parallel, and cache the results back to the master.

In other languages, like Java or Python, we would need a compiler-compiler like ANTLR.  In Ruby, we can use the magical meta-programming, and the sourcify gem, to quickly build a DSL and execute it remotely.  Lets take a look at a simple example of what our DSL should look like:

#!/usr/bin/env ruby
require 'dsl'

include Dsl

worker.doit do
p "I am doing some work in parallel on several workers"
end
end

The doit &block is the basic method of our DSL.  We want to invoke doit on the master node, but have it executed remotely on the worker node(s).   We do this using the  sourcify gem , which lets us convert the block to source (i.e. Proc#to_source), and ship the source to the worker (i.e via a queue in redis) to be executed by instance_eval(source), or as a singleton method. Note: sourcify is a workaround until ruby-core officially supports Proc#to_source. It is a little buggy…so be careful.

Here is a complete example.  The DSL is broken into 3 parts: Driver, DSL::FrontEnd and DSL::CoreDSL::FrontEnd plays the role of a traditional parser, and DSL::Core is the remote interpreter. Driver and DSL::FrontEnd are executed on the master, and DSL:Core on the workers.

require 'sourcify'

module Dsl

module FrontEnd
def self.included(base)
base.send :include, InstanceMethods
end

module InstanceMethods
def init(opts={}, &block)
@doit_block = nil
yield self if block_given?
end

def doit(&block)
@doit_block = block
self
end

def block_sources
blocks = {}
blocks[:doit_block] = @doit_block.to_source
return blocks
end

end
end

end

DSL::FrontEnd uses the ruby module mixin meta-programming pattern; when Driver includes it, the doit method is added to the Driver instances.

DSL::load creates an instance of a worker  in the do block , which in an instance Driver and therefore now implements the doit method.

Using this pattern, we can create a super simple, extensible DSL that can support a wide range of methods to run on the worker nodes.

#!/usr/bin/env ruby
#
# Copyright (c) 2013 Charles H Martin, PhD
#
#  Calculated Content
#  http://calculatedcontent.com
#  charles@calculatedcontent.com
#
require 'dsl/front_end'
require 'json'

module Dsl

end

class Driver
include FrontEnd

def initialize(opts = {}, &block)
init(opts) #_dsl_front_end
yield self if block_given?
end

# store block_sources.to_json in redis
end

self.new(opts) do |core|
yield core if block_given?
end
end

end

end

When doit is invoked, the &block is converted to source, and then stored in cache (redis). When the worker is run, it grabs the block source (out of redis) and calls perform(source_text), which then evaluates the code block in context of the worker instance using instance_eval.  (Note that in a production system, we may want to compress the source to save memory)

require 'json'
module Dsl

module Core
def self.included(base)
base.send :extend, ClassMethods
# base.send :extend, InstanceMethods
end

module ClassMethods
def perform(source_text)
@doit_block = JSON.parse(source_text)['doit_block']
doit
end

def doit
instance_eval(@doit_block).call
end

end
end

end

Also, to be a bit more efficient, we could also run the worker blocks in a batch mode and evaluate the DSL as a singleton method:

module Dsl
module FastCore
def self.included(base)
base.send :extend, ClassMethods
end

def init(source_text)
block = JSON.parse(source_text)['doit_block']
define_singleton_method(:doit, block)
end

def perform(source_text)
doit
end

end
end
end

Also, notice that because the DSL block is being executed on the worker, it does not carry it’s local context from the master. Variables defined in the do block are not visible inside the doit worker method:

Dsl::load do |worker|
mvar = "I am a variable on the master"
worker.doit do
p "the worker can not see mvar"
end
end

And there we have it; a simple and lightweight way to create a simple ruby DSL for distributed computing.

If you would like to see the design pattern in action, you can can clone and run the open source  Cloud-Crawler .

Related:

http://charlesmartin14.wordpress.com/2013/04/22/cloud-crawler-an-open-source-ruby-dsl-for-crawling-the-web/

# Causality, Correlation, and Brownian Motion

A recent question on Quora asked if machine learning could learn something from the Black Scholes model of Finance

http://www.quora.com/Machine-Learning/Can-we-learn-any-lessons-from-the-Black-Scholes-solution-to-pricing-risk-to-machine-learning-algorithms-for-personalization-recommendation-algorithms

I have been curious about this myself, but from a slightly different perspective, which I share here:

#### Introduction

There is a deep relation between statistical mechanics (stat mech) and machine learning (ML). From Exponential Families to Maximum Entropy to Deep Learning, stat mech is all over ML. Here, we are interested in distinguishing between causality and correlation, and the hope is we can learn from stat mech how to do this. We are motivated by a recent neuroscience paper [3] that suggests a new approach to Granger Causality via non-equilibirum statistical mechanics, specifically the Mori-Zwanzig Projection Operator formalism (see [6], my favorite reference, or just Google it).  Also, very recent results [10] demonstrate that one can apply  the Mori-Zwanzig formalism to a set of coupled stochastic variables like we see in Granger Causality  and other time series models.

This is, however, pretty esoteric stuff, even if you have a PhD in Theoretical Chemistry like me. So before we dive in, in this post, we first look at a something ubiquitous in science: Brownian Motion

#### Causality in Statistical Mechanics

What is causality? Let’s pose it as a problem in time series analysis. Say we want to describe the dynamics of a system where one set of variables $X(t)$ dominate, but are driven, seemingly stochastically, by some other set of variables $Y(t)$.

For example, $X$ might represent a particle floating in a viscous fluid $Y$. As the particle moves, the fluid relaxes around it, leaving a slowly decaying trail.

So even though the particle $X$ appears to move randomly, its position at time $t+\delta t$ depends on both it’s entire history $X(t)$ and the history of the local surroundings $Y(t)$. The more viscous, or stiff, the fluid is, the more the surroundings matter.

When can we say that $Y(t)$ causes $X(t)$ ?

The challenge in statistical mechanics is to decide when the particle is just random (i.e. a Brownian motion) and when the fluid is strongly affecting (i.e causing) a part of the dynamics.

#### Stochastic Problems in Astronomy and Physics

To get started, we need a some basic physics.

There is a classic review of Stochastic Problems in Astronomy and Physics [1] by the late Subrahmanyan ChandrasekharNobel Laureate (Physics, 1983) and former head of the department of physics when I was a student at the University of Chicago.

I am a firm believer that every scientist should study and know the great works of the Nobel Laureates, so I strongly encourage the reader to dig deep [1] on their own–or to at least continue reading.

Chandrasekhar introduces the Langevin equation, and uses it to describe stochastic processes such as Random Walks and Mean-Reverting Walks (which are important for cointegration, but not discussed here)

#### Random Walks

When our particle is behaving randomly, and does not remember its history, we say it is undergoing a Random Walk. This is also known as Brownian Motion. Below we plot the path of several 1-D random walks over time. Notice that they all start at (0,0), but, over time, eventually diverge from each other.

Brownian motion is ubiquitous; it appears everywhere in Science. In Finance and Economics, it is described as a Wiener Process or with the Ito Stochastic Calculus. However, if we go old school, we can also describe the random walk using something familiar from high school calculus/physics. The model is

#### the Langevin equation:

$m\dfrac{d^{2}}{d^{2}t}x(t)=-m\gamma\dfrac{d}{dt}x(t)-R(t)$

where $m$ is the mass, $m\gamma$ is a friction coefficient, and $R(t)$ represents the noise, interpreted as a random force.

What is this? We all know–or have least heard of–Newton’s famous equation

$F=ma$

Force equals mass times acceleration

Here, we flip this around, and have

$ma=F=F_f-F_R$

mass ($m$) times acceleration $\dfrac{d^{2}}{d^{2}t}x(t)$ = Force

= Frictional Force ($F_f$ minus Random Force  ($F_R$)

where

the Frictional Force, $F_f=m\gamma\dfrac{d}{dt}x(t)$,  is a constant times velocity

and

the Random Force $F_R=R(t)$ is defined below

.
The random force represents the interactions of the particle with the solvent, which undergoes the thermal, random fluctuations (shown here)

We can not describe the random force with a dynamical equation, so, instead, we define $R(t)$ through it’s time-time auto-correlation functions

$\langle R(t)R(t)\rangle=0$

$\langle R(t_1)R(t_2)\rangle=2\pi_{R}G\delta(t_1-t_2)$

This correlation function is dot product in a Hilbert space; in fact it is a Kernel.

Here it is a constant; below we define a ‘memory Kernel’ that describes what is causing the random behavior (and also deals with the failures of the theory at short times)

#### Diffusion and the Einstein relation:

The particle $X$ moves around randomly; we say it diffuses. We identify a Diffusion constant, which is just the limit of average mean squared position, taken to very long times

$D=\underset{t\rightarrow\infty}{\lim}\dfrac{1}{2t}\langle (X(t)-X(t_{0}))^{2}\rangle$

It turns out the random force is not just any random force; it is intimately related to the other parameters in the equation. Einstein noted that the friction constant $m\gamma$ is related to the diffusion $D$ constant by

$D= \dfrac{k_{B}T}{m\gamma}$

where $k_{B}$ is Boltzman’s constant, and $T$ is the temperature.

With a little work (not shown here), we can also use the Einstein relation and our definition of the random force to express the Diffusion constant in terms of the time correlation of the velocity $u(t)$

$D=\int_{0}^{\infty}\langle u(t_{0})u(t_{0}+t)\rangle dt$, where

This means that if the velocities are weakly correlated (over time), the surroundings drag on the particle, and it diffuses slower. On the other hand, when the velocities are strongly correlated, the particle does not feel the surroundings, and it diffuses faster.

Because the velocity platys such an important role, frequently we express the Langevin equation in terms of the velocity

$m\dfrac{d}{dt}u(t)=-m\gamma u(t)-R(t)$

We can also express the velocity-velocity auto-correlation function as a simple exponential decay

$\langle u(t_{0})u(t_{0}+t)\rangle=\langle u^2\rangle\exp(-\gamma|t|)$

where

$\langle u^2\rangle=\dfrac{D}{\gamma}$

Not that our theory breaks down at short time scales because we expect the the velocity process to be stationary process. That is, we expect

$\langle u(t)\dfrac{d}{dt}u(t) \rangle=0$

which is clearly not satisified.

On long timescales, the Langevin equation describes a mathematical Brownian motion, but on small timescales, the Langevin equation includes inertial effects that are not present in the Brownian description. These inertial affects can be very real, as shown above in the image of the pollen in the viscous fluid, and are corrected for below.

Also, we are considering systems that actually have velocities–this is important since we don’t usually think of pure stochastic processes, such as economic or financial time series, as having a well defined or instantaneous velocity.

Still, we are on the right track. We can relate the random forces directly to the macroscopic diffusion $D_{u}$ (in velocity space) through the correlation function of the random forces

$D_{u}\sim\int_{0}^{\infty}\langle R(t_{0})R(t_{0}+t)\rangle dt$

So we can determine how strongly the environment ’causes’ the dynamics by measuring an auto-correlation function of the random forces

We begin to see the link between correlation functions and causality. This is a direct example of

the Fluctuation-Dissipation theorem

The fluctuation-dissipation theorem is a general result of statistical mechanics that quantifies the relation between the fluctuations in a system at equilibrium and its response to external perturbations. Basic examples include Brownian motion and Johnson–Nyqvist noise, but this phenomena also arise in non-equilibrium systems, and, perhaps, even in the neocortex.

Here, the same random forces that cause the erratic motion of a particle also causes drag. If something pulls the particle through the fluid, it feels this drag. The particle’s random motions and dissipative frictional forces have the same origin, or cause.

We can see this in 2 different ways:

1. We can apply an external force to the system and monitor the (linear) response
2. We can construct equilibrium velocity distribution and relate this to the self-diffusion

More generally, we can observe any external action $Y$  that “causes” $X$ to move by inferring the ‘equilibrium’ distribution, and  computing the appropriate correlation function of the random forces; this is our link to causality.

What does this have to do with Machine Learning? Notice I said infer. For a general system, we will infer the so-called ‘equilibrium’ , or most likely, (Boltzman) distribution, using a variational method. Then we can directly evaluate the correlation functions (which are the partial derivatives of our calculated partition function). But first, we need a more robust Langevin equation to work with.

#### the Generalized Langevin Equation (GLE)

We can generalize the Langevin equation by redefining the Random Force through a different Kernel. Let

$\langle R(t_1)R(t_2)\rangle=\xi(t_1-t_2)$

where $\xi(t)$ is called the memory Kernel.

This gives rise to a Generalized Langevin Equation (GLE)

$\dfrac{d}{dt}u(t)=-\int_{t_0}^{t}\xi(t-t')u(t')dt+R(t)$

A typical application of the GLE in chemical physics to extend Brownian to describe the dynamics of a particle in a viscoelastic fluid (a fluid that is both thick and deformable); the memory Kernel gives rise to a generalized frictional force.

Some examples:

When memory effects matter, we have causality.  Memory effects an appear in the velocity auto-correlation function (VACF) either through the presence of an oscillatory behavior or by means of slowly decreasing correlations [11].   In liquids, oscillations appear, for example, when particles are trapped, or caged, by their surroundings.  And correlations slowly decrease in cases of anomalous diffusion, such as liquid-like defects [12].  Indeed, these effects arise all over physics, and, we suspect, in other systems as well.   Only very recently, however, has the Mori-Zwanzig formalism been recognized as useful as a general mathematical tool for both optimal prediction [13],  Markov Models [14], and Neuroscience [3]

#### Towards a Generalized Granger Causality

In our next post, we will apply the GLE to problems in general machine learning. First, we will introduce a more abstract form of the GLE. We will then derive this GLE using the Mori-Zwanzig Projection Operator formalism from statistical mechanics.

Then, we will show how to infer the model parameters using variational techniques. Finally, we will demonstrate the relationship between Granger causality and a new measure that, we hope, will be more suitable in noisy, non-linear regimes. Please stay tuned.

#### References

[1] Engle, Robert F., Granger, Clive W. J. (1987) “Co-integration and error correction: Representation, estimation and testing“, Econometrica, 55(2), 251–276.

[3] D.Hsu and M. Hsu (2009) Zwanzig-Mori projection operators and EEG dynamics: deriving a simple equation of motion

[4] Peter Hänggi, Fabio Marchesonia! Introduction: 100 years of Brownian motion

[6] J.P. Hansen and I. R. McDonald (1986) , Theory of Simple Liquids

[7] http://www.dam.brown.edu/documents/Stinis_talk1.pdf

[8] Kubo, the Fluctuation-Dissipation Theorem

[10] Jianhua Xing and K. S. Kim , Application of the projection operator formalism to non-Hamiltonian dynamics J. Chem. Phys. 134, 044132 (2011)

[12] C. H. Martin and S. J. Singer, “The behavior of point defects in a model crystal near melting”,Phys. Rev. B44: 477 (1991)

[13] Alexandre J. Chorin*, Ole H. Hald*, and Raz Kupferman, Optimal prediction and the MoriZwanzig representation of irreversible processes, PNAS (1999)

[14a] C. L. Beck, S. Lall§, T. Liang and M. West, Model reduction, optimal prediction, and the Mori-Zwanzig representation of Markov chains
[
14b] Madhusudana Shashanka, A FAST ALGORITHM FOR DISCRETE HMM TRAINING USING OBSERVED TRANSITIONS

Bring the Noise

# Causality vs Correlation: Granger Causality

One of the most repeated mantra’s of Machine Learning is that

“A Causation is not a Correlation!”

When faced with this statement, I’m never really sure how to respond.  After all, the entire point of science is to measure correlations and other signals and determine models that explain their cause and can predict future events.

It is certainly true, however, that if we are naive, we can fool ourselves into seeing patterns that are not really there.    This is especially true in financial and econmetric time series, which do not seem to follow any of the simple laws of statistics.   In our continuing studies of noisy time series, we do not seek to address “the fundamental philosophical and epistemological question of  real causality,” [5] but, rather,

We seek practical methods that can detect a weak signal  in noisy time series--and model the underlying ’cause’

#### Science: the Search for Causation

In a previous post, we looked for specific non-linear models of signals in very noisy data, such as gravity waves and earth quake prediction.  Could we find a Gravity Wave or predict an Earthquake by observing a specific non-linear pattern–or did we just have noise?

In Economics and Finance — i.e. on Wall Street — we seek patterns in noisy time series–patterns we can trade.    Here, we really need to understand the ’cause’ of the pattern because most financial time series are highly non-stationary and it is quite easy to overtrain just about any model — and lose all our money.

In Chemical Physics, noise abounds, and we have to deal with it explicitly. Be it finding simple models for the Brownian motion of particles floating in water, or wrapping our heads around highly non-equilibrium systems that appear to be dominated by random fluctuations.

In this series of posts, we review a very interesting paper that came out a few fears ago [2] that establishes the relationship between Econometric notions Causality and the Mori & Zwanzig Projection operator formalism of non-Equilibrium statistical mechanics. We will

• establish a deep relationship between Granger Causality and the Fluctuation-Dissipation Theorem, and
• see a new test of Granger Causality that, in theory, should work much better in time series dominated by noise.

If time permits, we will also look at some more recent methods, such as the the Thermal Optimal Path Method developed by Sornette et. al. [5], both as a practical tool, and within the context of modern machine learning.  See also the Kaggle contest below

#### Granger Causality and Co-Integration

In 2003, Robert F. Engle (NYU) and Clive W. J. Granger (UC San Diego) won the Nobel Prize in Economics for their development of Statistical Methods for Economic Time Series.

In particular, they developed simple tests to determine if one time series $x(t)$ is being caused by another time series $y(t)$–even when they not correlated.  That is:

Does  $Y(t)$  cause $X(t)$?

A classic example is to look at a drunk walking her dog.  Both the drunk and her dog follow a random path, but they still try to stay close to each other.   The paths are not actually correlated.  Instead we say the 2 paths are Co-Integrated.

Co-Integration is particularly useful in Pairs Trading [6]

One selects 2 assets that stay close to each other over time enough that they can be successfully traded.  The success of Pairs Trading has stimulated the search for sophisticated tests for co-integration.  So how does one define –and test–for co-integration?

Any 2 time series are co-integrated if any linear combination of them is stationary

Or, more technically, 2 (or more) time series are co-integrated  if they share a common stochastic drift.  Granger and Engle developed a simple test for this.

Say we have a time series $x(t)$ such that $x(t+\delta t)$ depends on all previous values of $x(t)$.  Usually we  see this expressed as the linear relationship

$x_{t}=g_{1}x_{t-1}+g_{1}x_{t-2}+\cdots+g_{n}x_{t-n}$ + residual

although we can express this more generally,

$X(t+\delta t)=G\circ X(t)+R_{X}(t)$

where $G$ is any linear or possibly non-linear relationship.  We then ask what causes  $X(t)$ ?  Or, rather, if we have another time series $Y(t)$, we ask

##### Is X(t) Co-Intgrated with Y(t)?

If we naively regress $x(t)$ against $y(t)$, we might find a correlation when none exists.

“A Causation is not a Correlation!”

Instead, we propose a relationship between $Y(t)$ and $X(t)$, such that

$X(t+\delta t)=H_{X}\circ X(t)+H_{Y}\circ Y(t)+R_{X,Y}(t)$

and compare the 2 error functionals:

$E_{X}(t_{a},t_{b})=\intop_{t_{a}}^{t_{b}}dt\left[R_{X}(t),R_{X}(t)\right]$

$E_{X,Y}(t_{a},t_{b})=\intop_{t_{a}}^{t_{b}}dt\left[R_{X,Y}(t),R_{X,Y}(t)\right]$

We say that

$Y(t)$  causes $X(t)$ … in the Granger Sense …

when   $E_{X,Y}(t_{a},t_{b})$ is much smaller than $E_{X}(t_{a},t_{b})$

Another way of saying this, which might be more familiar to Machine Learning practitioners, is that we say $Y(t)$  causes $X(t)$ when the future values of $X(t)$ can be better predicted with the histories of both $X(t)$ and $Y(t)$ than just $X(t)$ alone.

##### Granger 2-step Causality Test

A classic co-integration test is the 2-step Granger test. Here, we test if linear combination of $Y(t)$  and $X(t)$ is stationary.

1. form a new time series $u(t)$, which is the difference of the two.  this is usually done with some suitable linear regression: $X(t)-\beta Y(t)=u(t)$
2. apply the Augmented Dickey Fuller  (ADF) test on the residuals $u(t)$ to see if it is stationary time series.

The ADF test estimates the regression coefficient $\rho$ of $u_{t+1}$ on $u_{t}$:

$u_{t+1}=\rho u_{t}+\epsilon$

If the coefficient  $\rho\ll 1$ , the residuals are not stationary and we have a co-integrated process; if it is close to 1, the residuals are a stationary and we are not co-integrated.  The ADF test is readily available in Matlab[3] and Python [4].

#### Contest

There is currently an open Kaggle contest on this very subject

Given samples from a pair of variables A, B, find whether A is a cause of B.

http://www.kaggle.com/c/cause-effect-pairs

We provide hundreds of pairs of real variables with known causal relationships from domains as diverse as chemistry, climatology, ecology, economy, engineering, epidemiology, genomics, medicine, physics. and sociology. Those are intermixed with controls (pairs of independent variables and pairs of variables that are dependent but not causally related) and semi-artificial cause-effect pairs (real variables mixed in various ways to produce a given outcome).

it is a bit late to enter, but if anyone wants to make a last hour contribution using one of these methods please feel free to contact me for a collaboration.

#### Next Steps

Having laid out the basic framework, the next steps are to look at the problem, from the point of of view of non-equilibirum statistical mechanics, and see if any of these old physics ideas are useful in the modern world of machine learning.  stay tuned (and anyone wanting to do the contest please contact me)

#### References

[1] Engle, Robert F., Granger, Clive W. J. (1987) “Co-integration and error correction: Representation, estimation and testing“, Econometrica, 55(2), 251–276.

[2] D.Hsu and M. Hsu (2009) Zwanzig-Mori projection operators and EEG dynamics: deriving a simple equation of motion

[5] Didier Sornette and  Wei-Xing Zhou (2004) Non-Parametric Determination of Real-Time Lag Structure between Two Time Series: The ‘Optimal Thermal Causal Path’ Method