My talk from Berkeley is up on the new Calculation Consulting YouTube channel

https://www.youtube.com/channel/UCaao8GHAVcRtSZdpObC4_Kg

My talk from Berkeley is up on the new Calculation Consulting YouTube channel

https://www.youtube.com/channel/UCaao8GHAVcRtSZdpObC4_Kg

SVMs are great for building text classifiers–if you have a set of very high quality, labeled documents.

In many real world problems, we just don’t have enough labelled data. What can we do?

- We can crawl the web for more labelled data. For this purpose, here at Calculation Consulting we built a production quality crawler
- Pay mechanical Turks to label the documents. This is actually harder than it sounds.
- We can use what labels we have, and
*try to guess the labels of the unlabeled documents.*

Guessing labels is called ** Transductive Learning** and is the topic of this post.

We have a huge collection of documents, perhaps with just a few labels. The obvious, or perhaps naive, thing to do is build a classifier using what we have, and then try to scale it out incrementally. That is, we select documents labeled that are labelled high confidence by the smaller model and use them to build a training data set for a larger SVM model:

So this seems pretty simple — what could go wrong? Since the initial model has very little labeled data, it is going to make lots of mistakes, even in the high confidence regime. We are adding in some data that is mis-labeled. As we incrementally retrain, we amplify these errors, thereby reducing overall accuracy. So what can be done?

Suppose that, we we add the new data in, instead of just naively using our smaller model to label the new data, we could automatically somehow switch the labels on these new documents, and then pick the best model from all these possibilities. That is, we run an SVM where we retain the labels on the **labeled (blue) documents**, but we let the labels on the** test (purple) documents **float. We constrain the problem so that at least unlabeled documents are labeled positive. We then pick the model with the largest margin. This leads to the following non-convex optimization problem for the Transductive SVM (TSVM)

subject to the constraint

where the is the *loss* function (such the SVM hinge loss or, as here, the L2_SVM loss). There are labelled documents and unlabeled documents , and are adjustable parameters. The constraint fixes the number of positively labeled documents; this constraint I worry about the most but lets just try it anyway.

When we don’t have enough training data, the SVM optimization problem is still convex, but the optimal solution may be spurious. Imagine that we are trying to train a binary text classifier with only say 100 documents in each class. The classifier will learn how to separate the documents, but it will not be able to generalize to new documents because it simply has not seen enough examples or really enough words to build a usable feature space. The solutions may be spurios.

Indeed, there may exists many margins that separate the data, and the largest may not be the one that generalized best. That is, without enough training data, the SVM may mis-learn and think that noise words are driving the classification.

*We want is a classifier that generalizes to unseen test documents; *

*if we don’t have test labels, perhaps we can guess them.*

When we add unlabeled documents , or test data (**circles**) , to the classifier training data. If we can guess the labels of the circles correctly, we can find the optimal solution because it is less ambiguous:

Notice that since we are seeking a maximum margin solution, we implicitly are assuming that the documents form clusters in feature space because the margin lives in the low density regime between document clusters. We are also are increasing the size of feature space, thereby greatly improving the generalizabilty of the model.

Unlike a standard (inductive) SVM, the TSVM optimization problem is highly non-convex and quite difficult to solve exactly. Clearly we can not switch the labels on all the new documents; if we add in N documents, then we have combinations to evaluate. So how do we solve it?

There are 2 well known approaches to solving the TSVM–both of which are readily available in the open source package SvmLin:

- Label Switching Heuristics (-A 2) , and
- Deterministic Annealing (-A 3)

We will discuss the Deterministic Annealing algorithm because it is very nice example of how to combine the ideas of SVMs with the techniques of Statistical Mechanics discussed in our last post. Other, more modern Transductive learning algos may use alternate methods such as convex-concave / difference-convex programming [8], [9]. Many, many other approaches for non-convex optimization also exist.

We do note, however, that in practice, I have not a great difference between the -A 2 and -A 3 options, and, in fact, sometimes the -A 2 option is more stable. This is constant with the known research. Also, in the original paper [3], TSVM refers to option -A 2, and DA to -A 3. Here, we use the term TSVM to mean any implementation.

We previously, discussed basic Statistical Mechanics and Simulated Annealing, where we derive the concept of Entropy.

To solve the TSVM / DA optimization problem, we need relax the constraint on the labels that they be integers, and, instead let them represent the probability of being in the (+) or (-) class. We then introduce a temperature that First, lets get some labels that range from 0 to 1:

so that

when , and

when

This lets us re-write the TSVM optimization problem as

where we write to specifically indicate that the loss function is the L2-SVM loss.

We now take a trick from the Ising model in Statistical Mechanics. We want to transform the labels

into probabilities that range from 0 to 1:

and are normalized to the total number of expected positively labeled instances:

*How do we incorporate probabilities into our SVM optimization problem? We are already selecting the maximum margin of the training data; we now need to also select the optimal probabilities — the maximum Entropy — of the labels for the unlabeled data. *

We extend the TSVM max-margin problem, adding a term that maximizes the total Entropy of the possible configurations of the unlabeled test data

Actually, since we use the minimize , we actually minimize the negative Entropy . This leads to a new TSVM / DA optimization problem

where the additional parameter is a Lagrange Multiplier for Entropy . We also have to add in the normalization (and class balancing) constraint

Handing this constraint is a bit more complicated, and we refer to [3]; we do note that the constraint can be re-written into what looks like a Fermi energy, which is quite interesting.

*We maximize the Margin on the training (and test) sets, *

*and maximize** the Entropy on the test set.*

Solving this optimization problem requires alternating between solving the labeled SVM optimization problem (maximizing the margin ) and finding the optimal probabilities for the unlabeled instances (optimizing ). The result are labels (with confidences) for our previously unlabeled data, and, if we want to use it, a new SVM predictive model.

We also need to be careful not to overtrain. So I will select out the newly labeled data with the highest confidence labels and then use this as training data for the production SVM (i.e LibLinear). This would allow the data scientist to then take a label set of say 100 documents per class, and automatically extend it to say 10,000 labeled documents per class without the use of mechanical Turks or crawling the web.

Despite the wonderful advantages promised by Transductive (and SemiSupervised) Learning, it is quite hard to apply in practice and has not achieved even close to the popularity of its Supervised and Unsupervised counterparts. TSVMs need both proper tuning [8] and well designed data sets. We need to understand it’s practical implementation much better to use it.

An initial training data set may be naively separable if it is small and has a large number of features :

As a data set grows, we may have far more instances than features:

*potentially unseparable:*

* *

which makes it far more probable that some instances will lie in the slack region between classes.

We usually run a tool like LibLinear , which implements either a soft-margin SVM or L2-regularized Logistic Regression. Indeed, in practice is usually impossible to tell the difference between these 2 models. The additional slack terms can be though of as an extended regularizer or as modifying the SVM loss function. Indeed, even Vapnik himself has argued that it can be very beneficial to run an SVM with a large number of adjustable slack variables (i.e in master learning and/or his universum model [5]). The SvmLin implementation of the TSVM is a hard margin (binary, no slack variables) SVM, but it does use the L2-SVM loss function (and not the harder hinge loss).

It is suggested to run the TSVM slowly and repeatedly to relabel documents, and, on each iteration, to only take new labels that have moderately high confidence (and therefore, hopefully, outside the slack region)

Additionally, SvmLin only implements a binary SVM, whereas many practical applications require a multi-class SVM. More generally, one might try to extend the TSVM to a DA type algorithm using say a Potts model–a classic model in physics that generalizes an Ising model…but for now we have to use SvmLin to get practical solutions.

In a normal SVM, balancing the class data is important; in a Transducitve-SVM, it is absolutely critical. For example, this recent study on TSVMs discussed the issue. This can be quite challenging in real world, production environments, and if you want to apply this powerful technique, extra care should be taken to design the TSVM (training + test) sets.

This means, in order to apply SvmLin, it is necessary to carefully construct a collection of 1-vs-1 or 1-vs-all binary SVMs. Each individual TSVM classifier should be designed to extend and correct the label set of a given set of instances or documents.

For example, suppose we have a 5 class multiclass model, but we only have a few labeled instances for each class. If we try to extend the set of instances for class 1, we need to select either four data sets, for four 1-vs-1 binary TSVMs, or one large data set for a 1-vs-all binary TSVM. If we choose the 1-vs-all approach, then we have to make sure the positive and negative sets are balanced in a way that the final labels are better than the initial–and this is not easy. The fraction is a normalization constraint, and keeping it fixed means the negative set must be carefully constructed to ensure that final result has positively labeled instances that represent the actual positive class well.

The TSVM can help add labeled data to a well designed model; it can not be used to repair a poorly designed classification model .

Cnsider a binary SVM trained on a small training set of cat and dog videos. We then apply the model to some unseen test videos, and select a large number of videos that were labelled either cats or dogs. The model will certainly make mistakes, but we can hope that the errors are balanced across both classes. So when we apply the TSVM to the entire training + test set, half of the data will still actually be dog videos.

On the other hand, let’s consider the poorly designed classifier shown below. Here, we have 3 classes: dogs, cats, and animals. The third class is trained for animal videos, and may contain dogs, cats, horses, mice, etc. This is a bad design, and we would see this, say, in the confusion matrix for the entire model. When we create the increment set, to train the TSVM, we have no idea what fraction of the training + test set are *actually* dog videos. So we can not really fix the normalization constraint , and resulting TSVM model may behave screwy.

Transductive Learning was first proposed by more than thirty years ago by Vapnik and Chervonenkis [6] , and it is a critical idea in the development of the famous VC Theory of Statistical Learning [1]. In a future post, we will look in detail at the role of Transductive Inference in the VC theory and its modern form using the Rademacher Complexity [6].

Practically, there are few useful Transductive and Semi-Supervised ML algos to use :

- the SvmLin open source code , which we use and discuss here

- Joachim’s SVMLight , the first high performance TSVM implementations, but it is not open source
- The famous scikit-learn package has 1 method, a semi-supervised Label Propagation algorithm.
- There are a few open source Semi-Supervised packages
- the SVM based implementations are MatLab toolboxes, including a ManifoldLearn, andLapSVM.
- It appears the enterprise tool RapidMiner does implement a commercial version of a TSVM, although I personally have not tried it.

In our next post, we will run some experiments with the SvmLin code and see how well it works on some real world datasets.

* To get involved, goto *https://github.com/CalculatedContent/tsvm

[1] Vladimir Vapnik. *Statistical learning theory*. Wiley, 1998. ISBN 978-0-471-03003-4.

[2] T. Joachims, *Transductive Inference for Text Classiﬁcation using Support Vector Machines*, ICML 1998

[3] V. Sindhwani, S. Sathiya Keerthi, *Large Scale Semi-supervised Linear SVMs*, 2006

[4] O. Chapelle, B. Scholkopf, A. Zien, *Semi-Supervised Learning*, 2006

[5] http://www.cs.princeton.edu/courses/archive/spring13/cos511/handouts/vapnik-slides.pdf

[6] Vapnik, V., & Chervonenkis, A. *The Theory of Pattern Recognition, *Moscow 1974

[7] Ran El-Yaniv, Dmitry Pechyony, *Transductive Rademacher Complexity and its Applications *

[8] J. Wang, X. Shen, W. Pan *On Transductive Support Vector Machines*

[9] C. Yu , *Transductive Learning of Structural SVMs via Prior Knowledge Constraints*

I wrote this for my buddy Sebass; just a quick review of quantum stat mech:

In our last post, we presented a formalism for music recommendations called Logistic Markov Embedding (LME). This technique uses some mathematics that arises in other modern machine learning methods, such as Maximum Entropy (MaxEnt) methods, and the Restricted Boltzmann Machines (RMBs) in Deep Learning.

In this post, we are going to take a step back and understand where the idea for a metric embedding comes from and the mathematical formalism employed by Joachims. The Metric Embedding in MLE is a kind of Energy model — an idea from Quantum Statistical Mechanics.So we will review the pedagogic formalism of Statistical Mechanics for Thermodynamics.

We will follow the classic text by T. Hill, and one of the first books I ever read in college. Also, we will not look at more abstract version of Quantum Statistical Mechanics because we don’t need to deal with the phase space formulation until we look at time-dependent processes like Causality (but eventually we will get there). For a deeper view of this subject, we defer to the classic text by Jaynes, Probability Theory: The Logic of Science.

Frequently we see expressions for probabilities, as in the LME approach, written as:

where is partition function, given by

Naively, we can just think that we are defining the probability in an abstract way, and that is just the normalization. But there’s more to it. In fact, it is deeply connected the Boltzman / Gibbs Distribution

*So why in the world do we do this?*

In simpler machine learning algos like SVMs and Logistic Regression, we can readily solve the optimization problems using Stochastic Gradient Descent (SGD)– because these are convex optimization problems.

Below we see two optimization surfaces: our goal is to find the global minimum. The top problem is easy to solve, and just about any decent descent method will do (unless we need sparse solutions, which makes this a little trickier). The bottom problem, however, is filled with hills and bumps, and even looking at it, we are not sure where the bottom is.

So how do we find it? Typical solvers for non-convex problems include Genetic Algorithms, Convex-Concave/DC programming, BackProp (for ANNs), and Simulated Annealing.

In Simulated Annealing, we run the solver until it gets stuck. We then heat the system, making it jump over local humps. We then cool it down–we run the solver again until it stops. We repeat this over and over until we find the global minimum . A common application is in the simulation of liquids and crystals. To do this, we need to define a temperature.

In traditional probability theory, we require the individual probabilities sum to 1:

Of course this is always true, but we don’t have to enforce this all the time; we can relax this constraint and simply enforce it at the end of the problem. This is exactly what we do in Simulate Annealing. Instead of creating probability models, we create Energy Models. The energies are essentially the log of unnormalized probabilities

Which is what we see in LME. We only require the total energy to be constant–not 1. This means when we compute

we need the normalization factor

Lets say we have a bunch of objects that we would like to distribute (or cluster) into boxes, sets, or sequences in some optimal way. We might be placing balls into boxes, or we could be assigning documents into clusters, or we are finding personalized playlists. We want to find the optimal way of doing this through some constrained optimization problem. Imagine we have a container partitioned into small, disjoint boxes or sets . We have

- the total number of objects $latex \mathcal{N} $ is fixed. For us, the total number songs.
- the total ‘Volume’ V is fixed. This will come up later when we create the metric embedding

We wanted to find the optimal probability of a song to appear in a playlist. Here, more generally, we seek the probability , of putting an object in a box , subject to our prior knowledge of the problem. In other approaches, we might define a conditional probability for each box. Here, we say each box has an energy . We require total Energy of the system is always fixed. We can place objects into each box . We only require

Each combination, or set of numbers, constitutes a *distribution* . There are many ways of doing this; total number of combinations is given by the multinomial coefficients

Each combination has a specific configuration of energies . For example, if we place 4 objects A, B, C, and D into boxes 2, 3, 2, and 1, respectively the total energy of this configuration (or micro-state) is given by

Of course, this configuration is just like a playlist, except we have not accounted for the ordering explicitly here. We can consider other configurations as well

but they all have the same total energy . This leads to the constraints

The probabilities of being in each box is easy computed for small $latex \mathcal{N} $

where is the average number of objects in each box. And we can compute the average energy

We want to find the *most likely distribution* of objects, subject to the constraints on the distribution. But as soon as N gets large, we can not just evaluate these sums because they are too large. But we note that the *most likely distribution* is very highly peaked around

as is the most likely average Energy

Which means that we can approximate the most likely distribution by trying to find the minimum value of . Since this distribution is so peaked, we can approximate this very well by minimizing the log of the distribution

subject to our constraints

and .

We can solve this optimization using the method of Lagrange multiplies, as we do in many convex optimization problems in ML. This yields

where are the Lagrange multiplies (and which usually appear as adjustable parameters in other ML problems). We want to solve this problem in the limit . We take advantage of

to write

Now we can evaluate the derivatives explicitly

With this simple expression, we can find the optimized, average number of objects in each box

where

and the optimal probability of being in each box

which is known as as the Boltzman Distribution, or , more generally, the Gibbs Measure.

The average energy of the system, which is a kind of optimal normalization for the problem, is given by

We now recognize that the energies are log probabilities

and we call the normalization factor the Partition Function

We call the inverse temperature, because in traditional thermodynamics.

We have completed out mission; we have introduced the notion of a Temperature into our probabilistic models.

We should not use a formalism unless we intent to take advantage of it. Joachim’s uses the LME formalism to create a complicated model, but he only uses SGD to solve it. We can do better than this. Indeed, for Deep Learning, we will have to.

Of course, it is a bit more technical than this. And this is not the only way to solve the problem. Plus* we still need Regularization to avoid overfitting, and obtain sparse solutions. *We will live this to another post.

Still, we see that the language of Statistical Physics offers many advantages over traditional probabilistic models–and this why it is so popular in machine learning. In our next post we show how to combine the Entropy with an SVM to provide a very powerful, hybrid, semi-supervised learning method called a Transductive SVM.

In this post, we are going to see how to build our own music recommender, using the Logistic Metric Embedding (LME) model developed by Joachims (of SVMLight fame). The core idea of this recommender is that people listen to songs in a specific sequence, and that certain songs sound better when they follow other songs.

We will break this post into 2 parts:

- First, we will summarize the key ideas of the mathematical formulation of the LME.
- Second, we will actually run the LME software and see how to build our own music playlists.

But first, a little history

Introduction: Some History of Sequence Prediction

I first began working in sequence prediction in about 2001, where I developed the Effective Operator approach to Personalized Recommenders. This was a recommender system that used a kind of Bayesian / Regularized Linear Regression. Some years later, working with eBay, we researched the effectiveness of search relevance (a kind of sequential recommendation) using techniques including Joachims’ earlier work on Structural SVMs. Recent work has focussed on built upon the Structural SVMs for Ranking, leading to a new, simpler approach, Metric Learning to Rank. Metric Embedding is, in general, a good thing to know about, and you can learn about it more generally from the University of Chicago course: CMCS 39600: Theory of Metric Embeddings. Joachims has subsequently also considered metric learning, and here, we examine some his recent research in metric learning for sequence prediction. Specifically, we look in detail at the Logistic Metric Embedding (LME) model for predicting music playlists.

Sequence Prediction with Local Metric Embeddings

We would like to recommend playlist of songs. More generally, we seek to estimate the probability of observing a new, directed sequence , given a training set of a very large set of directed sequences . To do this, we define a conditional, probabilistic, vector space model that retains the local sequence structure of the sample data, and we try to learn the optimal model via regularized maximum-likelihood estimation (MLE) .

Developed by Joachims (and following his notation), we seek to predict a (directed) sequence of states (i.e. an ngram sequence, a playlist of songs, etc. ) from a large collection of states , where . We model each sequence as a Markov Chain, so that

We embed the sequence into the Latent space, or the d-fold Tensor space (which is isomorphic to a dimensional Euclidean space )

, such that

The inference problem is to find ; it is solved using a straightforward Stochastic Gradient Descent (SGD) algorithm, restricted to a kind of nearest-neighbor sampling called a *Landmark Heuristic*

Because we have a Markov Model, we will see that all we need is to write down the *local log-Likelihood, *defined through the conditional *transition* probabilities , and, as usual, some regularization. Joachims ignores long range interactions and only regularize locally, and I will note some opportunities to address this.

Also, we note that assuming a Markov Model is actually restrictive (Recall our post on Brownian motion and memory effects in chemical physics) . First, it assumes that to predict the next song, the only songs that matter are the ones we listened to before this song. Additionally, Markov Models can ignore long-range interdependencies. A NMF based recommender does not assume this, and they include long-range effects.

Recent work (2013) shows how to parallelize the method on distributed memory architectures, although we will do all of our work on a shared memory architecture. they even provide sample code!

For example, a typical embedding is visualized as:

The Online LME Demo provides an interactive view of the embedding. It also includes simple modifications for adding popularity, user bias, and even semantic tags to the model (not discussed here)

Joachims introduces 2 different models, called the** Single Point **and

By a Metric Embedding, we mean that the conditional probability is is defined with the metric, or distance , in the Latent space where we have embedded our states (songs).

*What is the Embedding?*

We associate each state with a vector

Notice Joachims writes and to distinguish between vectors associated with elements of a playlist and arbitrary elements of .

In a model for , say, playlists, each vector it is , initially, just a point on a -dimensional hypercube. We then run an inference algorithm which learns the optimal vectors that minimize the total regularized log-likelihood , as computed over the entire training data, and subject to the specific model chosen.

*What is ?*

** In the Single Point model,** is just the Euclidean distance between and

This is just a standard Vector Space model.

** In the Dual Point model,** however, we associate 2 points (vectors) , with each state

The distance between 2 states the asymmetric divergence

This is like a* Vector Field Model,* where each state is mapped to a directed vector

.

(In mathematical physics, the embedding is a called a Vector Field or, more generally, a Fiber Bundle ).

The Latent metric is the Euclidean distance between the end point of and the starting point of within a directed sequence .

We can also construct *sequence-independent *() vectors by simple averaging over a large sample of known sequences (i.e. song playlists).

We want to find

where is the point partition function, given by

This is just the normalization of the conditional probability (above). Note that the sum is over all possible states , and for numerical performance, we will restrict the sum to nearby states.

Actually, the final model used is somewhat more complicated because it includes other features to adjust for popularity, user bias, diversity, etc.

The optimal solution is found by maximizing the log-Likelihood , subject to regularization to avoid over-training.

(in a later post, I will explain the partition function ad where it comes from; seriously, I have most of this written up already)

For the Single Point Model, the objective function is the log-Likelihood plus a standard 2-norm regularizer

For the Dual Point Model, the objective function includes 2 kinds of regularizers

*The first regularizer* is

and ensures that the entry and exit vectors don’t get too large. It is similar to the kind seen in traditional matrix factorization techniques.

*The second regularizer* ensures the small length sequences, and is similar in spirit to the kind of path regularizers seen in Minimum Path Basis Pursuit methods (where, ideally, we would seek use an L1-norm for this). This part is critical because it allows to decompose what looks like a large, messy, non-convex, global optimization problem into a convex sum of local, albeit non-convex , optimization problems:

A Local Manifold Optimization Problem

*Here is the real trick of the method*

By including the dual point, local path regularizers , we can represent the *global* log-Liklihood as a sum over ** local** log-Likelihoods, describing

We re-write the objective function (here, Joachims confuses the notation and includes the regularizer in the log-Likehood, so be careful)

where

- is the collection of states close to (i.e all nearest neighbors, observed bigams, etc.),
- is the number of transitions from , and is precomputed from the training data, and
- is the dual point,
*local*log-likelihood term for the transition

- is the global regularizer, but which is trivial to evaluate locally (using SGD)

While the optimization problem is clearly non-convex, this decomposition allows us to ‘glue together’ a convex combination of locally non-convex but hopefully tractable optimizations, leading to what I call a manifold optimization problem. ( Indeed, from the point of view of mathematical physics, the dual point embedding model has the flavor of a non-trivial fiber bundle, and the path constraints act to minimize the curvature between the bundles; in other words, a gauge field theory. ) Usually I don’t trust non-convex methods. Still, it is *claimed* good solutions can be found with 100-200 iterations using standard SGD. We shall see…

Joachims suggests solving the SGD by iterating through all states (songs) and updating the exit vectors with

and for each , updating the entry vector for each possible transition

where

- is the SGD learning rate, and
- is the number of transitions sampled

The partials over the regularizers are simple

The partial over is written to samples all possible exit vectors

while the partial over is considerably simpler since we only sample the individual transitions (from exit to entry) for each state, yielding

Notice refers to a restricted partition fucntion that only samples states in (i.e nearest neighbors, or what Joachims called the *Landmark Heuristic*)

Possible Extensions

In a simple Netlfix-style item recommender, we would simply apply some form of matrix factorization (i.e NMF) to directly, and ignore the sequence structure. Here, it is noted that we have some flexibility in computing , and we could both Kernalize and Regularize it before running the inference algorithm.

Next Steps

Now that we have a basic review of the model, next we will start making our own personalized playlists. In our next post, we will look in detail at the LME Software and build some actuals playlists.

Stay tuned

References

See Joachims Website on Playlist Prediction and the references list

Frequently in my work in big data and machine learning, I need to run large calculations in parallel. There are several great tools for this, including Hadoop, StarCluster, gnu-parallel, etc. The ruby world lacks comparable tools although ruby has had distributed computing for a long time:

Having learned ruby while at Aardvark, but not being able to read Japanese , I decided to write my own simple tool, a ruby DSL for distributed computing. Here I discuss the ruby DSL design pattern and it’s first implementation , for a distributed web crawler and scraper.

As usual, this post is motivated by a question on Quora, “How can we define our own query language?” .

Here we show an example of the DSL pattern in action. Imagine we have a master node and several worker nodes, communicating by a queue:

We want to load a block of work onto the master, have it distribute the block to the workers, execute the block in parallel, and cache the results back to the master.

In other languages, like Java or Python, we would need a compiler-compiler like ANTLR. In Ruby, we can use the magical meta-programming, and the sourcify gem, to quickly build a DSL and execute it remotely. Lets take a look at a simple example of what our DSL should look like:

#!/usr/bin/env ruby $LOAD_PATH << "." require 'dsl' include Dsl Dsl::load do |worker| worker.doit do p "I am doing some work in parallel on several workers" end end

The `doit &block`

is the basic method of our DSL. We want to invoke `doit`

on the master node, but have it executed remotely on the worker node(s). We do this using the sourcify gem , which lets us convert the block to source (i.e. `Proc#to_source`

), and ship the source to the worker (i.e via a queue in redis) to be executed by `instance_eval(source)`

, or as a singleton method. Note: sourcify is a workaround until ruby-core officially supports `Proc#to_source`

. It is a little buggy…so be careful.

Here is a complete example. The DSL is broken into 3 parts: `Driver`

, `DSL::FrontEnd`

and `DSL::Core`

. `DSL::FrontEnd`

plays the role of a traditional parser, and `DSL::Core`

is the remote interpreter. `Driver`

and `DSL::FrontEnd`

are executed on the master, and `DSL:Core`

on the workers.

require 'sourcify' module Dsl module FrontEnd def self.included(base) base.send :include, InstanceMethods end module InstanceMethods def init(opts={}, &block) @doit_block = nil yield self if block_given? end def doit(&block) @doit_block = block self end def block_sources blocks = {} blocks[:doit_block] = @doit_block.to_source return blocks end end end end

`DSL::FrontEnd`

uses the ruby module mixin meta-programming pattern; when `Driver`

includes it, the `doit`

method is added to the `Driver`

instances.

`DSL::load`

creates an instance of a `worker`

in the `do block`

, which in an instance `Driver`

and therefore now implements the `doit`

method.

Using this pattern, we can create a super simple, extensible DSL that can support a wide range of methods to run on the worker nodes.

#!/usr/bin/env ruby # # Copyright (c) 2013 Charles H Martin, PhD # # Calculated Content # http://calculatedcontent.com # charles@calculatedcontent.com # require 'dsl/front_end' require 'json' module Dsl def Dsl.load(opts={},&block) Driver.load(opts,&block) end class Driver include FrontEnd def initialize(opts = {}, &block) init(opts) #_dsl_front_end yield self if block_given? end def load # store block_sources.to_json in redis end def self.load(opts={}, &block) self.new(opts) do |core| yield core if block_given? core.load end end end end

When `doit`

is invoked, the `&block`

is converted to source, and then stored in cache (redis). When the worker is run, it grabs the block source (out of redis) and calls `perform(source_text)`

, which then evaluates the code block in context of the worker instance using `instance_eval. `

(Note that in a production system, we may want to compress the source to save memory)

require 'json' module Dsl module Core def self.included(base) base.send :extend, ClassMethods # base.send :extend, InstanceMethods end module ClassMethods def perform(source_text) @doit_block = JSON.parse(source_text)['doit_block'] doit end def doit instance_eval(@doit_block).call end end end end

Also, to be a bit more efficient, we could also run the worker blocks in a batch mode and evaluate the DSL as a singleton method:

module Dsl module FastCore def self.included(base) base.send :extend, ClassMethods end def init(source_text) block = JSON.parse(source_text)['doit_block'] define_singleton_method(:doit, block) end def perform(source_text) doit end end end end

Also, notice that because the DSL block is being executed on the worker, it does not carry it’s local context from the master. Variables defined in the do block are not visible inside the `doit`

worker method:

Dsl::load do |worker| mvar = "I am a variable on the master" worker.doit do p "the worker can not see mvar" end end

And there we have it; a simple and lightweight way to create a simple ruby DSL for distributed computing.

If you would like to see the design pattern in action, you can can clone and run the open source Cloud-Crawler .

Related:

A recent question on Quora asked if machine learning could learn something from the Black Scholes model of Finance

I have been curious about this myself, but from a slightly different perspective, which I share here:

There is a deep relation between statistical mechanics (stat mech) and machine learning (ML). From Exponential Families to Maximum Entropy to Deep Learning, stat mech is all over ML. Here, we are interested in distinguishing between causality and correlation, and the hope is we can learn from stat mech how to do this. We are motivated by a recent neuroscience paper [3] that suggests a new approach to Granger Causality via non-equilibirum statistical mechanics, specifically the Mori-Zwanzig Projection Operator formalism (see [6], my favorite reference, or just Google it). Also, very recent results [10] demonstrate that one can apply the Mori-Zwanzig formalism to a set of coupled stochastic variables like we see in Granger Causality and other time series models.

This is, however, pretty esoteric stuff, even if you have a PhD in Theoretical Chemistry like me. So before we dive in, in this post, we first look at a something ubiquitous in science: *Brownian Motion*

What is causality? Let’s pose it as a problem in time series analysis. Say we want to describe the dynamics of a system where one set of variables dominate, but are driven, seemingly stochastically, by some other set of variables .

For example, might represent a particle floating in a viscous fluid . As the particle moves, the fluid relaxes around it, leaving a slowly decaying trail.

So even though the particle appears to move randomly, its position at time depends on both it’s entire history and the history of the local surroundings . The more viscous, or stiff, the fluid is, the more the surroundings matter.

*When can we say that causes ?*

The challenge in statistical mechanics is to decide when the particle is just random (i.e. a Brownian motion) and when the fluid is strongly affecting (i.e causing) a part of the dynamics.

To get started, we need a some basic physics.

There is a classic review of Stochastic Problems in Astronomy and Physics [1] by the late Subrahmanyan Chandrasekhar, Nobel Laureate (Physics, 1983) and former head of the department of physics when I was a student at the University of Chicago.

I am a firm believer that every scientist should study and know the great works of the Nobel Laureates, so I strongly encourage the reader to dig deep [1] on their own–or to at least continue reading.

Chandrasekhar introduces the Langevin equation, and uses it to describe stochastic processes such as Random Walks and Mean-Reverting Walks (which are important for cointegration, but not discussed here)

When our particle is behaving randomly, and does not remember its history, we say it is undergoing a Random Walk. This is also known as Brownian Motion. Below we plot the path of several 1-D random walks over time. Notice that they all start at (0,0), but, over time, eventually diverge from each other.

Brownian motion is ubiquitous; it appears everywhere in Science. In Finance and Economics, it is described as a Wiener Process or with the Ito Stochastic Calculus. However, if we go old school, we can also describe the random walk using something familiar from high school calculus/physics. The model is

where is the mass, is a friction coefficient, and represents the noise, interpreted as a random force.

*What is this?* We all know–or have least heard of–Newton’s famous equation

*Force equals mass times acceleration*

Here, we flip this around, and have

*mass () times acceleration = Force *

*= Frictional Force () minus Random Force ()*

where

*the Frictional Force, , is a constant times velocity *

and

the Random Force is defined below

.

The random force represents the interactions of the particle with the solvent, which undergoes the thermal, random fluctuations (shown here)

We can not describe the random force with a dynamical equation, so, instead, we define through it’s time-time auto-correlation functions

This correlation function is dot product in a Hilbert space; in fact it is a Kernel.

Here it is a constant; below we define a ‘memory Kernel’ that describes *what is causing* the random behavior (and also deals with the failures of the theory at short times)

The particle moves around randomly; we say it diffuses. We identify a Diffusion constant, which is just the limit of average mean squared position, taken to very long times

It turns out the random force is not just any random force; it is intimately related to the other parameters in the equation. Einstein noted that the friction constant is related to the diffusion constant by

where is Boltzman’s constant, and is the temperature.

With a little work (not shown here), we can also use the Einstein relation and our definition of the random force to express the Diffusion constant in terms of the time correlation of the velocity

, where

This means that if the velocities are weakly correlated (over time), the surroundings drag on the particle, and it diffuses slower. On the other hand, when the velocities are strongly correlated, the particle does not feel the surroundings, and it diffuses faster.

Because the velocity platys such an important role, frequently we express the Langevin equation in terms of the velocity

We can also express the velocity-velocity auto-correlation function as a simple exponential decay

where

Not that our theory breaks down at short time scales because we expect the the velocity process to be stationary process. That is, we expect

which is clearly not satisified.

On long timescales, the Langevin equation describes a mathematical Brownian motion, but on small timescales, the Langevin equation includes inertial effects that are not present in the Brownian description. These inertial affects can be very real, as shown above in the image of the pollen in the viscous fluid, and are corrected for below.

Also, we are considering systems that actually have velocities–this is important since we don’t usually think of pure stochastic processes, such as economic or financial time series, as having a well defined or instantaneous velocity.

Still, we are on the right track. We can relate the random forces directly to the macroscopic diffusion (in velocity space) through the correlation function of the random forces

* So we can determine how strongly the environment ’causes’ the dynamics by measuring an auto-correlation function of the random forces*

We begin to see the link between correlation functions and causality. This is a direct example of

the *Fluctuation-Dissipation theorem*

The fluctuation-dissipation theorem is a general result of statistical mechanics that quantifies the relation between the fluctuations in a system at equilibrium and its response to external perturbations. Basic examples include Brownian motion and Johnson–Nyqvist noise, but this phenomena also arise in non-equilibrium systems, and, perhaps, even in the neocortex.

Here, the same random forces that cause the erratic motion of a particle also causes drag. If something pulls the particle through the fluid, it feels this drag. The particle’s random motions and dissipative frictional forces have the same origin, or cause.

We can see this in 2 different ways:

- We can apply an external force to the system and monitor the (linear) response
- We can construct equilibrium velocity distribution and relate this to the self-diffusion

* More generally, we can observe any external action that “causes” to move by inferring the ‘equilibrium’ distribution, and computing the appropriate correlation function of the random forces; this is our link to causality.*

What does this have to do with Machine Learning? Notice I said *infer*. For a general system, we will infer the so-called ‘equilibrium’ , or most likely, (Boltzman) distribution, using a variational method. Then we can directly evaluate the correlation functions (which are the partial derivatives of our calculated partition function). But first, we need a more robust Langevin equation to work with.

We can generalize the Langevin equation by redefining the Random Force through a different Kernel. Let

where is called the memory Kernel.

This gives rise to a Generalized Langevin Equation (GLE)

A typical application of the GLE in chemical physics to extend Brownian to describe the dynamics of a particle in a viscoelastic fluid (a fluid that is both thick and deformable); the memory Kernel gives rise to a generalized frictional force.

Some examples:

When memory effects matter, we have causality. *Memory effects an appear in the velocity auto-correlation function (VACF) either through the presence of an oscillatory behavior or by means of slowly decreasing correlations* [11]. In liquids, oscillations appear, for example, when particles are trapped, or caged, by their surroundings. And correlations slowly decrease in cases of anomalous diffusion, such as liquid-like defects [12]. Indeed, these effects arise all over physics, and, we suspect, in other systems as well. Only very recently, however, has the Mori-Zwanzig formalism been recognized as useful as a general mathematical tool for both optimal prediction [13], Markov Models [14], and Neuroscience [3]

In our next post, we will apply the GLE to problems in general machine learning. First, we will introduce a more abstract form of the GLE. We will then derive this GLE using the Mori-Zwanzig Projection Operator formalism from statistical mechanics.

Then, we will show how to infer the model parameters using variational techniques. Finally, we will demonstrate the relationship between Granger causality and a new measure that, we hope, will be more suitable in noisy, non-linear regimes. Please stay tuned.

[1] Engle, Robert F., Granger, Clive W. J. (1987) “Co-integration and error correction: Representation, estimation and testing“, *Econometrica*, 55(2), 251–276.

[2] *Stochastic Problems in Physics and Astronomy*, Chandrasekhar, S. (The University of Chicago), Reviews of Modern Physics, vol. 15, Issue 1, pp. 1-89

[3] D.Hsu and M. Hsu (2009) Zwanzig-Mori projection operators and EEG dynamics: deriving a simple equation of motion

[4] Peter Hänggi, Fabio Marchesonia! Introduction: 100 years of Brownian motion

[5] http://web4.cs.ucl.ac.uk/staff/C.Bracegirdle/BayesianConditionalCointegration.php

[6] J.P. Hansen and I. R. McDonald (1986) , Theory of Simple Liquids

[7] http://www.dam.brown.edu/documents/Stinis_talk1.pdf

[8] Kubo, the Fluctuation-Dissipation Theorem

[9] Brownian motion: the Langevin model

[10] Jianhua Xing and K. S. Kim^{ }, Application of the projection operator formalism to non-Hamiltonian dynamics J. Chem. Phys. **134**, 044132 (2011)

[11] Anatolii V. Mokshin, Renat M. Yulmetyev, Peter Hanggi, A simple measure of memory for dynamical processes described by the generalized Langevin equation

[12] C. H. Martin and S. J. Singer, “The behavior of point defects in a model crystal near melting”,*Phys. Rev. ***B44**: 477 (1991)

[13] Alexandre J. Chorin*, Ole H. Hald*, and Raz Kupferman, Optimal prediction and the MoriZwanzig representation of irreversible processes, PNAS (1999)

[14a] C. L. Beck, S. Lall§, T. Liang and M. West, Model reduction, optimal prediction, and the Mori-Zwanzig representation of Markov chains

[14b] Madhusudana Shashanka, A FAST ALGORITHM FOR DISCRETE HMM TRAINING USING OBSERVED TRANSITIONS

One of the most repeated mantra’s of Machine Learning is that

*“A Causation is not a Correlation!”*

When faced with this statement, I’m never really sure how to respond. After all, the entire point of science is to measure correlations and other signals and determine models that explain their cause and can predict future events.

It is certainly true, however, that if we are naive, we can fool ourselves into seeing patterns that are not really there. This is especially true in financial and econmetric time series, which do not seem to follow any of the simple laws of statistics. In our continuing studies of noisy time series, we do not seek to address “the fundamental philosophical and epistemological question of real causality,” [5] but, rather,

*We seek practical methods that can detect a weak signal in noisy time series**--**and** model the underlying ’cause’** *

In a previous post, we looked for specific non-linear models of signals in very noisy data, such as gravity waves and earth quake prediction. Could we find a Gravity Wave or predict an Earthquake by observing a specific non-linear pattern–or did we just have noise?

In Economics and Finance — i.e. on Wall Street — we seek patterns in noisy time series–patterns we can trade. Here, we really need to understand the ’cause’ of the pattern because *most financial time series are highly non-stationary* and it is quite easy to overtrain just about any model — and lose all our money.

In Chemical Physics, noise abounds, and we have to deal with it explicitly. Be it finding simple models for the Brownian motion of particles floating in water, or wrapping our heads around highly non-equilibrium systems that appear to be dominated by random fluctuations.

In this series of posts, we review a very interesting paper that came out a few fears ago [2] that establishes the relationship between Econometric notions Causality and the Mori & Zwanzig Projection operator formalism of non-Equilibrium statistical mechanics. We will

- establish a deep relationship between Granger Causality and the Fluctuation-Dissipation Theorem, and
- see a new test of Granger Causality that, in theory, should work much better in time series dominated by noise.

If time permits, we will also look at some more recent methods, such as the the Thermal Optimal Path Method developed by Sornette et. al. [5], both as a practical tool, and within the context of modern machine learning. See also the Kaggle contest below

In 2003, Robert F. Engle (NYU) and Clive W. J. Granger (UC San Diego) won the Nobel Prize in Economics for their development of Statistical Methods for Economic Time Series.

In particular, they developed simple tests to determine if one time series is being *caused *by another time series –even when they not correlated. That is:

*Does cause ?*

A classic example is to look at a drunk walking her dog. Both the drunk and her dog follow a random path, but they still try to stay close to each other. The paths are not actually correlated. Instead we say the 2 paths are **Co-Integrated**.

Co-Integration is particularly useful in Pairs Trading [6]

One selects 2 assets that stay close to each other over time enough that they can be successfully traded. The success of Pairs Trading has stimulated the search for sophisticated tests for co-integration. So how does one define –and test–for co-integration?

*Any 2 time series are co-integrated if any linear combination of them is stationary*

Or, more technically, 2 (or more) time series are co-integrated if they share a common stochastic drift. Granger and Engle developed a simple test for this.

Say we have a time series such that depends on all previous values of . Usually we see this expressed as the linear relationship

+ residual

although we can express this more generally,

where is any linear or possibly non-linear relationship. We then ask what causes ? Or, rather, if we have another time series , we ask

If we naively regress against , we might find a correlation when none exists.

*“A Causation is not a Correlation!”*

Instead, we propose a relationship between and , such that

and compare the 2 error functionals:

**We say that **

* causes … in the Granger Sense …*

* when* is much smaller than ** **

Another way of saying this, which might be more familiar to Machine Learning practitioners, is that we say causes when the future values of can be better predicted with the histories of both and than just alone.

A classic co-integration test is the 2-step Granger test. Here, we test if linear combination of and is stationary.

- form a new time series , which is the difference of the two. this is usually done with some suitable linear regression:
- apply the Augmented Dickey Fuller (ADF) test on the residuals to see if it is stationary time series.

The ADF test estimates the regression coefficient of on :

If the coefficient , the residuals are not stationary and we have a co-integrated process; if it is close to 1, the residuals are a stationary and we are not co-integrated. The ADF test is readily available in Matlab[3] and Python [4].

There is currently an open Kaggle contest on this very subject

*Given samples from a pair of variables A, B, find whether A is a cause of B.*

http://www.kaggle.com/c/cause-effect-pairs

*We provide hundreds of pairs of real variables with known causal relationships from domains as diverse as chemistry, climatology, ecology, economy, engineering, epidemiology, genomics, medicine, physics. and sociology. Those are intermixed with controls (pairs of independent variables and pairs of variables that are dependent but not causally related) and semi-artificial cause-effect pairs (real variables mixed in various ways to produce a given outcome).*

it is a bit late to enter, but if anyone wants to make a last hour contribution using one of these methods please feel free to contact me for a collaboration.

Having laid out the basic framework, the next steps are to look at the problem, from the point of of view of non-equilibirum statistical mechanics, and see if any of these old physics ideas are useful in the modern world of machine learning. stay tuned (and anyone wanting to do the contest please contact me)

[1] Engle, Robert F., Granger, Clive W. J. (1987) “Co-integration and error correction: Representation, estimation and testing“, *Econometrica*, 55(2), 251–276.

[2] D.Hsu and M. Hsu (2009) Zwanzig-Mori projection operators and EEG dynamics: deriving a simple equation of motion

[3] MatLab Econometrics Module

[5] Didier Sornette and Wei-Xing Zhou (2004) Non-Parametric Determination of Real-Time Lag Structure between Two Time Series: The ‘Optimal Thermal Causal Path’ Method

[6] High Frequency Statistical Arbitrage Via the Optimal Thermal Causal Path