When Regularization Fails

Another Holiday Blog:  Feedback and Questions are very welcome.

I have a client who is using Naive Bayes pretty successfully, and the subject has come up as to ‘why do we need fancy machine learning methods?’   A typical question I am asked is:

Why do we need Regularization ?

What happens if we just turn the regularizer off ?  

Can we just set it to a very small, default value ?

After all, we can just measure what customers are doing exactly.  Why not just reshow the most popular items, ads, etc to them. Can we just run a Regression?  Or Naive Bayes ?  Or just use the raw empirical estimates?

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

Sure, if you are simply estimating historical frequencies,  n-gram statistics, etc, then a simple method like Naive Bayes is fine. But to predict the future..

the Reason we need Regularization is for Generalization.  

After all, we don’t just want to make the same recommendations over and over.

Unless you are just in love with ad re-targeting (sic).   We want to predict on things we have never seen.

And we need to correct for presentation bias when collecting data on things we have seen.

Most importantly, we need methods that don’t break down unexpectedly.

In this post, we will see why Regularization is subtle and non-trivial even in seemingly simple linear models like the classic Tikhonov Regularization.   And something that is almost never discussed in modern classes.

We demonstrate that ‘strong’ overtraining is accompanied by a Phase Transition–and the optimal solution lies just below this.

The example is actually motivated by something I saw recently in my consulting practice–and it was not obvious at first.

 

Still,  please realize–we are about to discuss a pathological case hidden in the dark corners of Tikhonov Regularization–a curiosity for mathematicians.  

See this Quora post on SVD in NLP for some additional motivations.

Let’s begin where Vapnik [1], Smale [2], and the other great minds of the field begin:

(Regularized) Linear Regression:

The most basic method in all statistics is linear regression

\mathbf{X}\mathbf{w}=\mathbf{y} .


It is attributed to Gauss, one of the greatest mathematicians ever.

One solution is to assume Gaussian noise and minimize the error.

The solution can also be constructed using the Moore-Penrose PseudoInverse.

If we multiply by  \mathbf{X}^{\dagger} to the left on both sides, we obtain

\mathbf{X}^{\dagger}\mathbf{X}\mathbf{w}=\mathbf{X}^{\dagger}\mathbf{y}

The formal solution is

\mathbf{\hat{w}}=(\mathbf{X}^{\dagger}\mathbf{X})^{-1}\mathbf{X}^{\dagger}\mathbf{y}

which is only valid when we can actually invert the data covariance matrix \mathbf{X}^{\dagger}\mathbf{X} .

We say the problem is well-posed when (\mathbf{X}^{\dagger}\mathbf{X})^{-1}

  1.     exists,  
  2.   is unique, and
  3. is stable. 

Otherwise we say it is singular, or ill-posed [1].

In nearly all practical methods, we would not compute the inverse directly, but use some iterative technique to solve for \mathbf{\hat{w}}. Even when the matrix seems invertible, or even well posed, it may have numerical instabilities, either due to the data or the numerical solver.  Vapnik refers to these as stochastically ill-posed problems [4].

Frequently \mathbf{X}^{\dagger}\mathbf{X} cannot be inverted…or should not be inverted…directly.
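To make this concrete, here is a minimal numpy sketch (synthetic, made-up data) comparing the pseudo-inverse solution to a naive normal-equations solve; the condition number of \mathbf{X}^{\dagger}\mathbf{X} is the thing to watch:

```python
import numpy as np

# A minimal sketch: solve X w = y by ordinary least squares.
# X, w_true, and the noise level are made-up illustrative values.
rng = np.random.default_rng(0)
n_instances, n_features = 100, 10
X = rng.normal(size=(n_instances, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.1 * rng.normal(size=n_instances)

# Formal solution via the Moore-Penrose pseudo-inverse ...
w_pinv = np.linalg.pinv(X) @ y

# ... and via the normal equations (X^T X)^{-1} X^T y, which can be
# numerically unstable when X^T X is ill-conditioned.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.linalg.cond(X.T @ X))       # condition number: large means trouble
print(np.allclose(w_pinv, w_normal))
```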

Tikhonov Regularization

A  trivial solution is to simply ‘regularize’ the matrix \mathbf{X}^{\dagger}\mathbf{X} by adding a small, non-zero value to the diagonal:

(\mathbf{X}^{\dagger}\mathbf{X})^{-1}\rightarrow(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{-1} .

Naively, this seems like it would dampen instabilities and allow for a robust numerical solution.  And it does…in most cases.

If you want to sound like you are doing some fancy math, give it a Russian name; we call this Tikhonov Regularization [4]:

\mathbf{\hat{w}}(\lambda)=(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^{\dagger}\mathbf{y}

This is (also) called Ridge Regression in many common packages such as scikit-learn.

The parameter \lambda is the regularization parameter, usually found with Cross Validation (CV).  While a data science best practice, CV can fail!

Grid searching \lambda , we can obtain the infamous L-curve:


The L-curve is a log-log plot of the norm of the regularized solution vs. the norm of the residual.  The best \lambda lies somewhere along the bottom of the L, but we can’t really tell where (even with cross validation).
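Here is a minimal sketch of how one might trace out the L-curve with scikit-learn (synthetic data, made-up sizes):

```python
import numpy as np
from sklearn.linear_model import Ridge

# A minimal sketch of the L-curve: grid-search lambda and record the
# solution norm and residual norm on synthetic data (all values made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X @ rng.normal(size=50) + rng.normal(size=200)

lambdas = np.logspace(-8, 4, 50)
sol_norms, res_norms = [], []
for lam in lambdas:
    ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
    sol_norms.append(np.linalg.norm(ridge.coef_))
    res_norms.append(np.linalg.norm(y - X @ ridge.coef_))

# Plotting log(res_norms) vs log(sol_norms) traces out the L-curve;
# the 'corner' is a candidate for the best lambda.
```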

Why is this challenge of regularization absent from basic machine learning classes?

The optimal \lambda is actually hard to find.   Cross Validation (CV) only works in ideal cases.  And we need a CV metric, like R^{2} , which also only works in ideal cases.

We might naively imagine \lambda can just be very small–in effect, turning off the regularizer.  This is seen in the curve above, where the solution norm diverges.

The regularization fails in these 2 notable cases, when

  1.  the model errors are correlated,  which fools simple cross validation
  2.  \lambda\rightarrow 0 and the number of features ~ the number of training instances, which leads to a phase transition. 

Today we will examine this curious, spurious behavior in case 2 (and look briefly at 1)

Regimes of Applications


As von Neumann said once, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk”

To understand where the regularization can fail, we need to distinguish between the cases in which Ridge Regression is commonly used.

Say we have N_{f} features, and N_{I} instances.  We distinguish between the High Bias and High Variance regimes.

 

High Variance: N_{f}\gg N_{I}

This regime is underdetermined: complicated models that are subject to overtraining.   We find multiple solutions which satisfy our constraints.

Most big data machine learning problems lie here.  We might have 100M documents and 100K words.  Or 1M images and simply all pixels as the (base) features.

Regularization lets us pick the best solution: the one which is the most smooth (L2 norm), or the most sparse (L1 norm), or maybe even the one with the least presentation bias (i.e. using Variance regularization to implement Counterfactual Risk Minimization).

High Bias regime: N_{f}\ll N_{I}

This is the Overdetermined regime, and any solution we find, regularized or not, will most likely generalize poorly.

When we have more instances than features, there is generally no exact solution at all (let alone a unique one).


Still, we can pick a regularizer, and the effect is similar to dimensional reduction. Tikhonov Regularization is similar to truncated SVD (explained below).

Any solution we find may work, but the predictions will be strongly biased towards the training data.

But it is not just sampling variability that can lead to poor predictions.

In between these 2 limits is a seemingly harmless case; however, this is really a

Danger Zone:  N_{f}\approx N_{I}

This is a rare case where the number of features = the number of instances.

This does come up in practical problems, such as classifying a large number of small text phrases.  (Something I have been working on with one of my larger clients)


The number of phrases ~ the number of unique words.

This is a dangerous case … not only because it seems so simple … but because the general theory breaks down.  Fortunately, it is only pathological in

The limit of zero regularization

We examine how these different regimes behave for small values of \lambda

\underset{\lambda\rightarrow 0}{\lim}\;(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^{\dagger}\mathbf{y}

Let’s formalize and consider how the predicted accuracy behaves.

The Setup:  An Analytical Formulation

We analyze our regression under Gaussian noise \mathbf{\xi} .  Let

\mathbf{y}=\mathbf{X}\mathbf{w}^{*}+\mathbf{\xi}

where \mathbf{\xi} is a Gaussian random variable with zero mean and variance \sigma^{2}

\mathbf{\xi}=N(0,\sigma^{2})

This simple model lets us analyze predicted Accuracy analytically.

Let

\mathbf{\hat{w}} be the optimal Estimate, and

\mathbf{w^{*}} be the Ground Truth, and

\mathbf{\bar{w}}=\mathbb{E}_{\xi}[\mathbf{\hat{w}}]  be the expected (mean) estimator.

We would like to define, and then decompose, the Generalization Error into Bias and Variance contributions.

To do this, we need to work out how the estimator behaves as a random variable. Frequently, in Machine Learning research, one examines the worst case scenario; here, we use the average case.

Define the Estimation Error

Err(\mathbf{\hat{w}})=\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{w^{*}}\Vert^{2}

Notice that by Generalization Error, we usually want to know how our model performs on a hold-out set or test point \mathbf{x} :

\mathbb{E}_{\mathbf{\xi}}[(\mathbf{x^{\dagger}}(\hat{\mathbf{w}}-\mathbf{w^{*}}))^{2}]

In the Appendix, we work out the generalization bounds for the worst case and the average case.  From here on, however, when we refer to Generalization Error, we mean the average case Estimation Error (above).

Bias-Variance Tradeoff

For Ridge Regression, the mean estimator is

\mathbf{\bar{w}}=(\mathbf{X^{\dagger}X}+\lambda\mathbf{I})^{-1}\mathbf{X^{\dagger}X}\mathbf{w^{*}}

and its variation is

\mathbf{\hat{w}}-\mathbf{\bar{w}}=(\mathbf{X^{\dagger}X}+\lambda\mathbf{I})^{-1}\mathbf{X^{\dagger}}\mathbf{\xi}

We can now decompose the Estimation Error into 2 terms:

Err(\mathbf{\hat{w}})=\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}+\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}\;\;,  where

\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2} is the (squared) Bias, and

\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}  is the Variance,

and examine each regime in the limit \lambda\rightarrow 0

The Bias is just the error of the average estimator:

\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}=\lambda^{2}\Vert(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{w^{*}}\Vert^{2}

and the Variance is the trace of the Covariance of the estimator

\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}=\sigma^{2}\;Tr\bigg(\dfrac{\mathbf{X}^{\dagger}\mathbf{X}}{(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{2}}\bigg)

We can examine the limiting behavior of these statistics by looking at the leading terms in a series expansion for each. Consider the Singular Value Decomposition (SVD) of X

\mathbf{X}=\mathbf{U}\mathbf{S}\mathbf{V}^{\dagger} .

Let \{s_{1},s_{2},\cdots,s_{m}\} be the positive singular values, and \{\mathbf{v}_{i}\} be the right singular vectors.

We can write the Bias as

\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}=\sum\limits_{1}^{N_{f}}\big(\dfrac{\lambda\mathbf{v}_{i}^{\dagger}\mathbf{w}^{*}}{s_{i}^{2}+\lambda}\big)^{2}

With no regularization, when the number of features exceeds the number of instances, there is a strong bias, determined by the ‘extra’ singular directions

\lim\limits_{\lambda\rightarrow 0}\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}=\sum\limits_{N_{I}+1}^{N_{f}}(\mathbf{v}_{i}^{\dagger}\mathbf{w}^{*})^{2}\;\;when\;\;N_{f}\gg N_{I}

Otherwise, the Bias vanishes.

\lim\limits_{\lambda\rightarrow 0}\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}=0\;\;when\;\;N_{f}\le N_{I}

So the Bias appears to only matter in the underdetermined case.

(Actually, this is misleading; in the overdetermined case, we can introduce a bias when tuning the regularization parameter too high)

In contrast, the Variance

\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}=\sigma^{2}\sum\limits_{1}^{m}\dfrac{s_{i}^{2}}{(s^{2}_{i}+\lambda)^{2}}

\lim\limits_{\lambda\rightarrow 0}\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}=\sigma^{2}\sum\limits_{1}^{m}s_{i}^{-2}

can diverge if the minimal singular value is small:

\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}\rightarrow\infty \;\;when\;\; s_{min}\sim 0

That is,  if the singular values decrease too fast, the variance can explode.

And this can happen in the danger zone

When N_{f}\sim N_{I} the variance can be infinite! 

Which also means the Central Limit Theorem (CLT) breaks down…at least how we normally use it.

Traditionally, the (classical) CLT says that the infinite sum of i.i.d. random variables \{x_{i}\}  converges to a Gaussian distribution–when the variance of the {x_{i}} is finite.  We can, however, generalize the CLT to show that, when the variance is infinite, the sum converges to a Levy or \alpha -stable, power-law distribution.

These power-law distributions are quite interesting–and arise frequently in chemistry and physics during a phase transition. But first, let’s see the divergent behavior actually arise.
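As a quick numeric check first, here is a minimal sketch that evaluates the variance sum above for random Gaussian data (made-up sizes, \lambda=10^{-6}); the sum blows up when the number of instances passes through the number of features:

```python
import numpy as np

# A minimal sketch: the variance term sigma^2 * sum_i s_i^2 / (s_i^2 + lambda)^2
# for random Gaussian X, as N_I passes through N_f (all sizes are made up).
def variance_term(n_instances, n_features, lam=1e-6, sigma2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_instances, n_features)) / np.sqrt(n_instances)
    s = np.linalg.svd(X, compute_uv=False)
    return sigma2 * np.sum(s**2 / (s**2 + lam)**2)

for n_i in (20, 40, 50, 57, 65, 100, 500):
    print(n_i, variance_term(n_i, n_features=57))
# The sum is largest when n_i is close to n_features, where the smallest
# singular value approaches zero and the lambda -> 0 limit diverges.
```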

When More is Less:  Singularities

Ryota Tomioka [3] has worked out some excellent examples using standard data sets.


[I will work up some ipython notebook examples soon]

The Spambase dataset is a great example.  It has N_{f}=57 features and N_{I}=4601 instances. We run a Ridge Regression with training set sizes N_{I}\in(0,4601] and \lambda=10^{-6} .

As the training set size N_{I} increases, the accuracy drops sharply to zero at N_{I}=N_{f}=57 , and then increases sharply again, saturating at N_{I}\sim 10^{3} .

Many other datasets show this anomalous behavior at N_{I}=N_{f} , as predicted by our analysis above.
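Until the notebooks are ready, here is a rough sketch of the same experiment on synthetic data (not the real Spambase; all sizes and noise levels are made up). The test accuracy of ridge regression used as a classifier should dip as the training set size passes through N_{f}:

```python
import numpy as np
from sklearn.linear_model import Ridge

# A sketch of the experiment above on synthetic data: fix N_f features,
# sweep the training set size N_I through N_f, and watch the test accuracy
# of ridge regression (used as a classifier) dip near N_I = N_f.
rng = np.random.default_rng(0)
n_features, lam = 57, 1e-6
w_true = rng.normal(size=n_features)
X_test = rng.normal(size=(2000, n_features))
y_test = np.sign(X_test @ w_true)

for n_i in (20, 40, 57, 80, 200, 1000):
    X = rng.normal(size=(n_i, n_features))
    y = np.sign(X @ w_true + 0.5 * rng.normal(size=n_i))
    ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
    acc = np.mean(np.sign(X_test @ ridge.coef_) == y_test)
    print(n_i, round(acc, 3))
```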

Ridge Regression vs Logistic Regression: Getting out of the Hole

Where Ridge Regression fails, Logistic Regression bounces back.  Below we show some examples of RR vs LR in the N_{I}=N_{f} danger zone.

 


Logistic Regression still has problems, but it is nowhere near as pathological as RR.

The reason for this is subtle.

  • In Regression, the variance goes to infinity
  • In Classification, the norm \Vert\mathbf{\hat{w}}\Vert goes to infinity

Traditional ML Theory:  Isn’t the Variance Bounded?

Traditional theory (very loosely) bounds quantities like the generalization error and (therefore) the variance.

We work out some examples in the Appendix.

Clearly these bounds don’t work in all cases.  So what do they really tell us?  What’s the point?

These theories give us some confidence that we can actually apply a learning algorithm–but only when the regularizer is not too small and the noise is not too large.

They essentially try to get us far away from the pathological phase transition that arises when the variance diverges.

Cross Validation and Correlated Errors

In practice, we would never just cross our fingers and set \lambda=0.00000001 .

We pick a hold-out set and run Generalized Cross Validation (GCV). Yet this can fail when

  1.  the model errors \xi are not i.i.d. but are strongly correlated
  2.  we don’t know what accuracy metric to use.

Hansen [5, 7] has discussed these issues in detail…and he provides an alternative method, the L-curve, which attempts to balance the size of the regularized solution and the residual.


For example, we can sketch the relative errors \dfrac{\Vert w-w(\lambda)\Vert}{\Vert w\Vert_{2}} for the L-curve and GCV methods (see [7]); the L-curve relative errors are more Gaussian than those from GCV.

In other words, most packages provide only a small choice of regression metrics, and defaults like R^{2} may not be good enough.  Even using the 2-norm may not represent the errors well. As data scientists, we may need to develop our own metrics to get a good \lambda for a regression (or maybe just use an SVM regression).

Hansen claims that,”experiments confirm that whenever GCV finds a good regularization parameter, the corresponding solution is located at the corner of the L-curve.” [7]

This means that the optimal regularized solution lives ‘just below’ the phase transition, where the norm diverges.

Phase Transitions in Regularized Regression


What do we mean by a phase transition?

The simplest thing is to plot a graph and look for a steep cliff or sharp change.

Here, the generalization accuracy drops off suddenly as \dfrac{N_{f}}{N_{I}}\rightarrow 1\;\;and\;\;\lambda\rightarrow 0

In a physical phase transition, the fluctuations (i.e. variances and norms) in the system approach infinity.

Imagine watching a pot boil


We see bubbles of all sizes, small and large.  The variation in bubble size is all over the map.  This is characteristic of a phase transition: i.e. water to gas.

When we cook, however, we frequently simmer our foods–we keep the water just below the boiling point.  This is as hot as the water can get before it changes to gas.

Likewise, it appears that in Ridge Regression, we frequently operate at a point just below the phase transition–where the solution norm explodes.  And this is quite interesting.  And, I suspect, may be important generally.

In chemical physics, we need special techniques to treat this regime, such as the Renormalization Group. Amazingly, Deep Learning / Restricted Boltzmann Machines  look very similar to the Variational Renormalization Group method.

In our next post, we will examine this idea further, by looking at the phase diagram of the traditional Hopfield Neural Network, and the idea of Replica Symmetry Breaking in the Statistical Mechanics of Generalization. Stay tuned!

References

1. Vapnik and Izmailov, V -Matrix Method of Solving Statistical Inference Problems, JMLR 2015

2. Smale, On the Mathematical Foundations of Learning, 2002

3. https://github.com/ryotat

4. http://ttic.uchicago.edu/~gregory/courses/LargeScaleLearning/lectures/proj_learn1.pdf

5.  The L-curve and its use in the numerical treatment of inverse problems

6.  Discrete Ill-Posed and Rank-Deficient Problems

7.  The use of the L-curve in the regularization of discrete ill-posed problems, 1993

Appendix:  some math

Bias-Variance Decomposition

\mathbf{y}=\mathbf{X}\mathbf{w}+\mathbf{\xi} ,

when \mathbb{E}[\mathbf{\xi}]=0 ,

then Cov(\mathbf{\xi})=\sigma^{2}\mathbf{I}

Let

\mathbf{\Sigma}=\dfrac{1}{N_{i}}\mathbf{X^{\dagger}X}

be the data covariance matrix.  And , for convenience, let

\lambda_{n}=\dfrac{\lambda}{n} .

The Expected value of the optimal estimator, assuming Gaussian noise, is given using the Penrose Pseudo-Inverse

\mathbb{E}\mathbf{\hat{w}}=(\mathbf{\Sigma}+\lambda_{n}\mathbf{I})^{-1}\mathbf{\Sigma}\mathbf{w^{*}}

The Covariance is

Cov(\mathbf{\hat{w}})=\dfrac{\sigma^{2}}{n}(\mathbf{\Sigma}+\lambda_{n}\mathbf{I})^{-1}\mathbf{\Sigma}(\mathbf{\Sigma}+\lambda_{n}\mathbf{I})^{-1}

The regularized Covariance matrix arises so frequently that we will assign it a symbol

\mathbf{\Sigma}_{\lambda_{n}}=\mathbf{\Sigma}+\lambda_{n}\mathbf{I}

We can now write down the mean and covariance

\mathbb{E}_{\xi}\Vert\hat{w}-\bar{w}\Vert^{2}=Tr(Cov(\mathbf{\hat{w}}))

[more work needed here]

Average and Worst Case Generalization Bounds

Consider the Generalization Error for a test point \mathbf{x}

\mathbb{E}_{\mathbf{\xi}}[(\mathbf{x^{\dagger}}(\hat{\mathbf{w}}-\mathbf{w^{*}}))^{2}] =\lambda^{2}_{n}[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w^{*}}]^{2}+\dfrac{\sigma^{2}}{n}\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{\Sigma}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{x}

We can now obtain some very simple bounds on the Generalization Accuracy (just like a bona fide ML researcher).

The worst case bounds

[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w^{*}}]^{2}\le\Vert\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{x}\Vert^{2}\Vert\mathbf{w^{*}}\Vert^{2}

The average case bounds

\mathbb{E}_{\mathbf{w^{*}}}[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w^{*}}]^{2}\le\alpha^{-1}\Vert\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{x}\Vert^{2} ,

where we assume the ‘covariance*’ is

\mathbb{E}_{\mathbf{w^{*}}}[\mathbf{w^{*}w^{*\dagger}}]=\alpha^{-1}\mathbf{I}

Discrete Picard Condition

We can use the SVD approach to make stronger statements about our ability to solve the inversion problem

\mathbf{Xw}=\mathbf{y} .

I will sketch an idea called the “Discrete Picard Condition” and provide some intuition

Write \mathbf{X} as a sum over singular vectors

\mathbf{X}=\mathbf{U\Sigma V^{\dagger}}=\sum\limits_{i=1}^{N}\mathbf{u_{i}}\sigma_{i}\mathbf{v_{i}^{\dagger}} .

Introduce an abstract PseudoInverse,  defined by

\mathbf{X}^{-\dagger}\mathbf{X}=\mathbf{I}

Express the formal solution as

\mathbf{w}=\mathbf{X}^{-\dagger}\mathbf{y}

Imagine the SVD of the PseudoInverse is

\mathbf{X}^{-\dagger}=\mathbf{V\Sigma^{-1}U^{\dagger}}=\sum\limits_{i=1}^{N}\mathbf{v_{i}}\sigma^{-1}_{i}\mathbf{u_{i}^{\dagger}} .

and use it to obtain

\mathbf{w}=\sum\limits_{i=1}^{N}\dfrac{\mathbf{u^{\dagger}_{i}y}}{\sigma_{i}}\mathbf{v_{i}} .

For this expression to be meaningful, the SVD coefficients \mathbf{u^{\dagger}_{i}y} must decay, on average, faster than the corresponding singular values \sigma_{i} .

Consequently, the only coefficients that carry any information are larger than the noise level.  The small coefficients have small singular values and are dominated by noise.    As the problem becomes more difficult to solve uniquely, the singular values decay more rapidly.

If we discard the singular components k+1,\cdots,N , we obtain a regularized, unsupervised solution called

Truncated SVD

\mathbf{w}=\sum\limits_{i=1}^{k}\dfrac{\mathbf{u^{\dagger}_{i}y}}{\sigma_{i}}\mathbf{v_{i}} .

which is similar in spirit to spectral clustering.
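A minimal numpy sketch of TSVD on made-up data, which also exposes the Picard-style coefficients \mathbf{u^{\dagger}_{i}y} discussed above:

```python
import numpy as np

# A minimal sketch of Truncated SVD regularization: keep only the first k
# singular components of X when forming w (sizes and data are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 100))
y = X @ rng.normal(size=100) + 0.1 * rng.normal(size=100)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
coeffs = U.T @ y               # the SVD coefficients u_i^T y
# Discrete Picard condition: |u_i^T y| should decay faster than s_i.
k = 20                         # keep the k largest singular components
w_tsvd = Vt[:k].T @ (coeffs[:k] / s[:k])
```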

 

Why Deep Learning Works II: the Renormalization Group

Deep Learning is amazing.  But why is Deep Learning so successful?  Is Deep Learning just old-school Neural Networks on modern hardware?  Is it just that we have so much data now that the methods work better?  Is Deep Learning just really good at finding features?  Researchers are working hard to sort this out.

Recently it has been shown that [1]

Unsupervised Deep Learning implements the Kadanoff Real Space Variational Renormalization Group (1975)

This means the success of Deep Learning is intimately related to some very deep and subtle ideas from Theoretical Physics.  In this post we examine this.

Unsupervised Deep Learning: AutoEncoder Flow Map

An AutoEncoder is an Unsupervised Deep Learning algorithm that learns how to represent a complex image or other data structure X .   There are several kinds of AutoEncoders; we care about so-called Neural Encoders–those using Deep Learning techniques to reconstruct the data:


The simplest Neural Encoder is a Restricted Boltzmann Machine (RBM).  An RBM is a non-linear, recursive, lossy function f(X) that maps the data X from visible nodes {v}  into hidden nodes {h} :


The RBM is learned by selecting the optimal parameters {b_{v}},{c_{h}},{w_{v,h}} that minimize some measure of the reconstruction error (see Training RBMs, below)

\min\Vert f(X)-X\Vert
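For intuition, here is a minimal scikit-learn sketch on made-up binary data; note that BernoulliRBM is actually trained with Persistent Contrastive Divergence rather than by minimizing this reconstruction error directly, so the error is only monitored here:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# A minimal sketch: fit an RBM to binary data and measure a reconstruction
# error via one Gibbs step (data and hyperparameters are made-up values).
rng = np.random.default_rng(0)
X = (rng.random(size=(500, 64)) > 0.5).astype(float)

rbm = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)
X_recon = rbm.gibbs(X)                      # v -> h -> v reconstruction
print(np.mean((X - X_recon) ** 2))          # a rough f(X) ~ X check
```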

RBMs and other Deep Learning algos are formulated using classical Statistical Mechanics.  And that is where it gets interesting!

Multi Scale Feature Learning

In machine learning (ML), we map (visible) data into (hidden) features

\mathbf{v}(X)\rightarrow\mathbf{h}

The hidden units discover features at a coarser grain level of scale


With RBMs, when features are complex, we may stack them into a Deep Belief Network (DBN), so that we can learn at different levels of scale


and leads to multi-scale features in each layer


Deep Belief Networks are a Theory of Unsupervised MultiScale Feature Learning

Fixed Points and Flow Maps

We call f(X)  a flow map

f(X)\rightarrow X

If we apply the flow map to the data repeatedly, (we hope) it converges to a fixed point

\lim\limits_{n\rightarrow\infty}f^{n}(f^{n-1}(\cdots(f^{1}(f^{0}(X)))))\rightarrow f_{\infty}(X)

Notice that we usually expect to apply the same map each time f^{n}(x)=f(x) , however, for a computational theory we may need more flexibility.
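A toy sketch of the idea (the map here is completely made up, chosen only because it is an obvious contraction):

```python
import numpy as np

# A toy flow map: repeatedly applying a contractive map f to the data
# converges to a fixed point f_inf(X).
def f(X):
    return 0.5 * X + 1.0           # linear contraction; fixed point X = 2

X = np.random.default_rng(0).normal(size=5)
for _ in range(100):
    X = f(X)
print(X)                            # all entries are now ~2, the fixed point
```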

Example: Linear Flow Map

The simplest example of a flow map is the simple linear map

X\rightarrow CX

so that

f(X)\sim CX

where C is a non-negative, low rank matrix

\mid C\mid\,\ll\,\mid X\mid

We have seen this before: this leads to a Convex form of NonNegative Matrix Factorization NMF


Convex NMF applies when we can specify the feature space and where the data naturally clusters.  Here, there are a few instances that are archetypes that define the convex hull of the data.

\{\mathbf{x}_{c}\}\in X\,,c=1,\cdots,\mid C\mid

Amazingly, many clustering problems are provably convex–but that’s a story for another post.
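scikit-learn does not ship the Convex NMF variant, but standard NMF already illustrates the low-rank linear map X ~ WH; a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.decomposition import NMF

# A minimal sketch with standard (not Convex) NMF from scikit-learn, just to
# illustrate the low-rank factorization X ~ W H (data and rank are made up).
rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 30)))

nmf = NMF(n_components=5, init='nndsvda', max_iter=500, random_state=0)
W = nmf.fit_transform(X)     # instance loadings on the 5 'archetypes'
H = nmf.components_          # the archetypal features
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))
```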

Example: Manifold Learning

Near a fixed point, we commonly approximate the flow map by a linear operator

f_{\infty}(X) \sim\mathbf{L}(X)

This lets us capture the structure of the true data manifold, and is usually described by the low lying eigen-spectra of

\mathbf{L}(X)=\sum\limits_{i=0}\lambda_{i}\hat{\mathbf{v}}_{i}\hat{\mathbf{v}}_{i}^{\dagger}\sim\lambda_{0}\hat{\mathbf{v}}_{0}\hat{\mathbf{v}}_{0}^{\dagger}+\lambda_{1}\hat{\mathbf{v}}_{1}\hat{\mathbf{v}}_{1}^{\dagger}+\cdots

In the same spirit, in Semi-Supervised & Unsupervised Manifold Learning, we model the data using a Laplacian operator \mathbf{L}(\sigma) , usually parameterized by a single scale parameter \sigma .


These methods include Spectral Clustering, Manifold Regularization, Laplacian SVM, etc.  Note that manifold learning methods, like the Manifold Tangent Classifier, employ Contractive Auto Encoders and use several scale parameters to capture the local structure of the data manifold.
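For instance, a minimal scikit-learn sketch of a Laplacian-based embedding with a single RBF scale parameter (the dataset and the gamma value are made up):

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

# A minimal sketch: approximate the data manifold by the low-lying spectrum
# of a graph Laplacian built from an RBF kernel with one scale parameter.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=300)
X = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(300, 2))  # a noisy circle

embedding = SpectralEmbedding(n_components=2, affinity='rbf', gamma=10.0)
X_low = embedding.fit_transform(X)   # coordinates from the Laplacian eigenvectors
```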

The Renormalization Group

In chemistry and physics, we frequently encounter problems that require a multi-scale description.   We need this for critical points and phase transitions, for natural crashes like earthquakes and avalanches,  for polymers and other macromolecules, for strongly correlated electronic systems, for quantum field theory, and, now, for Deep Learning.

A unifying idea across these systems is the Renormalization Group (RG) Theory.

Renormalization Group Theory is both a conceptual framework on how to think about physics on multiple scales as well as a technical & computational problem solving tool.

Ken Wilson won the 1982 Nobel Prize in Physics for the development and application of his Momentum Space RG theory to phase transitions.

We used RG theory to model the recent BitCoin crash as a phase transition.

Wilson invented modern multi-scale modeling; the so-called Wilson Basis was an early form of Wavelets.  Wilson was also a big advocate of using supercomputers for solving problems.  Being a Nobel Laureate, he had great success promoting scientific computing.  It was thanks to him I had access to a Cray Y-MP when I was in high school because he was a professor at my undergrad, The Ohio State University.

 Here is the idea.  Consider a feature map  which transforms the data X to a different, more coarse grain scale

x\rightarrow\phi_{\lambda}(x)

The RG theory requires that the Free Energy F(x) is rescaled, to reflect that

the Free Energy is both Size-Extensive and Scale Invariant near a Critical Point

This is not obvious — but it is essential to both having a conceptual understanding of complex, multi scale phenomena, and it is necessary to obtain very highly accurate numerical calculations.  In fact, being size extensive and/or size consistent is absolutely necessary for highly accurate quantum chemistry calculations of strongly correlated systems.  So it is pretty amazing but perhaps not surprising that this is necessary for large scale deep learning calculations also!

The Fundamental Renormalization Group Equation (RGE)

\mathcal{F}(x)=g(x)+\dfrac{1}{\lambda}\mathcal{F}(\phi_{\lambda}(x))

If we (can) apply the same map, F(x) , repeatedly, we obtain a RG recursion relation, which is the starting point for most analytic work in theoretical physics.   It is usually difficult to obtain an exact solution to the RGE (although it is illuminating when possible [20]).

Many RG formulations both approximate the exact RGE and/or only include relevant variables. To describe a multiscale system, it is essential to distinguish between these relevant and irrelevant variables.

Example: Linear Rescaling

Let’s say the feature map is a simple linear rescaling

\phi(x)=\lambda x

We can obtain a very elegant, approximate RG solution where F(x) obeys a complex (or log-periodic) power law.

\mathcal{F}(x)\sim x^{-(\alpha+i\beta)}

This behavior is thought to characterize Per-Bak style Self-Organized Criticality (SOC), which appears in many natural systems–and perhaps even in the brain itself.   Which leads to the argument that perhaps Deep Learning and Real Learning work so well because they operate like a system just near a phase transition–also known as the Sand Pile Model--operating at a state between order and chaos.

the Kadanoff Variational Renormalization Group (1975)

Leo Kadanoff, now at the University of Chicago, invented some of the early ideas in Renormalization Group.  He is most famous for the Real Space formulation of RG, sometimes called the Block Spin approach.  He also developed an alternative approach, called the Variational Renormalization Group (VRG, 1975), which is, remarkably, what Unsupervised RBMs are implementing!

Let’s consider a traditional Neural Network–a Hopfield Associative Memory (HAM).  This is also known as an Ising model or a Spin Glass in statistical physics.

An HAM consists of only visible units; it stores memories explicitly and directly in them:


We specify the Energy — called the Hamiltonian \mathcal{H} — for the nodes.  Note that all the nodes are visible.  We write

\mathcal{H}^{HAM}=-\sum\limits_{i}B_{i}v_{i}-\sum\limits_{i,j}J_{i,j}v_{i}v_{j}

The Hopfield model has only single B_{i} and pair-wise J_{i,j} interactions.

A general Hamiltonian might have many-body, multi-scale interactions:

\mathcal{H}(v)=-\sum\limits_{i}K_{i}v_{i}-\sum\limits_{i,j}K_{i,j}v_{i}v_{j}-\sum\limits_{i,j,k}K_{i,j,k}v_{i}v_{j}v_{k}-\cdots

The Partition Function is given as

\mathcal{Z}=\sum\limits_{v}e^{-\mathcal{H}(v)}

And the Free Energy is

\mathcal{F}^{v}=-\ln\mathcal{Z}=-\ln\sum\limits_{v}e^{-\mathcal{H}(v)}
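For a tiny system we can compute \mathcal{Z} and \mathcal{F} by brute force; a minimal numpy sketch with made-up couplings:

```python
import numpy as np
from itertools import product

# A minimal sketch: brute-force the partition function Z and free energy F
# for a tiny Hopfield-style Hamiltonian with random couplings (made-up sizes).
rng = np.random.default_rng(0)
n = 8
B = rng.normal(size=n)
J = rng.normal(size=(n, n)); J = (J + J.T) / 2; np.fill_diagonal(J, 0)

def H(v):
    return -B @ v - v @ J @ v

Z = sum(np.exp(-H(np.array(v))) for v in product([-1, 1], repeat=n))
F = -np.log(Z)
print(Z, F)
```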

The idea was to mimic how our neurons were thought to store memories–although perhaps our neurons do not even do this.

Either way, Hopfield Neural Networks have many problems; most notably they may learn spurious patterns that never appeared in the training set. So they are pretty bad memories.

Hinton created the modern RBM to overcome the problems of the Hopfield model.  He used hidden units to represent the features in the data–not to memorize the data examples directly.


An RBM is specified by an Energy function for both the visible and hidden units

\mathbf{E}(v,h)=\mathbf{v}^{t}\mathbf{b}+\mathbf{v}^{t}\mathbf{W}\mathbf{h}+\mathbf{c}^{t}\mathbf{h}

This also defines the joint probability of simultaneously observing a configuration of hidden and visible spins

P(v,h)=\dfrac{e^{-\mathbf{E(v,h)}}}{\mathcal{Z}}

which is learned variationally, by minimizing the reconstruction error…or the cross entropy (KL divergence), plus some regularization (Dropout), using Greedy layer-wise unsupervised training, with the Contrastive Divergence (CD or PCD) algo, …

The specific details of an RBM Energy are not addressed by these general concepts; these details do not affect these arguments–although clearly they matter in practice !

It turns out that

Introducing Hidden Units in a Neural Network is a Scale Renormalization.  

When changing scale, we obtain an Effective Hamiltonian \tilde{\mathcal{H}}  that acts on the new feature space (i.e. the hidden units)

\mathcal{H}(v)\rightarrow\tilde{\mathcal{H}}(h)

or, in operator form

\tilde{\mathcal{H}}(h)=\mathbb{R}[\mathcal{H}(v)]

This Effective Hamiltonian is not specified explicitly, but we know it can take the general form (of a spin funnel, actually)

\tilde{\mathcal{H}}(h)=-\sum\limits_{i}\tilde{K}_{i}h_{i}-\sum\limits_{i,j}\tilde{K}_{i,j}h_{i}h_{j}-\sum\limits_{i,j,k}\tilde{K}_{i,j,k}h_{i}h_{j}h_{k}-\cdots

The RG transform preserves the Free Energy (when properly rescaled):

\mathcal{F}^{h}\sim\mathcal{F}^{v}

where

\mathcal{F}^{h}=-\ln\sum\limits_{h}e^{-\tilde{\mathcal{H}}(h)}

\mathcal{F}^{v}=-\ln\sum\limits_{v}e^{-\mathcal{H}(v)}

Critical Trajectories and Renormalized Manifolds

The RG theory provides a way to iteratively update, or renormalize, the system Hamiltonian.  Each time we add a layer of hidden units (h1, h2, …), we have

\mathcal{H}^{RG}(\mathbf{v},\mathbf{h1})=\mathbb{R}[\mathcal{H}(\mathbf{v})]

\mathcal{H}^{RG}(\mathbf{v},\mathbf{h1},\mathbf{h2})=\mathbb{R}[\mathbb{R}[\mathcal{H}(\mathbf{v})]]

\cdots

We imagine that the flow map is attracted to a Critical Trajectory which naturally leads the algorithm to the fixed point.  At each step, when we apply another RG transform, we obtain a new, Renormalized Manifold, each one closer to the optimal data manifold.

graph_fisher

Conceptually, the RG flow map is most useful when applied to critical phenomena–physical systems and/or simple models that undergo a phase transition.  And, as importantly, the small changes in the data should ‘wash away’ as noise and not affect the macroscopic / critical phenomena. Many systems–but not all–display this.

Where Hopfield Nets fail to be useful here, RBMs and Deep Learning systems shine.

We now show that these RG transformations are achieved by stacking RBMs and solving the RBM inference problem!

Kadanoff’s Variational Renormalization Group

As in many physics problems, we break the modeling problem into two parts:  one we know how to solve, and one we need to guess.

  1. we know the Hamiltonian at the most fine grained level of scale  \mathcal{H}(v)
  2. we seek the correlation \mathbf{V}(v,h) that couples to the next level scale

The joint Hamiltonian, or Energy function, is then given by

\mathcal{H}(\mathbf{v,h})=\mathcal{H}(\mathbf{v})+\mathbf{V(v,h)}

The Correlation V(v,h) is defined so that the partition function \mathcal{Z}  is not changed

\sum\limits_{h}e^{-\mathbf{V}\mathbf{(v,h})}=1

This gives us

\mathcal{Z}=\sum_{v}e^{-\mathcal{H}(v)}=\sum\limits_{v}\sum\limits_{h}e^{-\mathbf{V}(v,h)}e^{-\mathcal{H}(v)}

(Sometimes the Correlation V is called a Transfer Operator T, where V(v,h)=-T(v,h) )

We may now define a renormalized effective Hamiltonian \tilde{\mathcal{H}}(h) that acts only on the hidden nodes

\tilde{\mathcal{H}}(h)=-\ln\sum\limits_{v}e^{-\mathbf{V}(v,h)}e^{-\mathcal{H}(v)}

so that we may write

\mathcal{Z}=\sum\limits_{h}e^{-\tilde{\mathcal{H}}(h)}

Because the partition function does not change, the Exact RGE preserves the Free Energy (up to a scale change, which we subsume into \mathcal{Z} )

\Delta\tilde{\mathcal{F}}=\tilde{\mathcal{F}}^{h}-\mathcal{F}^{v}=0

We generally can not solve the exact RGE–but we can try to minimize this Free Energy difference.

What Kadanoff showed, way back in 1975, is that we can accurately approximate the Exact Renormalization Group Equation by finding a lower bound using this formalism

Deep learning appears to be a real-space variational RG technique, specifically applicable to very complex, inhomogenous systems where the detailed scale transformations have to be learned from the data

RBMs expressed using Variational RG

We will now show how to express RBMs using the VRG formalism and provide some intuition

In an RBM, we simply want to learn the Energy function directly; we don’t specify the Hamiltonian for the visible or hidden units explicitly, like we would in physics.  The RBM Energy is just

\mathbf{E}^{RBM}(v,h)=\mathcal{H}^{RBM}(v)-\mathbf{V}(v,h)

We identify the Hamiltonian for the hidden units as the Renormalized Effective Hamiltonian from the VRG theory

\mathcal{H}^{RBM}(h)=\tilde{\mathcal{H}}(h)

RBM Hamiltonians / Marginal Probabilities

To obtain RBM Hamiltonians for just the visible \mathcal{H}^{RBM}(v) or hidden \mathcal{H}^{RBM}(h) nodes, we need to integrate out the other nodes; that is, we need to find the marginal probabilities.

P(v)=\sum\limits_{h}P(v,h)=\dfrac{e^{-\mathcal{H}^{RBM}(v) }}{\mathcal{Z}}=\dfrac{1}{\mathcal{Z}} \sum\limits_{h}e^{-\mathbf{E(v,h)}}

\mathcal{H}^{RBM}(v)=-\ln\sum\limits_{h}e^{-\mathbf{E(v,h)}}

and

P(h)=\sum\limits_{v}P(v,h)=\dfrac{e^{-\mathcal{H}^{RBM}(h) }}{\mathcal{Z}}=\sum\limits_{v}\dfrac{e^{-\mathbf{E(v,h)}}}{\mathcal{Z}}

\mathcal{H}^{RBM}(h)=-\ln\sum\limits_{v}e^{-\mathbf{E(v,h)}}
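Again, for a tiny RBM we can do these sums by brute force. A minimal sketch (using the standard sign convention E(v,h)=-\mathbf{b}^{t}\mathbf{v}-\mathbf{v}^{t}\mathbf{W}\mathbf{h}-\mathbf{c}^{t}\mathbf{h} and 0/1 units, both of which are my choices here, not fixed by the text) which also checks that both marginalizations give the same \mathcal{Z}:

```python
import numpy as np
from itertools import product

# A minimal sketch: for a tiny random RBM energy E(v,h), compute the marginal
# Hamiltonians by brute force and check that both routes give the same Z.
rng = np.random.default_rng(0)
nv, nh = 4, 3
b, c = rng.normal(size=nv), rng.normal(size=nh)
W = rng.normal(size=(nv, nh))

def E(v, h):
    return -(b @ v + v @ W @ h + c @ h)

vs = [np.array(v) for v in product([0, 1], repeat=nv)]
hs = [np.array(h) for h in product([0, 1], repeat=nh)]

H_v = {tuple(v): -np.log(sum(np.exp(-E(v, h)) for h in hs)) for v in vs}
H_h = {tuple(h): -np.log(sum(np.exp(-E(v, h)) for v in vs)) for h in hs}

Z_from_v = sum(np.exp(-H_v[tuple(v)]) for v in vs)
Z_from_h = sum(np.exp(-H_h[tuple(h)]) for h in hs)
print(np.isclose(Z_from_v, Z_from_h))   # the free energy is preserved
```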

Training RBMs

To train an RBM, we apply Contrastive Divergence (CD), or, perhaps today, Persistent Contrastive Divergence (PCD).  We can kind of think of this as slowly approximating

\dfrac{\partial}{\partial\theta}\ln\mathcal{Z}(\theta)

In practice, however, RBM training minimizes the associated Free Energy difference \Delta\mathbf{F} … or something akin to this…to avoid overfitting.

In the “Practical Guide to Training Restricted Boltzmann Machines”, Hinton explains how to train an RBM (circa 2011).  Section 6 addresses “Monitoring the overfitting”

“it is possible to directly monitor the overfitting by comparing the free energies of training data and held out validation data…If the model is not overfitting at all, the average free energy should be about the same on training and validation data”

Other Objective Functions

Modern variants of Real Space VRG are not ‘forced’ to minimize the global free energy and have attempted other approaches such as Tensor-SVD Renormalization.  Likewise, some RBM / DBN approaches may minimize a different objective.

In some methods, we minimize the KL Divergence; this has a very natural analog in VRG language [1].

Why Deep Learning Works: Lessons from Theoretical Physics

The Renormalization Group Theory provides new insights as to why Deep Learning works so amazingly well.  It is not, however, a complete theory. Rather, it is a framework for beginning to understand what is an incredibly powerful, modern, applied tool.  Enjoy!

References

[1] An exact mapping between the Variational Renormalization Group and Deep Learning, 2014

[2] Variational Approximations for Renormalization Group Transformations, 1976

[3]  A Common Logic to Seeing Cats and Cosmos

[4] WHY DOES UNSUPERVISED DEEP LEARNING WORK? – A PERSPECTIVE FROM GROUP THEORY, 2015

[5] A Fundamental Theory to Model the Mind

[6] A Practical Guide to Training Restricted Boltzmann Machines, 2010

[7] On the importance of initialization and momentum in deep learning, 2013

[8] Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014

[9] Ken Wilson, A Scientific Appreciation

[10] http://www-math.unice.fr/~patras/CargeseConference/ACQFT09_JZinnJustin.pdf

[11] Training Products of Experts by Minimizing Contrastive Divergence, 2002

[12] Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient, 2008

[13] http://www.nonlin-processes-geophys.net/3/102/1996/npg-3-102-1996.pdf

[14] THE RENORMALIZATION GROUP AND CRITICAL PHENOMENA, Ken Wilson Nobel Prize Lecture

[15] Scaling, universality, and renormalization: Three pillars of modern critical phenomena

[16] The Potential Energy of an Autoencoder, 2014

[17]  Contractive Auto-Encoders: Explicit Invariance During Feature Extraction, 2011

[18] Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, 2010

[19] Quora:  What is Renormalization group theory?

[20] Renormalization group and critical localization, 1977

Why does Deep Learning work?


This is the big question on everyone’s mind these days.  C’mon we all know the answer already:

“the long-term behavior of certain neural network models are governed by the statistical mechanism of infinite-range Ising spin-glass Hamiltonians” [1]   In other words, 

Multilayer Neural Networks are just Spin Glasses?  Right?

This is kinda true–depending on what you mean by a spin glass.

In a recent paper by LeCun, he attempts to extend our understanding of training neural networks by studying the SGD approach to solving the multilayer Neural Network optimization problem [1].   Furthermore, he claims

None of these works however make the attempt to explain the paradigm of optimizing the highly non-convex neural network objective function through the prism of spin-glass theory and thus in this respect our approach is very novel.  And, again, this is kinda true

But here’s the thing…we already have a good idea of what the Energy Landscape of multiscale spin glass models* look like–from early theoretical protein folding work by Peter Wolynes to modern work by Ken Dill, etc [2,3,4]).  In fact, here is a typical surface:

Energy Landscape of MultiScale Spin Glass Model

*[technically these are Ising spin models with multi-spin interactions]

Let us consider the nodes, which above represent partially folded states, as nodes in a multiscale spin glass–or, say, a multilayer neural network.  Immediately we see the analogy and the appearance of the ‘Energy funnel’. In fact, researchers studied these ‘folding funnels’ of spin glass models over 20 years ago [2,3,4].  And we knew then that

as we increase the network size, the funnel gets sharper

3D Energy Landscape of the Folding Funnel of a Spin Glass

Note: the Wolynes protein-folding spin-glass model is significantly different from the p-spin Hopfield model that LeCun discusses because it contains multi-scale, multi-spin interactions.  These details matter.


Spin glasses and spin funnels are quite different.  Spin glasses are highly non-convex with lots of local minima, saddle points, etc.  Spin funnels, however, are designed to find the spin glass of minimal-frustration, and have a convex, funnel shaped, energy landscape.

Spin Funnels have been characterized and cataloged by theoretical chemists and can take many different convex shapes

This seemed to be necessary to resolve one of the great mysteries of protein folding: Levinthal’s paradox [5].  If nature just used statistical sampling to fold a protein, it would take longer than the ‘known’ lifetime of the Universe.  It is why Machine Learning is not just statistics.

Spin Funnels and Spin Glasses
Spin Funnels (DFM) vs Spin Glasses (SG) [4]

Deep Learning Networks are (probably) Spin Funnels

So with a surface like this, it is not so surprising that an SGD method might be able to find the Energy minima (called the Native State in protein folding theory). We just need to jump around until we reach the top of the funnel, and then it is a straight shot down.  This, in fact, defines a so-called ‘folding funnel’ [4]

So it is not surprising at all that SGD may work.

Recent research at Google and Stanford confirms that the Deep Learning Energy Landscapes appear to be roughly convex [6], as does LeCun's work on spin glasses.

Note that a real theory of protein folding, which would actually be able to fold a protein correctly (i.e. Freed’s approach [7]), would be a lot more detailed than a simple spin glass model.  Likewise, real Deep Learning systems are going to have a lot more engineering details–to avoid overtraining (Dropout, Pooling, Momentum) than a theoretical spin funnel.

[Indeed, what is unclear is what happens at the bottom of the funnel.  Does the system exhibit a spin glass transition (with full blown Replica Symmetry Breaking, as LeCun suggests), or is the optimal solution more like a simple native state defined by but a few low-energy configurations?  Do DL systems display a phase transition, and is it first order like protein folding?  We will address these details in a future post.]  In any case,

It is not that Deep Learning is non-convex–it is that we need to avoid over-training

And it seems modern systems can do this using a few simple tricks, like Rectified Linear Units, Dropout, etc.

Hopefully we can learn something using the techniques developed to study the energy landscapes of multi-scale spin glass / spin funnel models [8,9], thereby utilizing methods from theoretical chemistry and condensed matter physics.

In a future post, we will look in detail at the folding funnel including details such as misfolding,  the spin glass transition in the funnel, and the relationship to symmetry breaking, overtraining, and LeCun’s very recent analysis.  These are non-trivial issues.

I believe this is the first conjecture that Supervised Deep Learning is related to a Spin Funnel.   In the next post, I will examine the relationship between Unsupervised Deep Learning and the Variational Renormalization Group [10].

Learn more about us at:   http://calculationconsulting.com

[1] LeCun et. al.,  The Loss Surfaces of Multilayer Networks, 2015

[2] Spin glasses and the statistical mechanics of protein folding, PNAS, 1987

[3] THEORY OF PROTEIN FOLDING: The Energy Landscape Perspective, Annu. Rev. Phys. Chem. 1997

[4] ENERGY LANDSCAPES, SUPERGRAPHS, AND “FOLDING FUNNELS” IN SPIN SYSTEMS, 1999

[5] From Levinthal to pathways to funnels, Nature, 1997

[6] QUALITATIVELY CHARACTERIZING NEURAL NETWORK OPTIMIZATION PROBLEMS, Google Research (2015)

[7]  Mimicking the folding pathway to improve homology-free protein structure prediction, 2008 

[8] Funnels in Energy Landscapes, 2007

[9] Landscape Statistics of the Low Autocorrelated Binary String Problem, 2007

[10] A Common Logic to Seeing Cats and Cosmos, 2014

Convex Relaxations of Transductive Learning

This post is finishing up some older work of an R&D / coding project in Transductive Learning.

Why are SVMs interesting?  Is it just a better way to do Logistic Regression?  Is it the Kernel Trick?  And does this even matter now that Deep Learning is everywhere? To the beginning student of machine learning, SVMs are the first example of a Convex Optimization method.  To the advanced practitioner, SVMs are the starting point to creating powerful Convex Relaxations to hard problems.

Historically, convex optimization was seen as the path to central planning of an entire economy.  A great new book, Red Plenty, is “about the scientists who did their genuinely brilliant best to make the dream come true …” [amazon review].

It was a mass delusion over the simplex method, and it is about as crazy as our current fears over Deep Learning and AI.

Convex optimization is pretty useful, as long as we don’t get crazy about it.

The prototypical convex relaxation method is for Transductive Learning and the Transductive SVM (TSVM).

Vapnik proposed the idea of Transduction many years ago;  indeed the VC theory is proven using Transduction.   I would bet that he knew a TSVM could be convexified–although I would need a job at Facebook to verify this.

A good TSVM has been available since 2001 in SvmLight.  But SvmLight is not open source, so most people use SvmLin.

Today there are Transductive variants of Random Forests, Regression, and even Deep Learning.  There was even a recent Kaggle Contest–the Black Box Challenge (and, of course, the Deep Learning method won).  Indeed, Deep Learning classifiers may benefit greatly from Transductive/SemiSupervised pretraining with methods like Pseudo Label [8], as shown in the Kaggle The National Data Science Bowl contest.

We mostly care about binary text classification, although there is plenty of research in convex relaxation for multiclass transduction, computer vision, etc.

Transductive learning is essentially like running a SVM, but having to guess a lot of the labels.  The optimization problem is

\min_{\mathbf{y}\in\mathcal{B}}\min_{f}\,\,\Omega(f)+\sum\limits_{i=1}^{N}\lambda\mathcal{L}(\mathbf{x}_{i},y_{i})

where \mathcal{L}  is the loss function , \Omega the regularization function,

and the binary labels \mathbf{y_i}\in\mathcal{B}\mid y_{i}\in\left\{1,\,-1,\,unk\,\right\}  are only partially known.

The optimization is a non-convex, mixed-integer problem.  Amazingly, we can reformulate the TSVM to obtain a convex optimization!

This is called a Convex Relaxation, and it lets us guess the unknown labels…

Convex Relaxation

to within a good approximation, and using some prior knowledge.

Proving an approximation is truly convex is pretty hard stuff, but the basic idea is very simple.  We just want to find a convex approximation to a non-convex function.

It has been known for a while that the TSVM problem can be convexified [5].  But it has been computationally intractable and there is no widely available code.

We examine a new Convex Relaxation of the Transductive Learning called the Weakly Labeled SVM (WellSVM) [2,3].

In the Transductive SVM (TSVM) approach, one selects solutions with minimum SVM Loss (or Slack)

\mathcal{L}(\mathbf{x}_{i},y_{i})=\xi_{i}

and the maximum margin

\Omega=\dfrac{1}{2}\Vert\mathbf{w}\Vert^{2}

The SVM Dual Problem

Let us consider the standard SVM optimization

\underset{\mathbf{w,\xi}}{\min}\,\,\dfrac{1}{2}\parallel\mathbf{w}\parallel^{2}+\lambda\sum_{i=1}^{N}\xi_{i}

which has the dual form

\underset{\alpha}{\max}\,\,\alpha^{\dagger}\mathbf{1}-\dfrac{1}{2}\mathbf{y^{\dagger}}\boldsymbol\alpha^{\dagger}\mathbf{X^{\dagger}X}\boldsymbol\alpha\mathbf{y}\quad,\;C\geq\alpha_{i}\geq 0

To keep the notation simpler, we will not consider the Kernelized form of the algorithm.  Besides, we are mostly interested in text classification, and Kernels are not needed.
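As a concrete (if mundane) illustration, scikit-learn's SVC exposes the solution of this dual problem; here is a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.svm import SVC

# A minimal sketch: solve the (linear-kernel) SVM dual with scikit-learn and
# read off the dual variables alpha_i * y_i for the support vectors
# (the toy data and the C value are made up).
rng = np.random.default_rng(0)
X = np.r_[rng.normal(size=(50, 20)) + 1.0, rng.normal(size=(50, 20)) - 1.0]
y = np.r_[np.ones(50), -np.ones(50)]

svm = SVC(kernel='linear', C=1.0).fit(X, y)
alpha_times_y = svm.dual_coef_.ravel()     # signed dual coefficients
support = svm.support_                     # indices of the support vectors
print(len(support), alpha_times_y[:5])
```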

Balancing the Labels

In a TSVM, we have to guess the labels y_{i}\in\left\{1,\,-1\right\} and select the best solution (for a given set of regularization parameters).  There are way too many labels to guess, so

we need to constrain the label configurations by balancing the guesses

We assume that we have some idea of the total fraction of positive (+) labels \mathcal{B}_{+}

  • Perhaps we can sample them?
  • Perhaps we have some external source of information.
  • Perhaps we can estimate it.

This is, however, a critical piece of prior information we need. We define the space of possible labels as

\left\{ \mathbf{y}\vert\sum_{i=1}^{N}y_{i}=\beta\right\}

i.e, for exactly half positive / negative labels, then

\mathcal{B}_{+}=\dfrac{1}{2}

and the true mean label value is zero

y_{avg}=\bar{y}=0

So we are saying that if we know

  1. the true fraction \mathcal{B}_{+} of (+) labels
  2. some small set of the true labels (i.e. < 5%)
  3. the features (i.e. bag-of-words for text classification)

Then we know almost all the labels exactly!  And that is powerful.
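A minimal sketch of what a balanced label guess might look like in code (all names here are hypothetical, not from any package):

```python
import numpy as np

# A minimal sketch: generate candidate label vectors that respect the balance
# constraint, i.e. a fixed fraction B_plus of (+1) labels, while keeping the
# few labels we actually know.
def sample_balanced_labels(n, b_plus, known, rng):
    """known: dict {index: +1 or -1} of labels we trust."""
    y = np.full(n, -1)
    n_pos = int(round(b_plus * n))
    free = [i for i in range(n) if i not in known]
    n_pos_free = n_pos - sum(1 for v in known.values() if v == 1)
    y[rng.choice(free, size=n_pos_free, replace=False)] = 1
    for i, v in known.items():
        y[i] = v
    return y

rng = np.random.default_rng(0)
y_guess = sample_balanced_labels(n=20, b_plus=0.5, known={0: 1, 1: -1}, rng=rng)
print(y_guess.sum())   # ~0 when B_plus = 1/2
```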

Note:  it is critical that in any transductive method, we reduce the size of the label configuration space.  The Balancing constraint is the standard constraint–but it may be hard to implement in practice.  To me, personally, the problem resembles the matrix product states method in quantum many body theory. And I suspect that, eventually, there will be a related method found for transductive learning.  Perhaps even Deep Learning has found this.

Convex Methods

The popular SvmLin method uses a kind of Transductive Meta-Heuristic that sets the standard for other approaches.  The problem is, we never really know if we have the best solution.  And it is not easy to extend to multiclass classification.

 Convex methods have been the method of choice since Dantzig popularized the simplex method in 1950

Although linear programming itself was actually invented in 1939 by Leonid Kantorovich [7] — “the only Soviet scholar ever to win the Nobel Prize for Economics” 

A convex method lends itself to production code that anyone can run.

The WellSVM Convex Relaxation

More generally, we need to solve a non-convex min-max problem of the form

\underset{\mathbf{y\in\mathcal{B}}}{\min}\,\,\underset{\alpha\in\mathcal{A}}{\max}\, \,G(\mathbf{y,\alpha})

where the G matrix is

G(\mathbf{y,\alpha})=\alpha^{\dagger}\mathbf{1}-\dfrac{1}{2}\mathbf{y^{\dagger}}\boldsymbol\alpha^{\dagger}\mathbf{X^{\dagger}X}\boldsymbol\alpha\mathbf{y}

where α lies in the convex set

\mathcal{A}=\left\{\boldsymbol\alpha\mid C\mathbf{1}\geq\alpha\geq 0\right\}

Notice that G is concave in α and (can be made) linear in the y’s [3].   We seek a convex relaxation of this min-max problem.  And, as importantly, we want to code the final problem using an off-the-shelf SVM solver with some simple mods.   The steps are

1 .  Apply the Minimax Theorem

For details, see [4,5], although it was originally posed by John von Neumann in his work on Game Theory.

One could spend a lifetime studying von Neumann’s contributions.  Quantum mechanics.  Nuclear Physics.  Etc.

Here we scratch the surface.

The Minimax theorem lets us switch the order of the min/max bounds.

\underset{\mathbf{y\in\mathcal{B}}}{\min}\,\,\underset{\alpha\in\mathcal{A}}{\max}\, \,G(\mathbf{y,\alpha})\rightarrow\underset{\alpha\in\mathcal{A}}{\max}\,\,\underset{\mathbf{y\in\mathcal{B}}}{\min}\, \,G(\mathbf{y},\alpha)

The original problem is an upper bound to this.  That is

\underset{\mathbf{y\in\mathcal{B}}}{\min}\,\,\underset{\alpha\in\mathcal{A}}{\max}\, \,G(\mathbf{y,\alpha})\;\geqslant\;\underset{\alpha\in\mathcal{A}}{\max}\,\,\underset{\mathbf{y\in\mathcal{B}}}{\min}\, \,G(\mathbf{y},\alpha)

To solve this, we

2. dualize the inner minimization (in the space of allowable labels)

We convert the search over possible label configurations into the dual max problem, so that the label configurations become constraints.

\underset{\alpha\in\mathcal{A}}{\max}\,\,\underset{\mathbf{y\in\mathcal{B}}}{\min}\, \,G(\mathbf{y},\alpha)=\underset{\alpha\in\mathcal{A},\,\theta}{\max}\,\left\{\theta\;\vert\;G(\mathbf{y_{t}},\alpha)\geq\theta\;\;\forall\,\mathbf{y_{t}}\in\mathcal{B}\right\}

This is linear in \alpha and \theta .  In fact, it is convex.

There are an exponential number of constraints in \mathcal{B} .

Even though it is convex, we can not solve this exactly in practice.  And that’s…ok.

Not all possible labelings matter, so not all of these constraints are active (necessary) for an optimal solution.

We just need an active subset, \left\{\mathbf{y_{t}}\in\mathcal{C}\right\} , which we can find by …

The Cutting Plane, or Gomory Chvatal, method.  (Remember, if you want to sound cool, give your method a Russian name).

To proceed, we construct the Lagrangian and then solve the convex dual problem.

You may remember the method of Lagrange multipliers from freshman calculus:

3. We introduce Lagrange Multipliers for each label configuration

The Lagrangian is

\theta+\underset{\boldsymbol\mu,\mathbf{y_{t}}\in\mathcal{B}}{\sum}\mu_{t}(G(\mathbf{y_{t}},\alpha)-\theta)

When we set the derivative w.r.t. \theta to 0, we find

\sum\boldsymbol\mu_{t}=1

This lets us rewrite the TSVM as

\underset{\alpha\in\mathcal{A}}{\max}\,\,\underset{\mathbf{\mu}\in\mathcal{M}}{min}\,\,\underset{\boldsymbol\mu,\mathbf{y_{t}}\in\mathcal{B}}{\sum}\mu_{t}G(\mathbf{y_{t}},\alpha)

where the set of allowable multipliers \boldsymbol\mu lies in the simplex \mathcal{M}

\mathcal{M}=\left\{\boldsymbol\mu\mid\sum\mu_{t}=1\,\,,\mu_{t}\geq 0\right\}

The resulting optimization is convex in µ and concave in α.

This is critical as it makes a Kernel-TSVM a Multiple Kernel Learning (MKL) method.

 “WellSVM maximizes the margin by generating the most violated label vectors iteratively, and then combines them via efficient multiple kernel learning techniques”.

For linear applications, we need only consider 1 set of \boldsymbol\mu .

Now replace the inner optimization subproblem with its dual.  Of course, the dual problem is a lower bound on the optimal value. We then

4. switch the order of the min/max bounds back

to obtain a new min-max optimization–a convex relaxation of the original

\underset{\boldsymbol\mu\in\mathcal{M}}{min}\,\,\underset{\alpha\in\mathcal{A}}{\max}\,\,\underset{\boldsymbol\mu,\mathbf{y_{t}}\in\mathcal{B}}{\sum}\mu_{t}G(\mathbf{y_{t}},\alpha)

When we restrict the label configurations to  the working set \left\{\mathbf{y_{t}}\in\mathcal{C}\right\} ,we have

\underset{\boldsymbol\mu\in\mathcal{M}}{min}\,\,\underset{\alpha\in\mathcal{A}}{\max}\,\,\underset{\boldsymbol\mu,\mathbf{y_{t}}\in\mathcal{C}}{\sum}\mu_{t}G(\mathbf{y_{t}},\alpha)

Implementation Details

A key insight of WellSVM is that the core label search

\underset{\mathbf{\hat{y}}\in\mathcal{C}}{\min}\,\,G(\mathbf{\hat{y}},\alpha)

is equivalent to

\underset{\mathbf{\hat{y}}\in\mathcal{C}}{\max}\,\,\mathbf{\hat{y}^{\dagger}\boldsymbol\alpha^{\dagger}X^{\dagger}X\boldsymbol\alpha\hat{y}}

To me, this is very elegant!

We search the convex hull of the document-document density matrix, weighted by the Lagrange multipliers for the labels.

How can we solve a SVM with exponential constraints?   Take a page from old-school Joachim’s Structural SVMs [6].

Cutting Plane Algorithm for WellSVM

This is the goto-method for all Mixed Integer Linear Programming (MILP) problems. On each iteration, we

  • obtain the Lagrange Multipliers \boldsymbol\alpha  with an off-the-shelf SVM solver
  • find a violating constraint (label configuration) \mathbf{\hat{y}}

This grows the active set of label configurations \mathcal{C} , learning from previous guesses.  We expect N_{\mathcal{C}}\ll N_{\mathcal{B}}

The pseudocode is

  1.  Initialize \mathbf{\hat{y}} and \mathcal{C}=\emptyset
  2.  repeat
  3.    Update  \mathcal{C}\leftarrow\mathbf{\hat{y}}\cup\mathcal{C}
  4.    Obtain the optimal α from a dual SVM solver
  5.    Generate a violated \mathbf{\hat{y}}
  6.  until G(\boldsymbol\alpha,\mathbf{\hat{y}})>\underset{\mathbf{y}\in\mathcal{C}}{min}G(\boldsymbol\alpha,\mathbf{y})-\epsilon
Finding Violated Constraints


The cutting plane algo finds a new constraint, or cut, on each iteration, ‘chipping away’ at the problem until the inner convex hull is found.   It usually finds the most violated constraint on each iteration; however,

with SVMs we can get good results just finding any violated constraint.

Here, we seek \mathbf{y*} , a violated label assignment.  The WellSVM paper [2,3] provides a great solution.

For any violation \mathbf{y*} , and for any pair of label configurations \mathbf{\bar{y},y} , we have

\mathbf{y^{\dagger}Hy*}\neq\mathbf{y^{\dagger}H\bar{y}}

where \mathbf{H}=\boldsymbol\alpha^{\dagger}\mathbf{X^{\dagger}X}\boldsymbol\alpha

This lets us compute \mathbf{y*} in two steps:

First, compute \mathbf{\bar{y}} , by searching the current,  active, and usually small set \mathcal{C}

\mathbf{\bar{y}}=\arg\max_{\mathbf{y}\in\mathcal{C}}\,\mathbf{y^{\dagger}Hy}

Second, compute \mathbf{y*} , by searching the set of all balanced label configurations \mathcal{B}

\mathbf{y*}=\arg\max_{\mathbf{y}\in\mathcal{B}}\,\mathbf{y^{\dagger}H\bar{y}}
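
As a concrete sketch (my own illustration, not code from the WellSVM paper): for a fixed \mathbf{H} and \mathbf{\bar{y}} , the second maximization over balanced label vectors is solved exactly by putting the (+1) labels on the coordinates of \mathbf{H\bar{y}} with the largest values.  In numpy, assuming the balance constraint fixes the number of positive labels n_pos:

import numpy as np

def most_violated_label(H, y_bar, n_pos):
    # maximize y^T (H y_bar) over y in {-1,+1}^n with exactly n_pos entries equal to +1
    scores = H @ y_bar
    y = -np.ones_like(scores)
    y[np.argsort(-scores)[:n_pos]] = 1   # put +1 on the n_pos largest coordinates of H y_bar
    return y

# toy usage with a random PSD stand-in for alpha^T X^T X alpha
rng = np.random.default_rng(0)
H = rng.standard_normal((10, 10)); H = H @ H.T
y_bar = np.sign(rng.standard_normal(10))
print(most_violated_label(H, y_bar, n_pos=3))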

Original Matlab Code

We will review the existing code and run some simulations soon

Python Implementation

The goal is to modify the scikit-learn interface to liblinear to output the Lagrange multipliers \boldsymbol\alpha , and then implement WellSVM in an IPython notebook.  Stay tuned.
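
In the meantime, a hedged alternative (my suggestion, not the plan above): scikit-learn’s SVC already exposes the signed dual coefficients y_{l}\alpha_{l} , so for small problems we can recover \boldsymbol\alpha without patching liblinear:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# toy data; labels in {-1,+1}
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
y = 2 * y - 1

clf = SVC(kernel="linear", C=10).fit(X, y)

alpha = np.zeros(X.shape[0])
alpha[clf.support_] = y[clf.support_] * clf.dual_coef_.ravel()   # dual_coef_ stores y_l * alpha_l
print("support vectors:", len(clf.support_), " sum of alphas:", alpha.sum())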

References

[1]  The Semi-Supervised Learning Book (2006)

[2] Convex and Scalable Weakly Labeled SVMs , JMLR (2013)

[3] Learning from Weakly Labeled Data (2013)

[4] Kim and Boyd, A Minimax Theorem with Applications to Machine Learning, Signal Processing, and Finance, 2007

[5] Minimax theorem, game theory and Lagrange duality

[6] Cutting-Plane Training of Structural SVMs

[7] “This is an example of Stigler’s law of eponymy, which says that ‘no scientific discovery is named after its original discoverer.’ Stigler’s law of eponymy is itself an example of Stigler’s law of eponymy, since Robert Merton formulated similar ideas much earlier.”  Quote from Joe Blitzstein on Quora

[8] Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks

The Bitcoin Crash and How Nature Works

Last week, on Jan 14, 2015, Bitcoin dropped below $200 (USD).

[Figure: CoinDesk BPI chart, 2 years of Bitcoin prices, 2013-2015]

Moreover, several Bitcoin exchanges have shut down due to hacks and/or criminal activity, and key Bitcoin players are on trial. This raises the question:

 Why is Bitcoin Crashing?

Markets crash all the time.  Stock markets.  Currency Markets.   Even the Dutch Tulip market of 1637 crashed–although even this is still hotly debated.

Fraud?  Speculation?  Mania?  Lack of Regulation?  What gives?

Market Crashes: the EconPhysics Perspective

In 1996, researchers at the University of Chicago[1] and elsewhere [2] independently proposed that market crashes resemble physical crashes such as earthquakes, ruptures, and avalanches.  Such phenomena arise from long range, collective fluctuations — i.e. herding behavior — that overtakes a system as it approaches a critical point…and/or in the aftershocks.

[Figure: Discrete Scale Invariance in the S&P 500 Crash of 1987 [1]]
The theory relies upon a fairly generous application of the theory of critical phenomena and the Renormalization Group (RG) theory.  It provides both a qualitative description of the phenomena and quantitative predictions such as when a crash may occur or how long a bear market will continue.

Most notably, the strongest proponent, Didier Sornette, accurately estimated the long bear run of the Japanese markets–what he terms a Financial Anti-Bubble:

[Figure: the Nikkei anti-bubble]

He has written numerous papers, a book, and even a TED Talk.

We also introduced these ideas in a previous post:  Noisy Time Series II: Earth Quakes, Black Holes, and Machine Learning

Power Laws and Symmetry Breaking

It is well known that financial markets display non-Gaussian, long tail, power law-like behavior.  After a crash, the price series itself p(t) may follow a power law

p(t)=A+B(t-t_{c})^{\alpha}

where t_{c} is the time of the crash, t>t_{c} , and \alpha is a real exponent.

We can fit the recent Bitcoin EOD (End of Day) prices to a power law, starting at its peak on Nov 11, 2013:

[Figure: power-law fit of the Bitcoin EOD prices]

We obtain a modestly good least squares fit, with \alpha=0.32 and R^{2} = 0.77 .  Clearly there is a lot of noise–or is there?
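
For reference, a crude fit like this can be done with scipy’s curve_fit.  A minimal sketch, assuming the EOD prices are already loaded into numpy arrays (the synthetic series below is only a stand-in for the CoinDesk data):

import numpy as np
from scipy.optimize import curve_fit

def power_law(t, A, B, alpha):
    # p(t) = A + B*(t - t_c)^alpha, with t measured in days since the peak (t_c = 0)
    return A + B * np.power(t, alpha)

t = np.arange(1.0, 400.0)                     # stand-in for days since the peak
prices = 1100 - 55 * np.power(t, 0.32) + np.random.default_rng(0).normal(0, 20, t.size)

(A, B, alpha), _ = curve_fit(power_law, t, prices, p0=[1000.0, -50.0, 0.3])
print(f"A={A:.0f}, B={B:.0f}, alpha={alpha:.2f}")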

RG theory suggests that near a crash, the power law may become complex (i.e. a form of ‘symmetry breaking’).  We will now sketch how this comes about, using the Nobel-prize-winning Renormalization Group theory.

Complex Power Laws and the Renormalization Group

When examining a price series p(t) , or some other phenomena, say F(x) , a critical point occurs when we observe a rapid, sharp spike–i.e. Bitcoin’s fast rise up to Nov 29, 2013.

Here, we will find that as x\rightarrow 0

F(x)=A+B(x)^{\alpha} ,

where \alpha is a complex critical exponent

We say F(x) becomes singular near a crash when there exists a k-th derivative that becomes infinite as x\rightarrow 0 :

\lim_{x\rightarrow0}\dfrac{d^{k}F(x)}{dx^{k}}=\infty

We also know that near a critical point, physical systems exhibit scale invariant (i.e. fractal) behavior. The noise, or fluctuations, we observe on small time scales x=t-t_{c} look just like fluctuations on larger time scales \phi(x) .

We call \phi(x) the RG flow map.  The flow map describes these changes of time scale, and the RG equation describes the invariances.

The fundamental RG equation (eqn) is:

x\rightarrow\phi(x)

F(x)\rightarrow g(x)+\dfrac{1}{\mu}F(\phi(x))

We assume F(x) is continuous–which is non trivial for price series –and that \phi(x) is differentiable. g(x) is the non-singular (regular) part of F(x) .

The magic of the RG theory is that it allows us to describe the behavior near a critical point knowing only the flow map \phi(x) and the regular function g(x) .  The formal solution is an infinite order power series

F(x)=\underset{n\rightarrow\infty}{\lim}f^{n}(x) ,

f^{n}(x)=\sum_{i=1}^{n}\dfrac{1}{\mu^{i}}g(\phi^{[i]}(x))

We will sketch a very simple, leading order approximation to F(x)  that leads to a complex power law.

RG Theory and Discrete Scale Invariance

We start by first assuming power law behavior (at lowest order):

F_{0}(x)\sim x^{\alpha}

We seek the simplest RG solution.  Let us assume the RG flow map is linear in x :

\phi(x)=\lambda x+\cdots ,

where \lambda > 1 (for technical reasons to ensure the critical point is an unstable fixed point of the RG flow).

We will ignore the regular solution g(x) and write the simplified RG eqn:

F(\lambda x)=\mu F(x)

For our simple power law, we have

(\lambda x)^{\alpha}=\mu x^{\alpha}

This is easy to satisfy if

\dfrac{\lambda^{\alpha}}{\mu}=1

or

\alpha=\dfrac{log(\mu)}{log(\lambda)}

We seek the most general solution, applicable to discrete physical systems, such as earthquakes, ruptures in materials–and financial market crashes.   That is, we expect \lambda to take on discrete values:

\lambda\in[\lambda_{1},\lambda_{2},\cdots]

We call this Discrete Scale Invariance (DSI)

The most general solution of DSI is

F(x)=x^{\alpha}P(\dfrac{\log x}{\log \lambda})

where P() is a general periodic function.  As above, we express P() as the limit of an infinite power series

P_{N}(x)=\sum_{n=1}^{N}c_{n}\exp\left[2n\pi i\left(\dfrac{\log x}{\log\lambda}\right)\right]

This is a classic Weierstrass function–a pathological function that, in the infinite limit, is continuous everywhere and yet differentiable nowhere.  It is used to model fractals, natural systems that are scale invariant, and highly discontinuous but structured time series.

In a crash scenario, we expect that the true power series for P(x) will diverge; i.e. the k-th derivative explodes.  So here, we only take the leading term of our linearized model

exp\left[2n\pi i\left(\dfrac{\log x}{\log\lambda}\right)\right]

Note that it takes the form

F(x)=F_{0}(x)p(\log F_{0}(x))

so F_{0}(x), simple power law behavior, is like the regular part of the solution.

We now see that we have a complex critical exponent \alpha :

\dfrac{\lambda^{\alpha}}{\mu}=e^{2\pi i n}

or

\alpha=\dfrac{log(\mu)}{log(\lambda)}+i\dfrac{2\pi n}{log(\lambda)}

We consider the real part of the complex power law

Re[x^{\alpha+i\beta}]=x^{\alpha}\cos(\beta\log x)

which gives the following log periodic formula for the price series

p(t)=A+B(t-t_{c})^{\alpha}(1+C\cos(\omega\log(t-t_{c})+\phi))

Instead of fluctuating randomly around the underlying power law drift, the price series exhibits log-periodic oscillations–the DSI signature of a crash.

This model appears to work for 1-2 oscillations before (or after) the crash.  To model the longer time anti-bubble behavior, Sornette has developed an extended formula, based on the third-order Landau expansion (as opposed to, say, additional terms in the power series defined above).  We leave these details and further analysis for later.
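
For completeness, the log-periodic form is easy to write down as a model function.  A sketch only; the parameter values below are illustrative, and fitting t_{c},\omega,\phi robustly requires the more advanced methods noted below:

import numpy as np

def log_periodic(t, t_c, A, B, alpha, C, omega, phi):
    # p(t) = A + B*(t - t_c)^alpha * (1 + C*cos(omega*log(t - t_c) + phi)), valid for t > t_c
    dt = t - t_c
    return A + B * np.power(dt, alpha) * (1 + C * np.cos(omega * np.log(dt) + phi))

# illustrative evaluation on a few points past a (hypothetical) crash time t_c = 0
t = np.linspace(1.0, 400.0, 5)
print(log_periodic(t, t_c=0.0, A=1100, B=-55, alpha=0.32, C=0.1, omega=6.0, phi=0.0))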

DSI Fit of the Bitcoin Crash

We now fit the latest Bitcoin behavior, assuming the crash / Bear market started on Nov 29, 2013 :

[Figure: log-periodic (DSI) fit of the Bitcoin EOD prices]

We readily find a DSI fit, with R^2=0.89 .

It is actually quite difficult to get a very tight fit, and usually advanced methods are needed. This is a simple, crude fit done to illustrate the basic ideas.

We see that Bitcoin, like other financial markets, displays log periodic behavior, characteristic of a self-organized crash.

 These kinds of crashes are not caused by external events or bad players–they are endemic to all markets and result from the cooperative actions of all participants.

We can try to detect them, perhaps even profit off them, but it is unclear how to avoid these Self-Organized Critical points that wreak havoc on our finances.

Conclusion: How Nature Works

Bitcoin is an attempt to create a new kind of currency–a CryptoCurrency–that is free from both institutional control and individual corruption.  It is hoped that we can avoid the kind of devastating inflation, devaluation, and crashes that occur all too frequently in our current worldwide banking systems.  The promise is that Bitcoin, backed by the Blockchain, can remove our dependence on a single, faulty institution and replace it with a decentralized, distributed form of trust.  But

“Real life operates at a critical point between order and chaos”

Our analysis here shows that even the Bitcoin economy still appears to behave like a traditional market–prone to the kinds of crashes that frequently arise in all natural self organized systems.

This Self-Organized Criticality is perhaps just how nature works [6,7].

And while Bitcoin is certainly interesting and exciting, perhaps it too is subject to the natural laws of physics.

References

[1] 1995  James A. Feigenbaum and Peter G.O. Freund,  Discrete Scaling in Stock Markets Before Crashes

[2] TED Talk: How We Can Predict the Next Financial Crisis

[3] What will upend the economy next? Economist Didier Sornette explains

[4] D. Sornette, Critical market crashes

[5] 1999 Predicting Financial Crises using Discrete Scale Invariance

[6] 1999 A. Johansen and  D. Sornette, Financial “Anti-Bubbles”: Log-Periodicity in Gold and Nikkei collapses

[7] Anders Johansen and Didier Sornette, Evaluation of the quantitative prediction of a trend reversal on the Japanese stock market in 1999 , 2000

[8] 2003 Wei-Xing Zhou and  Didier Sornette  Renormalization group analysis of the 2000–2002 anti-bubble in the US S&P500 index: explanation of the hierarchy of five crashes and prediction 

[9] 2012 Didier Sornette, Ryan Woodard, Wanfeng Yan, and Wei-Xing Zhou,  Clarifications to Questions and Criticisms on the Johansen-Ledoit-Sornette Bubble Model

[10] http://guava.physics.uiuc.edu/~nigel/courses/563/essay/lin.ps.gz

[11] Per Bak,  “How Nature Works: The Science of Self-Organized Criticality”,  1996

SVM+ / LUPI: Learning Using Privileged Information

Recently Facebook hired Vapnik, the father of the Support Vector Machine (SVM) and of Vapnik-Chervonenkis Statistical Learning Theory.

Facebook Hires Vapnik

Less well known, Vapnik has also pioneered methods of Transductive and Universum Learning.  Most recently, however, he has developed the theory of Learning Using Privileged Information (LUPI), also known as the SVM+ method.

And this has generated a lot of interest in the community to develop numerical implementations.

In this post, we examine this most recent contribution.

Preface

Once again, we consider the problem of trying to learn a good classifier having a small set of labeled examples x_{l}\in L .

Here, we assume we have multiple views of the data; that is, different, statistically independent feature sets \phi(x),\phi^{*}(x) , that describe the data.

For example, we might have a set of breast cancer images, and some holistic, textual description provided by a doctor.  Or, we might want to classify web pages, using both the text and the hyperlinks.

Multi-View Learning is a big topic in machine learning, and includes methods like SemiSupervised CoTraining, Multiple Kernel Learning, Weighted SVMs, etc.  We might even consider Ensemble methods as a kind of Multi-View Learning.

Having any extra, independent info x^{*} about our data is very powerful.

In LUPI, also called the SVM+ method, we consider the situation where we only have extra information x^{*}_{l}  about the labeled examples x_{l}\in L .

Vapnik calls \{x^{*}\} Privileged Information  — and shows us how to use it

The VC Bounds

If we are going to discuss Vapnik we need to mention the VC theory.  Here, we note an important result about soft and hard-margin SVMs:  when the training set is non-separable, the VC bounds scale as \dfrac{h}{\sqrt{L}} :

P_{test}\;\le\;\nu_{train}+O\left(\dfrac{VCdim}{\sqrt{L}}\right)

where \nu_{train}  is the training error, and h is the VC dimension

But when the training set is separable,  i.e. \nu_{train}=0 , the VC bounds scale as \dfrac{h}{L}

P_{test}\;\le\;O\left(\dfrac{VCdim}{L}\right)

This means we can use say L=300 labeled examples as opposed to L=900,000, which is a huge difference.

It turns out we can also achieve \dfrac{h}{L} scaling–if we have an Oracle that tells us, a priori, the slacks.  Hence the term Oracle SVM.    The Oracle tells us how much confidence we have in each label.

So if we can get the slacks \{\varepsilon_{l}\} , or, equivalently, the weights, for each instance, L can be much smaller.

[Of course, this is only true if we can sample all the relevant features.]

Soft Margin SVM

In practice, we usually assume data is noisy and non-separable, so we use a soft-margin SVM where we can adjust the amount of slack with the C (cost) parameter.

Note that we also have the bias b , but we can usually set it to zero (still, be careful doing this!).

Let us write the soft-margin SVM optimization as

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+C\sum_{l\in L}\varepsilon_{l}\right)

subject to the constraints

y_{l}\left(\langle w,x_{l}\rangle+b\right)\ge 1-\varepsilon_{l}\;\;\forall\;l\in L,\;\;\varepsilon_{l}\ge 0

Notice that while we formally add slack variables \{\varepsilon_{l}\} to the max-margin optimization, they eventually vanish.  That is, we don’t estimate the slacks; we replace them with the maximum margin violation, or the Hinge Loss error

\hat{err}_{l}=\max[1-y_{l}(\langle w,x_{l}\rangle+b),0]

We then solve the unconstrained soft-margin SVM optimization, which is a convex upper bound to the problem stated above–as explained in this nice presentation on LibLinear.

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+C\sum_{l\in L}\max[1-y_{l}(\langle w,x_{l}\rangle+b),0]\right)

or, in simpler terms

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+C\sum_{l\in L}\hat{err_{l}}\right)

And this works great–when we have lots of labeled data L.
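
As a quick sanity check, the unconstrained objective is a one-liner in numpy (a toy sketch with random data, not tied to any of the data sets discussed here):

import numpy as np

def soft_margin_objective(w, b, X, y, C):
    hinge = np.maximum(1 - y * (X @ w + b), 0.0)      # per-example hinge loss err_l
    return 0.5 * np.dot(w, w) + C * hinge.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(50))  # toy labels in {-1,+1}
print(soft_margin_objective(np.zeros(5), 0.0, X, y, C=1.0))   # equals C * 50 at w=0, b=0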

What if we could actually estimate or assign all the L slack variables \{\varepsilon_{l}\} ?  In principle, we could then use far fewer labeled examples L.  This is the promise of LUPI.

LUPI/SVM+

In LUPI,  we use the privileged information \{x^{*}_{l}\} to learn the slack variables \{\varepsilon_{l}\} :

\varepsilon_{l}=\langle w^{*},x_{l}^{*}\rangle+b^{*}

We model the slack as linear functions of the privileged information.

Effectively, the privileged information provides a measure of confidence for each labeled instance.

To avoid overtraining, we apply a max-margin approach.  This leads to an extended, SVM+ optimization problem, with only 2 adjustable regularization parameters C,\gamma (and 2 bias params):

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+\dfrac{\gamma}{2}\langle w^{*},w^{*}\rangle+C\sum_{l\in L}[\langle w^{*},x_{l}^{*}\rangle+b^{*} ]\right)

subject to the constraints

y_{l}\left(\langle w,x_{l}\rangle+b\right)\ge 1-[\langle w^{*},x_{l}^{*}\rangle+b^{*}],

\langle w^{*},x_{l}^{*}\rangle+b^{*}\ge 0\;\;\forall\;l\in L

Show me the code?

The SVM+/LUPI problem is a convex (quadratic) optimization problem with linear constraints–although it does require a custom solver.  The current approach is to solve the dual problem using a variant of SMO.  There are a couple of versions (below) in Matlab and Lua.  The Matlab code includes the related SVM+MTL (SVM+ MultiTask Learning).  I am unaware of an open source C or Python implementation similar to Svmlin or Universvm.
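
For small problems, though, the primal can be written directly in a generic convex solver.  A minimal sketch using cvxpy (my own illustration; X, Xstar, y are hypothetical dense numpy arrays, and this will not scale like the SMO variants):

import numpy as np
import cvxpy as cp

def svm_plus_primal(X, Xstar, y, C=1.0, gamma=1.0):
    # slacks are modeled as a linear function of the privileged features Xstar
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    ws, bs = cp.Variable(Xstar.shape[1]), cp.Variable()
    slack = Xstar @ ws + bs
    margin = cp.multiply(y, X @ w + b)
    objective = cp.Minimize(0.5 * cp.sum_squares(w)
                            + 0.5 * gamma * cp.sum_squares(ws)
                            + C * cp.sum(slack))
    cp.Problem(objective, [margin >= 1 - slack, slack >= 0]).solve()
    return w.value, b.value

# usage (hypothetical): X are the features, Xstar the privileged features, y in {-1,+1}
# w, b = svm_plus_primal(X, Xstar, y, C=10, gamma=1)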

To overcome the solver complexity,  other methods, such as the Learning to Rank method, have been proposed.  We will keep an eye out for useful open source platforms that everyone can benefit from.

Summary

The SVM+/LUPI formalism demonstrates a critical point about modern machine learning research.  If we have multiple views of our data,  extra information available during training (or maybe during testing), or just several large Hilbert spaces of features, then we can frequently do better than just dumping everything into an SVM, holding our nose, and turning the crank.

References:

or how many ways can we name the same method?

Learning with Hidden Information

The SVM+ method: MIT Reading Notes

Learning with Non-Trivial Teacher

Learning to Rank Using Privileged Information

On the Theory of Learning with Privileged Information

SMO-style algorithms for learning using privileged information

Learning Using Privileged Information: SVM+ and Weighted SVM

Learning with Asymmetric Information

Comparison of SVM+ and related methods (SVM+MTL, rMTL, SVM+ regression, etc.)

Proof and Implementation of Algorithmic Realization of Learning Using Privileged Information (LUPI) Paradigm: SVM+

Videos

Learning with Privileged Information

Simmons Foundation Lecture

Software

Matlab and CVX Versions

Milde LUA Code

Machine Learning with Missing Labels Part 3: Experiments

In this series of posts we look at Transductive and SemiSupervised Learning–an old problem, a hard problem, and a fundamental problem in Machine Learning.  Unlike Deep Learning or large scale ML,

We want to learn as much as we can from as little labeled data as possible.

We focus on text classification.  Why?

  1. The feature space is simple — it’s just words.
  2. With the true labels, a simple SVM gives near perfect accuracy.
  3. Text satisfies the cluster assumption.

Why isn’t this done all the time?

We can generate good (not great) Semi Supervised text classifiers.   But

  1. creating the input data is hard.
  2. setting the input parameters is harder.
  3. we can’t always distinguish the good solutions from the bad.

The 1st post looked at (1); here we try to understand (2&3), and look for simple, practical heuristics to make these old methods useful.

We can think of the TSVM/S3VM as generating several potentially good labelings on the unlabeled data. To solve (3), we need to find the best labeling–which means judging the label and/or classifier quality, such as the label entropy , cluster overlap, etc.

Or, alternatively, we might try to average the good solutions, similar to the way Hinton et al. have recently suggested averaging an ensemble of models to take advantage of Dark Knowledge.

Thanks to Matt Wescott for pointing out this PDF and discussions on it.

To explore this, we run experiments using some old, existing, open source Transductive SVM (TSVM) and SemiSupervised SVM (S3VM) codes:

Our work is guided by  the results in this 2014 paper on the QN-S3VM method, as applied to text data.

We thank Fabian Gieseke for help in reproducing and understanding these results.

For the TSVM, some available codes and algorithms are

  • SvmLight:  Graph Cutting Algo
  • SvmLin:  MultiSwitch (MS-TSVM) and Deterministic Annealing (DA-TSVM)
  • UniverSvm: Convex Concave Continuation Procedure (CCCP)
  • QN-S3VM: Quasi Newton SemiSupervised SVM

We can also compare the results to an SVM baseline, using either LibLinear or scikit-learn.

All these codes are Unix standalone C programs except QN-S3VM and Scikit-Learn, which are Python.

We examine SvmLin and its MultiSwitch (MS-TSVM) method.

SvmLin is the TSVM benchmark for most studies.  After we work this up, we will move on to the other codes listed. I also would like to test other TSVM/S3VM methods, such as S3VM-MIP, S4VM, S3VMpath, and many others, but many are Matlab codes and I don’t have open source C or Python code readily available.  Or really the time :P

We do not consider cases of weakly-supervised learning, where the labels may be missing and/or have errors… but let me point out a 2013 paper on the WellSVM method.  It also compares the above methods and is worth a read.

I hope this post gives some flavor to what is a very complicated and fundamental problem in machine learning.  (And maybe take some ideas from Deep Learning to help solve the TSVM problem as well)

Methods: The real-sim Data Sets:

We care about text classification. To this end, we reproduce the performance of SvmLin, and, later,  other methods, on the real-sim text dataset (listed in Table 2 in the paper).  The real-sim dataset was used in the original SvmLin paper; it is an accepted baseline for these methods.

This data set consists of  72309 labeled documents: 22238 (+1) and 50071 (-1).

The fraction of positively labeled documents R^{+}_{true}=30.75% .

Creating the Labeled, Unlabeled, and Holdout Sets:

I have created an IPython Notebook (make_realsim_inputs.ipynb) which can read the real-sim data and generate the input files for the various C programs listed above: SvmLight, SvmLin, & UniverSvm.

We can see it in the NBViewer here

The real-sim dataset is split in half, and 3 data sets are created.   Half is used to train TSVM (labeled L and unlabeled U); the rest is a holdout set (HO) for testing.

Following Table 2, the notebook generates:

  • labeled (L) of sizes roughly: l=90, 180, 361, 1446, & 2892
  • an unlabeled set (U) of size u=36154-l.
  • a test or HoldOut set HO (of the remaining data)

Each data set (L,U,HO) also has a ~0.31 fraction of positively labeled examples; U and HO are statistical replicas of L, albeit larger.
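
A minimal sketch of the kind of split the notebook performs (the file name and split code here are illustrative; the actual bookkeeping lives in make_realsim_inputs.ipynb):

from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split

X, y = load_svmlight_file("real-sim")                # 72309 documents, labels in {-1,+1}

# half for training (L + U), half for the holdout set HO, preserving the class fractions
X_train, X_HO, y_train, y_HO = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)

# a small labeled set L of size l; everything else in the training half becomes U
l = 90
X_L, X_U, y_L, y_U = train_test_split(X_train, y_train, train_size=l, stratify=y_train, random_state=0)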

The input files consist of a training file (or 2 files for SvmLin) and 3 test files (6 forSvmLin): labeled L, unlabeled U, and holdout HO.   They are in the SvmLight format.

I.e. the SvmLin input files, for l=90,  are:

filename                          size (lines)    purpose

svmlin.train.90.examples              36154       labeled + unlabeled; train the TSVM
svmlin.train.90.labels                36154

svmlin.testL.90.examples                 90       train the labeled baseline
svmlin.testL.90.labels                   90

svmlin.testU.90.examples              36154       evaluate the unlabeled set
svmlin.testU.90.labels                36154

svmlin.testHO.90.examples             36155       evaluate the holdout set
svmlin.testHO.90.labels               36155

 

Grid Searches with gnu parallel

Astoundingly, rather than performing a full grid search, many research papers fix the regularization parameters, guess them using some crude Bayesian scheme, or attempt to transfer them from some other data set. This makes it really hard to evaluate these methods.

We use a ruby script and the gnu_parallel program to run the command line unix programs and grid search the parameters. The script also computes the final HoldOut accuracy and metrics such as the margin and label entropy.

svmlin.rb

Gnu parallel lets us easily grid search the regularization parameters by running the TSVM codes in parallel.

We do a broad grid search

W\in[.0001,0.001,0.01,0.1,1,10,100,1000,10000]

U\in[.0001,0.001,0.01,0.1,1,10,100,1000,10000]

R\in[0.25, 0.26, 0.27,\cdots,0.35,0.36,0.37]

by creating a directory for each run

 

parallel "mkdir A2W{1}U{2}R{R}" ::: .0001 0.001 0.01 0.1 1 ::: 1 10 100 1000 1000 ::: 0.28 0.29 0.31 0.32 0.33 0.34

and then running SvmLin in each directory, simultaneously

 

parallel "cd A2W{1}U{2}R{R}; $SVMLIN -A 2 -W {1} -U {2} -R {3} ../svmlin.train.examples.90 ../svmlin.train.labels.90 > /dev/null"  ::: 0.0001 0.001 0.01 0.1 1  :::  1 10 100 1000 10000 ::: 0.28 0.29 0.31 0.32 0.33 0.34

We can then evaluate each run on the U or HO data set.  We will find the optimal W and U are in the range of the original crude estimates W=0.001, U=1, and R=0.31.

How can we evaluate the accuracy of our TSVM, here, and in a real world setting?

Ground Truth

Since we know the labels on both the U and the HO sets , these are a ground truth from which we can evaluate how well the best model(s) perform.  We can get an upper bound on the best possible Generalization Accuracy (on the HO set) by training an SVM on L+U using the true labels, and then applying this model to HO.

The best expected HO accuracy here is ~ 97 %

Also, note this is different from the Reconstruction Accuracy on HO, which is > 99 %.

 We might also obtain a better generalization accuracy with different features, such as applying GloVe or even unsupervised pre-training on L+U.  We will examine this in a future blog.   We are, in particular, interested in comparing the Deep Learning Auto-Encoders with Convex NMF, including recent variants applied to document clustering.

Baseline

We need a baseline for comparison.  The IPython Notebook computes a baseline accuracy for the labeled data set (L); this can also be done with LibLinear or even SvmLin (using -A 1).

We then run the small L model on the HoldOut (HO) set for each l=90,180,…,  yielding baseline accuracies of

split size    size of L (l)    HO Accuracy
0.0025              90             85 %
0.005              180             87.5 %
0.01               361             90 %
0.04              1446             93 %
0.08              2892             94 %

Please note that these are computed from 1 random sample, and may be slightly different (by ~1%) for each run. Also, note that the command line and scikit-learn liblinear have different defaults; we use C=10, fit_intercept=False.
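
For reference, the baseline can also be reproduced with a few lines of scikit-learn (a sketch, assuming the svmlight-format files from the Appendix are on disk):

from sklearn.datasets import load_svmlight_file
from sklearn.svm import LinearSVC

X_HO, y_HO = load_svmlight_file("svmlight.testHO.90")                          # holdout set HO
X_L, y_L = load_svmlight_file("svmlight.testL.90", n_features=X_HO.shape[1])   # small labeled set L

clf = LinearSVC(C=10, fit_intercept=False).fit(X_L, y_L)                       # settings noted above
print("HO accuracy:", clf.score(X_HO, y_HO))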

Finding the Best Accuracy:

We are generating a large number of nearly equivalent models, parameterized by

[\Gamma_{L}(U,W,R^{+})]

The inputs are related to the objective function in blog post 1.

W=\gamma , U=\gamma' ,  and r=0.31

Every useful model will have the same reconstruction accuracy on the labeled examples L, and every model proposes a different soft labeling for the unlabeled examples U.

How can we compare different models \Gamma_{L}(U,W,R^{+}) ?

Essentially, we use a margin method to guess good labelings of the data, but we need an unsupervised heuristic to select the optimal labeling.   Can we simply apply Cross-Validation to the small labeled set L?  Or Leave One Out CV (LOO-CV) ?  What filters can we apply to speed up the calculations ?

Best Accuracy on the Labeled Set

SvmLin may produce inferior or even nonsense models.  But more subtly, some models may even have a very high training accuracy (on U), and a very high test accuracy (on HO), but a very low accuracy on L.  These are not useful because in the real world, we don’t know the labels on U or HO.

We always want to first filter the possible  solutions by selecting those with the best accuracy on the L.

Filter by Final Objective Function  (FOF)

Since we are minimizing the objective function, we only consider solutions with FOF < 0.01.  Note this is very different than simply assuming the best solutions has absolute minimum FOF across all input parameters.

 We discard all solutions with Final Objective Function (FOF) > 0.01

For l=180, for example, this reduces the set of possible solutions from 1430 to 143.

Maximum Margin Solution…NOT

Wait, can’t we just take the solution with the maximum margin (or 1/norm)

m=\dfrac{1}{\sum_{i}w_{i}^{2}}

No. Think about how we practice supervised learning; we train a model, and set the regularization parameters by cross-validation.  In the TSVM and SSL methods, we can also apply CV (or LOO-CV), but only to the labeled set L.

The maximum margin solution is only best for a given set of input / regularization parameters (U,W,R). Every model \Gamma_{L}(U,W,R^{+}) induces a different labeling on U  (that is, they form an equivalence class on L, not L+U).    In fact, the best overall solution has the minimum margin amongst a broad scan of (U,W).


where the output weights w_{i} are taken from

svmlin.training.examples.90.weights

Optimal Parameters for the HoldOut Test Accuracy

(this section is still under construction)

Cross Validation and Leave One Out on the Labeled Data

As with a standard SVM, one can try to apply Cross Validation (CV) to the labeled set L to select the best TSVM model.  Most academic studies run 5-fold CV on the L set.  This is harder, however, because

  1. when L is small, the i.i.d. assumption fails, so CV may give spurious results
  2. we really should use Leave One Out Cross Validation (LOO-CV), but this increases the run time significantly
  3. the SvmLin code does not include CV or LOO-CV
  4. the SvmLin DA-TSVM is unstable and I had to switch to MS-TSVM to complete this
  5. my laptop can’t take any more overclocking so I had to pause this for a bit

I will finish this section once I either

  1. modify SvmLin to run CV / LOO-CV
  2. modify and use UniverSvm and the CCCP method, which is an order of magnitude faster than TSVM
  3. and/or get this all working in IPython+Star Cluster so I can run LOO-CV at scale
Minimum Entropy of the Soft Labels

The SvmLin DA algo minimizes the entropy on the unlabeled examples S_{u} (it is the key stopping criterion, sec 3.2.3).  Perhaps the model with minimum S_{u} generalizes best?

This is the essence of Entropy Regularization--a common component of many Semi Supervised Learning formulations.

We can evaluate the proposed soft labels \mu_{u} of the unlabeled set U, found in

svmlin.testU.examples.90.outputs

We convert soft labels to a probability p_{u} :

p_{u}=\dfrac{\dfrac{1}{1+\exp(-\mu_{u})}}{\sum\dfrac{1}{1+\exp(-\mu_{u})}}

and compute the Entropy on U

S_{u}=-\sum_{u}(p_{u}ln(p_{u})+(1-p_{u})ln(1-p_{u}))
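
In code, following the two formulas above exactly as written (the output file is the one listed above):

import numpy as np

mu = np.loadtxt("svmlin.testU.examples.90.outputs")     # soft outputs mu_u on U
s = 1.0 / (1.0 + np.exp(-mu))
p = s / s.sum()                                          # normalized as in the formula above
eps = 1e-12                                              # guard against log(0)
S_u = -np.sum(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
print("label entropy S_u =", S_u)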

We hope that TSVM solutions with minimum S_{u} will consistently yield a very good Generalization Accuracy (on the HO set) across a grid of search (U,W,R)–and preliminary results confirm this.

Let’s look at the l=90 case, and first plot the HO accuracy vs the Label Entropy S_u .

[Figure: HO accuracy vs. label entropy S_u for l=90]

(Recall we only take solutions with FOF < 0.01 and the best training accuracy on the true labeled set.)

We see 3-4 distinct sets of solutions with nearly equivalent entropy across a wide range of HO accuracy, and 2 classes include improved solutions (HO Accuracy > baseline 85%)

We call these Equivalence Classes under the Label Entropy.

Other solutions also show that the minimum Entropy solution is the best solution–for a fixed R^{+}_{in} :

[Figures: HO accuracy vs. label entropy for fixed R^{+}_{in} ]

We select the minimum Entropy class of solutions.

Notice that the l=90 case is greatly improved above the 85% baseline, whereas l=1446 shows only a slight (0.5%) improvement–and 2-3% less than the best expected accuracy of 97%. We see the same results in the QN-S3VM and WellSVM papers, which suggests that

Transductive and SSL Learning may be limited to small labelled sets

where the percentage of labeled data is L\sim 10-15%  of U.

Predicted Fraction of Positive Examples

Recall we have to input an estimate R^{+}_{in} of the fraction of positive labels on U, R^{+}_{U} .

We assume we estimate R^{+}_{U} to within 10% using about 100 examples (l\sim O(100))

We would prefer a high quality estimate if possible because the final, predicted \hat{R}^{+}_{U}  and \hat{R}^{+}_{HO}  appears to depend strongly on R^{+}_{in} .

Here we test varying the input R^{+}_{in} .  We expect the predicted \hat{R}^{+}_{HO} to be correlated with the generalization accuracies, and preliminary results also confirm this (note that all plots show the predicted fraction \hat{R}^{+}_{HO} ):

[Figure: HO accuracy vs. predicted fraction \hat{R}^{+}_{HO} ]

The predicted fraction \hat{R}^{+}_{HO}  also forms Equivalence Classes \Gamma_{L,S}(R^{+}) under the Entropy S_{U} .

[Figure: predicted fraction \hat{R}^{+}_{HO} vs. label entropy S_{U} ]

Of course, we know the true R^{+}_{HO}=0.31 , so the minimum Entropy solution is not the absolute best solution.

Perhaps a more generalized form of the Entropy would be more discerning?   We would prefer a python toolbox like this one and need a way to fit the parameters.

We might also try to improve the TSVM search algorithm, as in the S4VM: Safe Semi-Supervised SVM methods.  The S4VM tries to improve upon the SvmLin DA-TSM algo using Simulated Annealing, and then selects the best solutions by evaluating the final cluster quality.  This looks promising for simple applications.  Indeed, one might even try to implement it in python using the recent basinhopping library.

Transduction and the VC Theory

We have said that the VC Theory is really a theory about Transductive Learning.

We see now that to apply the TSVM in practice, the unlabeled data set U really needs to be an accurate statistical replica, as in the VC Statistical Learning Theory for Transduction.

We have tested cases where we have a good replica, but we did not estimate R^{+}_{U} well; we have not yet tested cases where U is NOT a good replica and we can only estimate R^{+}_{U} partially.  This needs to be done.

In the literature this is called using class proportions, and some recent estimation methods have been proposed (and here)–although the QN-S3VM papers do not attempt this.  Recently Belkin has shown how to estimate this fraction using traditional integral methods.

The TSVM accuracy may depend on how well one can estimate the fraction (r) of positively labeled examples.

Also, we presume that TSVM models overtrain, and are biased towards either the (+) or (-) class.


For l=1446, we find a single class of minimum Entropy solutions

[Figure: the minimum entropy class of solutions for l=1446]

with S\approxeq-5.467 , W=0.001, U=1 and \hat{R}^{+}_{HO}\in[0.25,0.37] .

They all have a high HO Accuracy–but may be biased towards choosing (+) labels because the minimum S solution has \hat{R}^{+}_{HO}=0.36 , not 0.31.

If we can’t estimate R^{+}_{U} well, we may be able to average the minimum Entropy solutions, thereby canceling out the biases in the over-trained models (i.e. average all \hat{R}^{+}_{U}\in[0.28\cdots 0.34] TSVM models).

 This is what Dark Knowledge is all about–creating a simpler model by ensemble averaging–and we hope these recent advances in Deep Learning can help with Transductive Learning as well.
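
A hedged sketch of that averaging step (the per-run output file names are hypothetical; one file per minimum-Entropy model at a different R^{+}_{in} ):

import numpy as np

output_files = ["svmlin.testU.examples.90.outputs.R0.28",
                "svmlin.testU.examples.90.outputs.R0.31",
                "svmlin.testU.examples.90.outputs.R0.34"]

probs = [1.0 / (1.0 + np.exp(-np.loadtxt(f))) for f in output_files]   # per-run P(y=+1)
p_avg = np.mean(probs, axis=0)
y_ensemble = np.where(p_avg >= 0.5, 1, -1)                             # averaged hard labels on U
print("predicted positive fraction:", (y_ensemble == 1).mean())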

It is noted that the recent WellSVM method also addresses this problem.  WellSVM uses a convex relaxation of the TSVM problem to obtain a min max optimization over the set of possible models allowed within a range of R^{+}_{U}   ; this is quite clever!

In a future blog, we will examine how the current TSVMs perform when tuning their parameters to obtain the best predicted fraction. If this is sound, in a future post, we will examine how to estimate the class proportions for text data.

Discussion

to come soon..stay tuned

Appendix

Each TSVM is trained with a linear kernel.  The regularization parameters are adjusted to give both optimal performance on the initial training set (L), and the best reconstruction accuracy on the (U) set.

Each program requires slightly different command line options for these.  There are also tuning parameters for each solver, and it is necessary to set the tolerances low enough so the solvers can converge.  For example, even for a linear SVM, the primal solvers may not converge as rapidly as the dual solver, and can even give different results on the same data set (despite Von Neumann’s minimax theorem). And the default for scikit-learn LinearSVC has the Bias on, whereas LibLinear has the Bias off.  So be careful.

Below we show examples of training each the TSVMs on the labeled set (L), and evaluating the accuracy on a HoldOut set (HO), for l=90.  We show typical values of the regularization parameters, but these do need to be tuned separately.  We also show how to run the popular Liblinear program as a baseline (which uses the svmlight format); the paper uses LibSVM.

Baseline Linear SVM:

liblinear  -C 0.1  -v 5  svmlight.testL.90

MS-SvmLin:

svmlin  -A 2  -W 0.001 -U 1 -R 0.31   svmlin.train.examples.90 svmlight.train.labels.90

svmlin  -f   svmlin.train.examples.90.weights svmlight.testHO.examples.90 svmlight.testHO.labels.90

SvmLin DA:

svmlin  -A 3  -W 0.001 -U 1 -R 0.31    svmlin.train.examples.90 svmlight.train.labels.90

svmlin  -f   svmlin.train.examples.90.weights svmlin.testHO.examples.90 svmlin.testHO.labels.90

SvmLight Graph Cuts:

svm_learn -p 0.31 svmlight.train.90  svmlight.model.90

svm_classify  -C 100  svmlight.testHO.90  svmlight.model.90   svmlight.outHO.90

UniverSvm CCCP TSVM w/ramp loss:

universvm -C 1 -c 1 -s -0.5 -z 1 -o 1 -T universvm.testHO.90 universvm.train.90