Why does Deep Learning work?

This is the big question on everyone’s mind these days.  C’mon we all know the answer already:

“the long-term behavior of certain neural network models are governed by the statistical mechanics of infinite-range Ising spin-glass Hamiltonians” [1]   In other words, 

Multilayer Neural Networks are just Spin Glasses

Ok, so what is this and what does it imply?

In a recent paper, LeCun and coworkers attempt to extend our understanding of training neural networks by studying the SGD approach to solving the multilayer Neural Network optimization problem [1].  Furthermore, they claim:

“None of these works however make the attempt to explain the paradigm of optimizing the highly non-convex neural network objective function through the prism of spin-glass theory and thus in this respect our approach is very novel.”

And this is kinda true.

But here’s the thing…we already have a good idea of what the Energy Landscapes of multiscale spin glass models* look like–from early theoretical protein folding work (by Wolynes, Dill, etc. [2,3,4]).  In fact, here is a typical surface:

Energy Landscape of MultiScale Spin Glass Model

*[technically these are Ising spin models with multi-spin interactions]

Let us consider the nodes, which above represent partially folded states, as nodes in a multiscale spin glass–or, say, a multilayer neural network.  Immediately we see the analogy and the appearance of the ‘Energy funnel’.  In fact, researchers studied these ‘folding funnels’ of spin glass models over 20 years ago [2,3,4].

3D Energy Landscape of the Folding Funnel of a Spin Glass

Note: the Wolynes protein-folding spin-glass model is significantly different from the p-spin Hopfield model that LeCun discusses because it contains multi-scale, multi-spin interactions.  These details matter.

So with a surface like this, it is not so surprising that an SGD method might be able to find the Energy minima (called the Native State in protein folding theory).  We just need to go down, and jump around enough to get over the local saddle points.  This, in fact, defines a so-called ‘folding funnel’.  Indeed, such a surface, at the time, seemed to be necessary to resolve Levinthal’s paradox [4].

So it is not surprising at all, in fact, that SGD may work.

Then again, a real theory of protein folding, which would actually be able to fold a protein correctly (i.e. Freed’s approach [5]), would be a lot more detailed than a simple spin glass model.  Likewise, real Deep Learning systems are going to have a lot more engineering details (Dropout, Pooling, Momentum) than a theoretical spin glass model.  Still, hopefully we can learn something by looking at the spin glass models from theoretical chemistry and physics.  So…

In this next series of posts, we are going to look in some depth at LeCun’s recent analysis, at some fundamentals of the condensed matter theory of spin glasses, and at just what is going on in current Deep Learning research and the applied methods.

[1] LeCun et al., The Loss Surfaces of Multilayer Networks, 2015
[2] Spin glasses and the statistical mechanics of protein folding, PNAS, 1987
[3] Theory of Protein Folding: The Energy Landscape Perspective, Annu. Rev. Phys. Chem., 1997
[4] From Levinthal to pathways to funnels, Nature, 1997
[5] Mimicking the folding pathway to improve homology-free protein structure prediction, 2008

The Bitcoin Crash and How Nature Works

Last week, on Jan 14, 2015, Bitcoin dropped below $200 (USD).

Two years of Bitcoin prices, 2013-2015 (CoinDesk BPI chart)

Moreover, several Bitcoin exchanges have shut down due to hacks and/or criminal activity, and key Bitcoin players are on trial.  This raises the question:

Why is Bitcoin Crashing?

Markets crash all the time.  Stock markets.  Currency markets.  Even the Dutch Tulip market of 1637 crashed–although this is still hotly debated.

Fraud?  Speculation?  Mania?  Lack of Regulation?  What gives?

Market Crashes: the EconPhysics Perspective

In 1996, researchers at the University of Chicago [1] and elsewhere [2] independently proposed that market crashes resemble physical crashes such as earthquakes, ruptures, and avalanches.  Such phenomena arise from long range, collective fluctuations–i.e. herding behavior–that overtake a system as it approaches a critical point…and/or in the aftershocks.

Discrete Scale Invariance in the SP500 Crash of 1987 [1]
The theory relies upon a fairly generous application of the theory of critical phenomena and the Renormalization Group (RG) theory.  It provides both a qualitative description of the phenomena and quantitative predictions such as when a crash may occur or how long a bear market will continue.

Most notably, the strongest proponent, Didier Sornette, accurately estimated the long bear run of the Japanese markets–what he terms a Financial Anti-Bubble:

[Chart: the Nikkei anti-bubble]

He has written numerous papers, a book, and even a TED Talk.

We also introduced these ideas in a previous post:  Noisy Time Series II: Earth Quakes, Black Holes, and Machine Learning

Power Laws and Symmetry Breaking

It is well known that financial markets display non-Gaussian, long-tail, power-law-like behavior.  After a crash, the price series itself p(t) may follow a power law

p(t)=A+B(t-t_{c})^{\alpha}

where t_{c} is the time of the crash, t>t_{c} , and \alpha is a real number.

We can fit the recent Bitcoin EOD (End of Day) prices to a power law, starting at its peak on Nov 29, 2013:

[Plot: power-law fit of the Bitcoin EOD prices]

We obtain a modestly good least squares fit, with \alpha=0.32 and R^{2} = 0.77 .  Clearly there is a lot of noise–or is there?
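As a sketch of how such a fit can be done in Python–assuming prices is a numpy array of the EOD closes starting at the peak; the starting guesses here are illustrative, not the fitted values:

import numpy as np
from scipy.optimize import curve_fit

def power_law(t, A, B, alpha):
    # p(t) = A + B*(t - t_c)^alpha, with t measured in days since t_c
    return A + B * np.power(t, alpha)

t = np.arange(1, len(prices) + 1, dtype=float)   # days since the peak t_c
(A, B, alpha), _ = curve_fit(power_law, t, prices,
                             p0=[100.0, 900.0, -0.3], maxfev=10000)

residuals = prices - power_law(t, A, B, alpha)
r2 = 1 - np.sum(residuals**2) / np.sum((prices - prices.mean())**2)
print(f"alpha = {alpha:.2f}, R^2 = {r2:.2f}")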

RG theory suggests that near a crash, the power law may become complex (i.e. a form of ‘symmetry breaking’).  We will now sketch how this comes about, using the Nobel Prize-winning Renormalization Group theory.

Complex Power Laws and the Renormalization Group

When examining a price series p(t) , or some other phenomena, say F(x) , a critical point occurs when we observe a rapid, sharp spike–i.e. Bitcoin’s fast rise up to Nov 29, 2013.

Here, we will find that as x\rightarrow 0

F(x)=A+B(x)^{\alpha} ,

where \alpha is a complex critical exponent

We say F(x) becomes singular near a crash when there exists a k-th derivative that becomes infinite as x\rightarrow 0 :

\lim_{x\rightarrow0}\dfrac{d^{k}F(x)}{dx^{k}}=\infty

We also know that near a critical point, physical systems exhibit scale invariant (i.e. fractal) behavior. The noise, or fluctuations, we observe on small time scales x=t-t_{c} look just like fluctuations on larger time scales \phi(x) .

We call \phi(x) the RG flow map.  The flow map describes these changes of time scale, and the RG equation describes the invariances.

The fundamental RG equation (eqn) is:

x\rightarrow\phi(x)

F(x)\rightarrow g(x)+\dfrac{1}{\mu}F(\phi(x))

We assume F(x) is continuous–which is non-trivial for price series–and that \phi(x) is differentiable.  g(x) is the non-singular (regular) part of F(x) .

The magic of the RG theory is that it allows us to describe the behavior near a critical point knowing only the flow map \phi(x) and the regular function g(x) .  The formal solution is an infinite order power series

F(x)=\underset{n\rightarrow\infty}{\lim}f^{n}(x) ,

f^{n}(x)=\sum_{i=1}^{n}\dfrac{1}{\mu^{i}}g(\phi^{[i]}(x))

We will sketch a very simple, leading order approximation to F(x)  that leads to a complex power law.

RG Theory and Discrete Scale Invariance

We start by first assuming power law behavior (at lowest order):

F_{0}(x)\sim x^{\alpha}

We seek the simplest RG solution.  Let us assume the RG flow map is linear in x :

\phi(x)=\lambda x+\cdots ,

where \lambda > 1 (for technical reasons to ensure the critical point is an unstable fixed point of the RG flow).

We will ignore the regular solution g(x) and write the simplified RG eqn:

F(\lambda x)=\mu F(x)

For our simple power law, we have

(\lambda x)^{\alpha}=\mu x^{\alpha}

This is easy to satisfy if

\dfrac{\lambda^{\alpha}}{\mu}=1

or

\alpha=\dfrac{log(\mu)}{log(\lambda)}

We seek the most general solution, applicable to discrete physical systems, such as earthquakes, ruptures in materials–and financial market crashes.   That is, we expect \lambda to take on discrete values:

\lambda\in[\lambda_{1},\lambda_{2},\cdots]

We call this Discrete Scale Invariance (DSI)

The most general solution of DSI is

F(x)=x^{\alpha}P(\dfrac{\log x}{\log \lambda})

where P() is a general periodic function.  As above, we express P() as the limit of an infinite power series

P_{N}(x)=\sum_{n=1}^{N}c_{n}\exp\left[2\pi in\left(\dfrac{\log x}{\log\lambda}\right)\right]

This is a classic Weierstrass function–a pathological function that, in the infinite limit, is continuous everywhere and yet differentiable nowhere.  It is used to model fractals, natural systems that are scale invariant, and highly discontinuous but structured time series.

In a crash scenario, we expect that the true power series for P(x) will diverge; i.e. the k-th derivative explodes.  So here, we only take the leading term of our linearized model

\exp\left[2\pi i\left(\dfrac{\log x}{\log\lambda}\right)\right]

Note that it takes the form

F(x)=F_{0}(x)p(\log F_{0}(x))

so F_{0}(x), simple power law behavior, is like the regular part of the solution.

We now see that we have a complex critical exponent \alpha :

\dfrac{\lambda^{\alpha}}{\mu}=e^{2\pi in}

or

\alpha=\dfrac{log(\mu)}{log(\lambda)}+i\dfrac{2\pi n}{log(\lambda)}

We consider the real part of the complex power law

Re[x^{\alpha+i\beta}]=x^{\alpha}cos(\beta\log x)

which gives the following log periodic formula for the price series

p(t)=A+B(t-t_{c})^{\alpha}(1+C\cos(\omega\log(t-t_{c})+\phi))

Instead of fluctuating randomly around the underlying power law drift, the price series exhibits log-periodic oscillations–the DSI signature of a crash.
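The log-periodic formula is also easy to code and fit with scipy; here is a minimal sketch (parameter names and starting values are illustrative, not the ones used for the fits shown below):

import numpy as np
from scipy.optimize import curve_fit

def log_periodic(t, A, B, alpha, C, omega, phi):
    # p(t) = A + B*(t-t_c)^alpha * (1 + C*cos(omega*log(t-t_c) + phi)),
    # with t already measured relative to the crash time t_c (t > 0)
    t = np.asarray(t, dtype=float)
    return A + B * t**alpha * (1 + C * np.cos(omega * np.log(t) + phi))

# e.g.:
# params, _ = curve_fit(log_periodic, t, prices,
#                       p0=[100, 900, 0.3, 0.1, 6.0, 0.0], maxfev=50000)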

This model appears to work for 1-2 oscillations before (or after) the crash.  To model the longer time anti-bubble behavior, Sornette has developed an extended formula, based on the third-order Landau expansion (as opposed to, say, additional terms in the power series defined above).  We leave these details and further analysis for later.

DSI Fit of the Bitcoin Crash

We now fit the latest Bitcoin behavior, assuming the crash / Bear market started on Nov 29, 2013 :

[Plot: DSI (log-periodic) fit of the Bitcoin crash]

We readily find a DSI fit, with R^2=0.89 .

It is actually quite difficult to get a very tight fit, and usually advanced methods are needed. This is a simple, crude fit done to illustrate the basic ideas.

We see that Bitcoin, like other financial markets, displays log periodic behavior, characteristic of a self-organized crash.

 These kinds of crashes are not caused by external events or bad players–they are endemic to all markets and result from the cooperative actions of all participants.

We can try to detect them, perhaps even profit off them, but it is unclear how to avoid these Self-Organized Critical points that wreak havoc on our finances.

Conclusion: How Nature Works

Bitcoin is an attempt to create a new kind of currency–a CryptoCurrency–that is free from both institutional control and individual corruption.  It is hoped that we can avoid the kind of devastating inflation, devaluation, and crashes that occur all too frequently in our current worldwide banking systems.  The promise is that Bitcoin, backed by the Blockchain, can remove our dependence on a single, faulty institution and replace it with a decentralized, distributed form of trust.  But

“Real life operates at a critical point between order and chaos”

Our analysis here shows that even the Bitcoin economy still appears to behave like a traditional market–prone to the kinds of crashes that frequently arise in all natural self-organized systems.

This Self-Organized Criticality is perhaps just how nature works [6,7].

And while Bitcoin is certainly interesting and exciting, perhaps it too is subject to the natural laws of physics.

References

[1] 1995  James A. Feigenbaum and Peter G.O. Freund,  Discrete Scaling in Stock Markets Before Crashes

[2] TED Talk: How We Can Predict the Next Financial Crises

[3] What will upend the economy next? Economist didier sornette explains

[4] D. Sornette, Critical market crashes

[5] 1999 Predicting Financial Crises using Discrete Scale Invariance

[6] 1999 A. Johansen and  D. Sornette, Financial “Anti-Bubbles”: Log-Periodicity in Gold and Nikkei collapses

[7] Anders Johansen and Didier Sornette, Evaluation of the quantitative prediction of a trend reversal on the Japanese stock market in 1999 , 2000

[8] 2003 Wei-Xing Zhou and  Didier Sornette  Renormalization group analysis of the 2000–2002 anti-bubble in the US S&P500 index: explanation of the hierarchy of five crashes and prediction 

[9] 2012, Didier Sornette, Ryan Woodard, Wanfeng Yan, and Wei-Xing Zhou, Clarifications to Questions and Criticisms on the Johansen-Ledoit-Sornette Bubble Model

[10] http://guava.physics.uiuc.edu/~nigel/courses/563/essay/lin.ps.gz

[11] Per Bak,  “How Nature Works: The Science of Self-Organized Criticality”,  1996

SVM+ / LUPI: Learning Using Privileged Information

Recently Facebook hired Vapnik, the father of the Support Vector Machine (SVM) and the Vapnik–Chervonenkis Statistical Learning Theory.

Facebook Hires Vapnik

Lesser known, Vapnik has also pioneered methods of Transductive and Universum Learning.  Most recently, however, he has developed the theory of Learning Using Privileged Information (LUPI), also known as the SVM+ method.  In this post, we examine this most recent contribution.

Preface

Once again, we consider the problem of trying to learn a good classifier having a small set of labeled examples x_{l}\in L .

Here, we assume we have multiple views of the data; that is, different, statistically independent feature sets \phi(x),\phi^{*}(x) that describe the data.

For example, we might have a set of breast cancer images, and some holistic, textual description provided by a doctor.  Or, we might want to classify web pages, using both the text and the hyperlinks.

Multi-View Learning is a big topic in machine learning, and includes methods like SemiSupervised CoTraining, Multiple Kernel Learning, Weighted SVMs, etc.  We might even consider Ensemble methods as a kind of Multi-View Learning.

Having any extra, independent info x^{*} about our data is very powerful.

In LUPI, also called the SVM+ method, we consider the situation where we only have extra information x^{*}_{l} about the labeled examples x_{l}\in L .

Vapnik calls \{x^{*}\} Privileged Information–and shows us how to use it.

The VC Bounds

If we are going to discuss Vapnik, we need to mention the VC theory.  Here, we note an important result about soft and hard-margin SVMs: when the training set is non-separable, the VC bounds scale as \dfrac{h}{\sqrt{L}} :

P_{test}\;\le\;\nu_{train}+O\left(\dfrac{VCdim}{\sqrt{L}}\right)

where \nu_{train} is the training error, and h=VCdim is the VC dimension.

But when the training set is separable,  i.e. \nu_{train}=0 , the VC bounds scale as \dfrac{h}{L}

P_{test}\;\le\;O\left(\dfrac{VCdim}{L}\right)

This means we can use, say, L=300 labeled examples as opposed to L=90,000 (since \sqrt{90,000}=300 )–a huge difference.

It turns out we can also achieve \dfrac{h}{L} scaling–if we have an Oracle that tells us, a priori, the slacks.  Hence the term Oracle SVM.    The Oracle tells us how much confidence we have in each label.

So if we can get the slacks \{\varepsilon_{l}\} , or, equivalently, the weights, for each instance, L can be much smaller.

[Of course, this is only true if we can sample all the relevant features.]

Soft Margin SVM

In practice, we usually assume data is noisy and non-separable, so we use a soft-margin SVM where we can adjust the amount of slack with the C (cost) parameter.

Note that we also have the bias b , but we can usually set it to zero (still, be careful doing this!).

Let us write the soft-margin SVM optimization as

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+C\sum_{l\in L}\varepsilon_{l}\right)

subject to the constraints

y_{l}\left(\langle w_{l},x_{l}\rangle+b\right)\ge 1-\varepsilon_{l}\;\;\forall\;l\in L,\;\;\varepsilon_{l}\ge 0

Notice that while we formally add slack variables \{\varepsilon_{l}\} to the max-margin optimization–they eventually vanish.  That is, we don’t estimate the slacks; we replace them with the maximum margin violation, or the Hinge Loss error

\hat{err}_{l}=\max[1-y_{l}(\langle w,x_{l}\rangle+b),0]

We then solve the unconstrained soft-margin SVM optimization, which is a convex upper bound to the problem stated above–as explained in this nice presentation on LibLinear.

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+C\sum_{l\in L}\max[1-y_{l}(\langle w,x_{l}\rangle+b),0]\right)

or, in simpler terms

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+C\sum_{l\in L}\hat{err}_{l}\right)

And this works great–when we have lots of labeled data L.
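As a concrete sketch (assuming a linear kernel, zero bias, and numpy arrays X and y with ±1 labels; names are illustrative), the unconstrained objective above is just:

import numpy as np

def soft_margin_objective(w, X, y, C):
    # 0.5*<w,w> + C * sum_l max(0, 1 - y_l*<w, x_l>)  (Hinge Loss, b = 0)
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * (w @ w) + C * hinge.sum()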

What if we could actually estimate or assign all the L slack variables \{\varepsilon_{l}\} ?  In principle, we could then use far fewer labeled examples L.  This is the promise of LUPI.

LUPI/SVM+

In LUPI,  we use the privileged information \{x^{*}_{l}\} to learn the slack variables \{\varepsilon_{l}\} :

\varepsilon_{l}=\langle w^{*},x_{l}^{*}\rangle+b^{*}

We model the slack as linear functions of the privileged information.

Effectively, the privileged information provides a measure of confidence for each labeled instance.

To avoid overtraining, we apply a max-margin approach.  This leads to an extended, SVM+ optimization problem, with only 2 adjustable regularization parameters C,\gamma (and 2 bias params):

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+\dfrac{\gamma}{2}\langle w^{*},w^{*}\rangle+C\sum_{l\in L}[\langle w^{*},x_{l}^{*}\rangle+b^{*} ]\right)

subject to the constraints

y_{l}\left(\langle w_{l},x_{l}\rangle+b\right)\ge 1-[\langle w^{*},x_{l}^{*}\rangle+b^{*}],

\langle w^{*},x_{l}^{*}\rangle+b^{*}\ge 0\;\;\forall\;l\in L

Show me the code?

The SVM+/LUPI problem is a convex (quadratic) optimization problem with linear constraints–although it does require a custom solver.  The current approach is to solve the dual problem using a variant of SMO.  There are a couple of versions (below) in Matlab and Lua.  The Matlab code includes the related SVM+MTL (SVM+ MultiTask Learning).  I am unaware of an open source C or Python implementation similar to SvmLin or UniverSvm.
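Still, for small problems, a hedged sketch of the SVM+ primal in cvxpy shows the structure (cvxpy is not one of the solvers above, will not scale like a custom SMO solver, and the names are illustrative):

import cvxpy as cp
import numpy as np

def svm_plus(X, Xstar, y, C=1.0, gamma=1.0):
    """X: (n,d) standard features; Xstar: (n,d*) privileged features; y: +/-1."""
    n, d = X.shape
    w, ws = cp.Variable(d), cp.Variable(Xstar.shape[1])
    b, bs = cp.Variable(), cp.Variable()

    slack = Xstar @ ws + bs              # slacks modeled on the privileged info
    objective = cp.Minimize(0.5 * cp.sum_squares(w)
                            + 0.5 * gamma * cp.sum_squares(ws)
                            + C * cp.sum(slack))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - slack,
                   slack >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value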

To overcome the solver complexity,  other methods, such as the Learning to Rank method, have been proposed.  We will keep an eye out for useful open source platforms that everyone can benefit from.

Summary

The SVM+/LUPI formalism demonstrates a critical point about modern machine learning research.  If we have multiple views of our data,  extra information available during training (or maybe during testing), or just several large Hilbert spaces of features, then we can frequently do better than just dumping everything into an SVM, holding our nose, and turning the crank.

References:

or how many ways can we name the same method?

Learning with Hidden Information

The SVM+ method: MIT Reading Notes

Learning with Non-Trivial Teacher

Learning to Rank Using Privileged Information

On the Theory of Learning with Privileged Information

SMO-style algorithms for learning using privileged information

Learning Using Privileged Information: SVM+ and Weighted SVM

Learning with Asymmetric Information

Comparison of SVM+ and related methods (SVM+MTL, rMTL, SVM+ regression, etc.)

Videos

Learning with Privileged Information

Simmons Foundation Lecture

Software

Matlab and CVX Versions

Milde LUA Code

Machine Learning with Missing Labels Part 3: Experiments

Another early draft of the next blog post

In this series of posts we look at Transductive and SemiSupervised Learning–an old problem, a hard problem, and a fundamental problem in Machine Learning.  Unlike Deep Learning or large scale ML,

We want to learn as much as we can from as little labeled data as possible.

We focus on text classification.  Why?

  1. The feature space is simple — it’s just words.
  2. With the true labels, a simple SVM gives near perfect accuracy.
  3. Text satisfies the cluster assumption.

Why isn’t this done all the time?

We can generate good (not great) Semi Supervised text classifiers.   But

  1. creating the input data is hard.
  2. setting the input parameters is harder.
  3. we can’t always distinguish the good solutions from the bad.

The 1st post looked at (1); here we try to understand (2) and (3), and look for simple, practical heuristics to make these old methods useful.

We can think of the TSVM/S3VM as generating several potentially good labelings of the unlabeled data.  To solve (3), we need to find the best labeling–which means judging the label and/or classifier quality, using measures such as the label entropy, cluster overlap, etc.

Or, alternatively, we might try to average the good solutions, similar to the way Hinton et al. have recently suggested averaging an ensemble of models to take advantage of Dark Knowledge.

Thanks to Matt Wescott for pointing out this PDF and discussions on it.

To explore this, we run experiments using some old, existing, open source Transductive SVM (TSVM) and SemiSupervised SVM (S3VM) codes:

Our work is guided by  the results in this 2014 paper on the QN-S3VM method, as applied to text data.

We thank Fabian Gieseke for help in reproducing and understanding these results.

For the TSVM, some available codes and algorithms are

  • SvmLight:  Graph Cutting Algo
  • SvmLin:  MultiSwitch (MS-TSVM) and Deterministic Annealing (DA-TSVM)
  • UniverSvm: Convex Concave Continuation Procedure (CCCP)
  • QN-S3VM: Quasi Newton SemiSupervised SVM

We can also compare the results to an SVM baseline, using either LibLinear or Scikit-Learn.

All these codes are Unix standalone C programs except QN-S3VM and Scikit-Learn, which are Python.

We examine SvmLin and its MultiSwitch (MS-TSVM) method.

SvmLin is the TSVM benchmark for most studies.  After we work this up, we will move on to the other codes listed.  I also would like to test other TSVM/S3VM methods, such as S3VM-MIP, S4VM, S3VMpath, and many others, but many are Matlab codes and I don’t have C or Python open source code readily available.  Or really the time :P

We do not consider cases of weakly-supervised learning, where the labels may be missing and/or have errors…but let me point out a 2013 paper on the WellSVM method.  It also compares the above methods and is worth a read.

I hope this post gives some flavor to what is a very complicated and fundamental problem in machine learning.  (And maybe take some ideas from Deep Learning to help solve the TSVM problem as well)

Methods: The real-sim Data Sets:

We care about text classification. To this end, we reproduce the performance of SvmLin, and, later,  other methods, on the real-sim text dataset (listed in Table 2 in the paper).  The real-sim dataset was used in the original SvmLin paper; it is an accepted baseline for these methods.

This data set consists of  72309 labeled documents: 22238 (+1) and 50071 (-1).

The fraction of positively labeled documents R^{+}_{true}=30.75% .

Creating the Labeled, Unlabeled, and Holdout Sets:

I have created an IPython Notebook (make_realsim_inputs.ipynb) which can read the real-sim data and generate the input files for the various C programs listed above: SvmLight, SvmLin, & UniverSvm.

We can see it in the NBViewer here

The real-sim dataset is split in half, and 3 data sets are created.   Half is used to train TSVM (labeled L and unlabeled U); the rest is a holdout set (HO) for testing.

Following Table 2, the notebook generates:

  • labeled (L) of sizes roughly: l=90, 180, 361, 1446, & 2892
  • an unlabeled set (U) of size u=36154-l.
  • a test or HoldOut set HO (of the remaining data)

Each data set (L, U, HO) also has ~0.31 fraction of positively labeled examples; U and HO are statistical replicas of L, albeit larger.
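A rough Python sketch of the split logic (a simplified stand-in for the notebook, with illustrative names; stratification preserves the ~0.31 positive fraction in each piece):

import numpy as np
from sklearn.model_selection import train_test_split

def make_splits(X, y, l=90, seed=0):
    # half for training (L + U), half for the holdout HO, stratified on y
    X_tr, X_ho, y_tr, y_ho = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    # carve a small labeled set L out of the training half; the rest is U
    X_l, X_u, y_l, y_u = train_test_split(
        X_tr, y_tr, train_size=l, stratify=y_tr, random_state=seed)
    return (X_l, y_l), (X_u, y_u), (X_ho, y_ho)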

The input files consist of a training file (or 2 files for SvmLin) and 3 test files (6 for SvmLin): labeled L, unlabeled U, and holdout HO.  They are in the SvmLight format.

I.e. the SvmLin input files, for l=90,  are:

filename                       size (lines)    purpose
svmlin.train.90.examples          36154        labeled + unlabeled; train the TSVM
svmlin.train.90.labels            36154
svmlin.testL.90.examples             90        train the labeled baseline
svmlin.testL.90.labels               90
svmlin.testU.90.examples          36154        evaluate the unlabeled set
svmlin.testU.90.labels            36154
svmlin.testHO.90.examples         36155        evaluate the holdout set
svmlin.testHO.90.labels           36155

Grid Searches with gnu parallel

Astoundingly, rather than performing a full grid search, many research papers fix the regularization parameters, guess them using some crude Bayesian scheme, or attempt to transfer them from some other data set. This makes it really hard to evaluate these methods.

We use a ruby script and the gnu_parallel program to run the command line unix programs and grid search the parameters. The script also computes the final HoldOut accuracy and metrics such as the margin and label entropy.

svmlin.rb

Gnu parallel lets us easily grid search the regularization parameters by running the TSVM codes in parallel.

We do a broad grid search

W\in[.0001,0.001,0.01,0.1,1,10,100,1000,10000]

U\in[.0001,0.001,0.01,0.1,1,10,100,1000,10000]

R\in[0.25, 0.26, 0.27,\cdots,0.35,0.36,0.37]

by creating a directory for each run

 

parallel "mkdir A2W{1}U{2}R{R}" ::: .0001 0.001 0.01 0.1 1 ::: 1 10 100 1000 1000 ::: 0.28 0.29 0.31 0.32 0.33 0.34

and then running SvmLin in each directory, simultaneously

 

parallel "cd A2W{1}U{2}R{R}; $SVMLIN -A 2 -W {1} -U {2} -R {3} ../svmlin.train.examples.90 ../svmlin.train.labels.90 > /dev/null"  ::: 0.0001 0.001 0.01 0.1 1  :::  1 10 100 1000 10000 ::: 0.28 0.29 0.31 0.32 0.33 0.34

We can then evaluate each run on the U or HO data set.  We will find that the optimal W and U are in the range of the original crude estimates: W=0.001, U=1, and R=0.31.
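The evaluation step looks roughly like this in Python (the ruby script does the equivalent; the output file name is illustrative and depends on how svmlin was invoked):

import glob
import numpy as np

y_ho = np.loadtxt("svmlin.testHO.90.labels")      # ground-truth HO labels
results = {}
for d in glob.glob("A2W*U*R*"):
    try:
        # svmlin writes one real-valued output per test example
        outputs = np.loadtxt(f"{d}/svmlin.testHO.90.examples.outputs")
    except OSError:
        continue                                  # run failed or file missing
    results[d] = np.mean(np.sign(outputs) == y_ho)

best = max(results, key=results.get)
print(best, results[best])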

How can we evaluate the accuracy of our TSVM, here, and in a real world setting?

Ground Truth

Since we know the labels on both the U and HO sets, these provide a ground truth from which we can evaluate how well the best model(s) perform.  We can get an upper bound on the best possible Generalization Accuracy (on the HO set) by training an SVM on L+U using the true labels, and then applying this model to HO.

The best expected HO accuracy here is ~ 97 %

Also, note this is different from the Reconstruction Accuracy on HO, which is > 99 %.

We might also obtain a better generalization accuracy with different features, such as applying GloVe or even unsupervised pre-training on L+U.  We will examine this in a future blog.  We are, in particular, interested in comparing Deep Learning Auto-Encoders with Convex NMF, including recent variants applied to document clustering.

Baseline

We need a baseline for comparison.  The IPython Notebook computes a baseline accuracy for the labeled data set (L); this can also be done with LibLinear or even SvmLin (using -A 1).

We then run the small L model on the HoldOut (HO) set for each l=90,180,…,  yielding baseline accuracies of

split size    size of L (l)    HO Accuracy
0.0025              90             85 %
0.005              180           87.5 %
0.01               361             90 %
0.04              1446             93 %
0.08              2892             94 %

Please note that these are computed from 1 random sample, and may be slightly different (by ~1%) for each run.  Also, the command-line and scikit-learn liblinear have different defaults; we use C=10, fit_intercept=False.

Finding the Best Accuracy:

We are generating a large number of nearly equivalent models, parameterized by

[\Gamma_{L}(U,W,R^{+})]

The inputs are related to the objective function in blog post 1.

W=\gamma , U=\gamma' ,  and r=0.31

Every useful model will have the same reconstruction accuracy on the labeled examples L, and every model proposes a different soft labeling for the unlabeled examples U.

How can we compare different models \Gamma_{L}(U,W,R^{+}) ?

Essentially, we use a margin method to guess good labelings of the data, but we need an unsupervised heuristic to select the optimal labeling.  Can we simply apply Cross-Validation to the small labeled set L?  Or Leave One Out CV (LOO-CV)?  What filters can we apply to speed up the calculations?

Best Accuracy on the Labeled Set

SvmLin may produce inferior or even nonsense models.  But more subtly, some models may even have a very high training accuracy (on U) and a very high test accuracy (on HO), but a very low accuracy on L.  These are not useful because in the real world, we don’t know the labels on U or HO.

We always want to first filter the possible solutions by selecting those with the best accuracy on L.

Filter by Final Objective Function  (FOF)

Since we are minimizing the objective function, we only consider solutions with FOF < 0.01.  Note this is very different from simply assuming the best solution has the absolute minimum FOF across all input parameters.

 We discard all solutions with Final Objective Function (FOF) > 0.01

For l=180, for example, this reduces the set of possible solutions from 1430 to 143.

Maximum Margin Solution…NOT

Wait, can’t we just take the solution with the maximum margin (or 1/norm)?

m=\dfrac{1}{\sum_{i}w_{i}^{2}}

No.  Think about how we practice supervised learning; we train a model, and set the regularization parameters by cross-validation.  In the TSVM and SSL methods, we can also apply CV (or LOO-CV), but only to the labeled set L.

The maximum margin solution is only best for a given set of input/regularization parameters (U,W,R).  Every model \Gamma_{L}(U,W,R^{+}) induces a different labeling on U (that is, they form an equivalence class on L, not L+U).  In fact, the best overall solution has the minimum margin amongst a broad scan of (U,W):

[Plot: margin m across the grid of (U,W) solutions]

where the output weights w_{i} are taken from

svmlin.training.examples.90.weights
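For example, a small sketch that computes m from the weights file:

import numpy as np

w = np.loadtxt("svmlin.training.examples.90.weights")
m = 1.0 / np.sum(w ** 2)          # m = 1 / sum_i w_i^2, as defined above
print(f"margin m = {m:.4g}")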

Optimal Parameters for the HoldOut Test Accuracy

(this section is still under construction)

Cross Validation and Leave One Out on the Labeled Data

As with a standard SVM, one can try to apply Cross Validation (CV) to the labeled set L to select the best TSVM model.  Most academic studies run 5-fold CV on the L set.  This is harder, however, because

  1. when L is small, the i.i.d. assumption fails, so CV may give spurious results
  2. we really should use Leave One Out Cross Validation (LOO-CV), but this increases the run time significantly
  3. the SvmLin code does not include CV or LOO-CV
  4. the SvmLin DA-TSVM is unstable and I had to switch to MS-TSVM to complete this
  5. my laptop can’t take any more overclocking so I had to pause this for a bit

I will finish this section once I either

  1. modify SvmLin to run CV / LOO-CV
  2. modify and use UniverSvm and the CCCP method, which is an order of magnitude faster than TSVM
  3. and/or get this all working in IPython+Star Cluster so I can run LOO-CV at scale
Minimum Entropy of the Soft Labels

The SvmLin DA algo minimizes the entropy on the unlabeled examples S_{u} (it is the key stopping criterion, sec 3.2.3).  Perhaps the model with minimum S_{u} generalizes best?

This is the essence of Entropy Regularization–a common component of many Semi-Supervised Learning formulations.

We can evaluate the proposed soft labels \mu_{u} of the unlabeled set U, found in

svmlin.testU.examples.90.outputs

We convert soft labels to a probability p_{u} :

p_{u}=\dfrac{1/(1+\exp(-\mu_{u}))}{\sum_{u}1/(1+\exp(-\mu_{u}))}

and compute the Entropy on U

S_{u}=-\sum_{u}(p_{u}ln(p_{u})+(1-p_{u})ln(1-p_{u}))

We hope that TSVM solutions with minimum S_{u} will consistently yield a very good Generalization Accuracy (on the HO set) across a grid of search (U,W,R)–and preliminary results confirm this.
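A sketch of this computation in Python, following the formulas above (output file name as in the listing earlier; a small epsilon guards the logs):

import numpy as np

mu = np.loadtxt("svmlin.testU.examples.90.outputs")   # soft labels mu_u
sig = 1.0 / (1.0 + np.exp(-mu))                       # logistic transform
p = sig / sig.sum()                                   # normalized p_u
eps = 1e-12                                           # avoid log(0)
S_u = -np.sum(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
print(f"S_u = {S_u:.4f}")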

Let’s look at the l=90 case, and plot first the HO accuracy vs the Label Entropy S_u .

[Plot: HO accuracy vs label entropy S_u, for l=90]

(Recall we only take solutions with FOF < 0.01 and the best training accuracy on the true labeled set.)

We see 3-4 distinct sets of solutions with nearly equivalent entropy across a wide range of HO accuracy, and 2 classes include improved solutions (HO Accuracy > baseline 85%).

We call these Equivalence Classes under the Label Entropy.

Other solutions also show that the minimum Entropy solution is the best solution–for a fixed R^{+}_{in} :

[Plots: HO accuracy vs label entropy S_u at fixed R^{+}_{in}]

We select the minimum Entropy class of solutions.

Notice that the l=90 case is greatly improved above the 85% baseline, whereas l=1446 shows only a slight (0.5%) improvement–and 2-3% less than the best expected accuracy of 97%.  We see the same results in the QN-S3VM and WellSVM papers, which suggests that

Transductive and SSL Learning may be limited to small labelled sets

where the percentage of labeled data is L\sim 10-15%  of U.

Predicted Fraction of Positive Examples

Recall we have to input an estimate R^{+}_{in} of the fraction of positive labels on U, R^{+}_{U} .

We assume we estimate R^{+}_{U} to within 10% using about 100 examples (l\sim O(100))

We would prefer a high quality estimate if possible, because the final, predicted \hat{R}^{+}_{U} and \hat{R}^{+}_{HO} appear to depend strongly on R^{+}_{in} .

Here we test varying the input R^{+}_{in} .  We expect the predicted \hat{R}^{+}_{HO} to be correlated with the generalization accuracies, and preliminary results also confirm this (note that all plots show the predicted fraction \hat{R}^{+}_{HO} ):

[Plot: HO accuracy vs predicted fraction \hat{R}^{+}_{HO}]

The predicted fraction \hat{R}^{+}_{HO}  also forms Equivalence Classes \Gamma_{L,S}(R^{+}) under the Entropy S_{U} .

[Plot: equivalence classes of \hat{R}^{+}_{HO} under the entropy S_{U}]

Of course, we know the true R^{+}_{HO}=0.31 , so the minimum Entropy solution is not the absolute best solution.

Perhaps a more generalized form of the Entropy would be more discerning?   We would prefer a python toolbox like this one and need a way to fit the parameters.

We might also try to improve the TSVM search algorithm, as in the S4VM: Safe Semi-Supervised SVM methods.  The S4VM tries to improve upon the SvmLin DA-TSVM algo using Simulated Annealing, and then selects the best solutions by evaluating the final cluster quality.  This looks promising for simple applications.  Indeed, one might even try to implement it in python using the recent basinhopping library.

Transduction and the VC Theory

We have said that the VC Theory is really a theory about Transductive Learning.

We see now that to apply the TSVM in practice, the unlabeled data set U really needs to be an accurate statistical replica, as in the VC Statistical Learning Theory for Transduction.

We have tested cases where we have a good replica but did not estimate R^{+}_{U} well; we have not yet tested cases where U is NOT a good replica and we can only estimate R^{+}_{U} partially.  This needs to be done.

In the literature this is called using class proportions, and some recent estimation methods have been proposed (and here)–although the QN-S3VM papers do not attempt this.  Recently, Belkin has shown how to estimate this fraction using traditional integral operator methods.

The TSVM accuracy may depend on how well one can estimate the fraction (r) of positively labeled examples.

Also, we presume that TSVM models overtrain, and are biased towards either the (+) or (-) class.

[Plot: bias of the predicted fraction \hat{R}^{+}_{HO} across TSVM models]

For l=1446, we find a single class of minimum Entropy solutions

[Plot: the minimum entropy class of solutions for l=1446]

with S\approx-5.467 , W=0.001, U=1, and \hat{R}^{+}_{HO}\in[0.25,0.37] .

They all have a high HO Accuracy–but may be biased towards choosing (+) labels because the minimum S solution has \hat{R}^{+}_{HO}=0.36 , not 0.31.

If we can’t estimate R^{+}_{U} well, we may be able to average the minimum Entropy solutions, thereby canceling out the biases in the over-trained models (i.e. average all \hat{R}^{+}_{U}\in[0.28\cdots 0.34] TSVM models).

This is what Dark Knowledge is all about–creating a simpler model by ensemble averaging–and we hope these recent advances in Deep Learning can help with Transductive Learning as well.

It is noted that the recent WellSVM method also addresses this problem.  WellSVM uses a convex relaxation of the TSVM problem to obtain a min-max optimization over the set of possible models allowed within a range of R^{+}_{U} ; this is quite clever!

In a future blog, we will examine how the current TSVMs perform when tuning their parameters to obtain the best predicted fraction. If this is sound, in a future post, we will examine how to estimate the class proportions for text data.

Discussion

to come soon..stay tuned

Appendix

Each TSVM is trained with a linear kernel.  The regularization parameters are adjusted to give both optimal performance on the initial training set (L) and the best reconstruction accuracy on the (U) set.

Each program requires slightly different command line options for these.  There are also tuning parameters for each solver, and it is necessary to set the tolerances low enough so the solvers can converge.  For example, even for a linear SVM, the primal solvers may not converge as rapidly as the dual solver, and can even give different results on the same data set (despite Von Neumann’s minimax theorem).  And the default for scikit-learn LinearSVC has the Bias on, whereas LibLinear has the Bias off.  So be careful.

Below we show examples of training each of the TSVMs on the labeled set (L), and evaluating the accuracy on a HoldOut set (HO), for l=90.  We show typical values of the regularization parameters, but these do need to be tuned separately.  We also show how to run the popular LibLinear program as a baseline (which uses the svmlight format); the paper uses LibSVM.

Baseline Linear SVM:

liblinear  -C 0.1  -v 5  svmlight.testL.90

MS-SvmLin:

svmlin  -A 2  -W 0.001 -U 1 -R 0.31   svmlin.train.examples.90 svmlin.train.labels.90

svmlin  -f   svmlin.train.examples.90.weights svmlin.testHO.examples.90 svmlin.testHO.labels.90

SvmLin DA:

svmlin  -A 3  -W 0.001 -U 1 -R 0.31    svmlin.train.examples.90 svmlin.train.labels.90

svmlin  -f   svmlin.train.examples.90.weights svmlin.testHO.examples.90 svmlin.testHO.labels.90

SvmLight Graph Cuts:

svm_learn -p 0.31 svmlight.train.90  svmlight.model.90

svm_classify  -C 100  svmlight.testHO.90  svmlight.model.90   svmlight.outHO.90

UniverSvm CCCP TSVM w/ramp loss:

universvm -C 1 -c 1 -s -0.5 -z 1 -o 1 -T universvm.testHO.90 universvm.train.90 

Machine Learning with Missing Labels Part 2: The UniverSVM

Ever wonder what Google DeepMind is up to?  They just released a paper on Semi-Supervised learning with Deep Generative Models.   What is Semi Supervised Learning (SSL)? In this series of posts, we go back to basics and take a look.

Most machine learning algos require huge amounts of labeled data to achieve high accuracy. SVMs, Random Forests, and especially Deep Learning, can take advantage of massive labeled data sets.

How can we learn when we only have a few labeled examples?

In SSL, we try to improve our classifiers by taking advantage of unlabeled data.

In our last post we described an old-school approach to this problem, the Transductive SVM (TSVM), circa 2006.  Here, we examine a different approach to learning from unlabeled data–Inference with the Universum–à la Vapnik 2006.

Plus, there is code!  The UniverSVM code has a TSVM and the Universum (USVM) approach.

Wait, doesn’t Deep Learning use unlabeled data in pretraining?

This is different.   Deep Learning uses huge amounts of labelled data.    We want to use as little labeled data as possible.  To me, this is the real test of a learning theory.

In this and the next few blogs, we will examine several variants of SVMs–USVM, WSVM, SVM+, and Semi-Supervised Deep Learning methods–all of which achieve high accuracy with only a small amount of labeled data.  Also, we only consider text classification–images and other data sets require different methods.

Learning with Unlabeled Data

Let’s say we have a small number N_L of labeled documents and a large number N_U of unlabeled ones.  Can we build a document classifier (i.e. a binary SVM) that generalizes well with just a little labeled data?

[Diagram: a few labeled documents among many unlabeled documents]

According to the Vapnik and Chervonenkis Statistical Learning Theory (VC-SLT), our generalization accuracy is very loosely bounded by the model complexity and the number of training documents N_L :

 R(\alpha)-R_{emp}(\alpha)\leq2\mathcal{R}_{m}(\mathcal{H})+\kappa\sqrt{\dfrac{ln(\frac{1}{\delta})}{N_L}}

which, in plain English, is just

Generalization Error \leq Model Capacity (2\mathcal{R}_{m} ) + Size Effects (\sim\sqrt{\dfrac{1}{N_{L}}} )

The VC-SLT  inspired the SVM maximum margin approach (although they are not equivalent).

SLT recognizes that for any data set \mathcal{L} (of size N_{L} ), we may obtain multiple, equivalent models (i.e. all having the same training accuracy):

\Gamma_{L}=\left[\mathcal{H}_{1}\sim\mathcal{H}_{2}\sim\mathcal{H}_{3}\sim\cdots\right]

We should select the model \mathcal{H} from \Gamma_{L} with the smallest capacity 2R_{m}(\mathcal{H}) ; in an SVM, this means select the largest margin.

This is easiest to visualize when the slack or admissible error = 1

[Diagram: equivalence classes of separating hyperplanes with slack = 1]

Each model \Gamma_{L} is a set of hyperplanes that give the same labeling.  The left model is optimal because it has the largest ‘SVM Capacity’, which is essentially the volume carved out by the admissible hyperplanes.  It turns out the best model also has the hyperplane with the largest margin, so

The maximum margin is a measure of the VC capacity…

but is it the best?

While the VC-SLT bound is very loose, it does suggest that we usually need a large number of labeled documents N_L  to obtain any reasonable production accuracy.

The SLT is, however, inherently a theory about Transductive Learning; the proof of the VC bounds requires first proving the Transduction bound (i.e. via Symmetrization).  It has always been the dream of the VC-SLT program to develop a Transductive (or SemiSupervised) method that can learn much faster than \dfrac{1}{\sqrt{N_{L}}} .

(By Semi-Supervised, we mean that the resulting model can be applied to out-of-sample data directly, whereas Transductive learning only applies to the known, unlabeled test set.  We will not distinguish between them here.)

We would hope to achieve \dfrac{1}{\sqrt{N_{L}+N_{U}}} by adding in unlabeled data. When N_U is very large, we win.

(Indeed, recently, Vapnik has shown it is actually possible to reduce this bound from \dfrac{1}{\sqrt{N_{L}}} to \dfrac{1}{N_{L}} –if we can Learn Using Privileged Information (LUPI).  This means we can learn when N_{L}=320 whereas previously we needed N_{L}=100,000 !  But this is actually akin to assigning additional data or even weights to each labeled example–and this is not the same as using just unlabeled data.  Perhaps we can learn the weights from the unlabeled data–but that’s for another post.)  So we are left with the un-nerving statement

If the max margin is the right measure, then the TSVM should work very well…  

and yet, this has proved elusive.

What is the alternative?  Suppose instead of measuring the volume traced out by the hyperplanes, our models measured the volume between the convex hulls of the 2 classes:

[Diagram: the volume between the convex hulls of the 2 classes]

Notice that now, the labeling on the right is better.

This is a broader measure of the diversity of the equivalence class. In SLT, this is related to the VC Entropy (another measure of VC capacity).

The Universum approximates this volume–or rather the VC Entropy–via the Principle of Maximum Contradictions.  (In fact, the Universum is not the only way to do this, as we shall see.)  A clever idea–but something hard to achieve in practice.  Let us compare and contrast the TSVM and the USVM to understand how to select the data sets they need, and their limitations.

Transductive SVM (TSVM)

Transductive Inference requires a statistical replica \mathcal{R} of the labeled data \mathcal{L}:

[Diagram: the labeled set L and its statistical replica R]

The replica \mathcal{R} is not only the same size as \mathcal{L} , it has the same statistical qualities.  In particular, the label means converge: \langle y_{U}\rangle\rightarrow\langle y_{L}\rangle as N\rightarrow\infty .  (i.e. in a well balanced binary classifier, this is zero.)

In theory, we can always create a replica (or phantom sample) because we assume the labeled data itself is drawn from some common empirical process.  In modern VC-SLT, we think of the symmetrization process that creates \mathcal{R} as a Rademacher process — meaning that we have replicated the training data but randomized the labels.

In practice, we need to select the replica from unlabeled data–and this is hard!

We hope that by adding unlabeled data, we can find the best solution by guessing the labels of the unlabeled data and then maximizing the margin on all the data.

[Diagram: guessing labels on the unlabeled data and maximizing the margin over all the data]

We can apply Transductive SVMs if we can create a large, unlabeled data set \mathcal{U} that behaves like a statistical replica of \mathcal{L} , albeit much, much larger.  TSVMs allow us to increase the accuracy of a binary text classifier by adding in large amounts of unlabeled data.

Also, TSVMs extend the feature space, or hypothesis space \mathcal{H} .  This is critical for text classification since, frequently, we need to classify documents we have never seen before, and we encounter new words.  This is irrelevant for image classification.

If we have a collection of consumer blogs (about finance, beauty, sports, politics, …), with some labeled documents and a large amount of unlabeled ones, we can create (1-vs-1) binary TSVM classifiers, such as finance vs beauty–if we have a good way to select the unlabeled data, as described in our previous blog:

[Diagram: selecting unlabeled finance and beauty blogs in the training ratio]

Essentially, the unlabeled documents must consist only of finance and beauty blogs, and in the same ratio as the training data.

TSVMs only work well for simple binary (1 vs 1) classification, and only when the document classes form simple clusters.  They don’t work for multi-class classifiers because they cannot handle (1 vs all) data sets.

So while TSVMs do work, not just any unlabeled data set will do.  Indeed, I personally believe the key to a good TSVM is creating a well balanced unlabeled data set.  Or, equivalently, estimating the fraction r of (+/-) examples very well.  If the replica set is poor, or poorly understood, the TSVM results can be worse than just training an SVM on only the labeled data.

UniverSVM (USVM)

In 1998, and later in 2006, Vapnik introduced a different kind of SVM–which also allows learning from unlabeled data–but replaces the Principle of Maximum Margin with a more robust method called

the UniverSVM: the Principle of Maximum Contradictions

The idea is to add in data from classes that are significantly different from the 2 classes being separated:

[Diagram: Universum documents drawn from classes outside the two being separated]

To create a finance vs beauty classifier, we would add labeled and/or unlabeled documents from other categories, such as parenting, politics, sports, etc.  We then want a binary classifier that not only separates the 2 labeled classes, but is also as different as possible from the other classes–the Universum.

We create 2 replicas of the Universum–one with all (+) labels, and one with all (-).  We then add unlabeled documents (blue circles) that lie in the margin between classes–or really in the space between the (+/-) classes:

[Diagram: Universum points (blue circles) in the space between the (+/-) classes]

The best Universum examples will lie in the convex hull between the finance and beauty documents; those inside the convex hulls will most likely be ignored.  Since all the u-labels are wrong, every equivalence class of hyperplanes will produce numerous contradictions on the Universum points (u):

[Diagram: contradictions generated on the Universum points (u)]

The best model has the largest diversity on the Universum; in other words, the largest VC Entropy.   Vapnik has pointed out the following interpretation of the Universum: “[When trying to classify the labeled data, try to avoid extra generalizations.]”

The best model has the Maximum # of Contradictions on the Universum.

The UniverSVM (USVM)  represents a kind of prior information that we add to the problem; the difference is that instead of specifying a prior distribution, which is hard, we replace this with a more practical approach–to specify a set of specific examples.

Notice we could have just created a 3-class SVM (and the UniverSVM code provides this for direct comparison).  Or we could build a 2-class classifier, augmented with some clustering metric to avoid the other class–in the same spirit as the S4VM method.  See, for example, the recent paper on EMBLEM.  But in these simple approaches, the other class must only contain other–and that’s hard.

USVMs compared to TSVMs:

In the TSVM we have to be sure we are adding documents that are in the same classes as the training data.  In the USVM, we must be sure we are adding documents that do not belong to either class.

Also, with a TSVM, we need to know the fraction of (+/-) documents, whereas with the USVM we don’t. The USVM would seem to admit some data from all classes–in principle. In practice, we will do much better if it does not.

Most importantly, unlike the TSVM, the USVM optimization is convex.  So adding in bad data may not cause convergence problems as in the TSVM, and thus may not degrade the model.  At least we hope.

Also, as with the TSVM, we suspect the USVM will only work for small N_{L} ; once N_{L} grows above say 5% of the total documents, we may not see much improvement.

The UniverSVM Algorithm

The SVM approach,  inspired by SLT, implements a data dependent, but distribution independent, regularizer that lets us select the best model \mathcal{H} from a class of equivalent hypotheses \Gamma_{L} .  What other methods select the optimal equivalence class \Gamma_{r} ?

the Principle of Maximum Power

A model \mathcal{H} consists of an equivalence class of hyperplanes, say \mathbf{h}\in H_{\Gamma} .  They all have the same accuracy on the labeled data \mathcal{L} .

Suppose we know a prior distribution P(\omega) on the set of all possible hyperplanes.  Then we can define the power p* of the optimal model as

p*=\int_{H_{\Gamma}*}d\mathbf{h}P(\mathbf{h})

We rarely can define P(\mathbf{h}) mathematically…but we can approximate it.

Let us sample P(\mathbf{h}) by selecting N_{U} unlabeled example documents; we call this set the Universum (\mathcal{U} ).

We call the practical method the UniverSVM.  The key insight of the UniverSVM is that while we can not compute this integral, we can estimate it by measuring the number of contradictions the class of hyperplanes generates on points in \mathcal{U} .  

The Principle of Maximum Contradictions:

We need a way to select the equivalence class \Gamma with the maximum number of contradictions on \mathcal{U} .  As usual, we create a regularizer.

We augment the standard SVM optimization problem with a Universum regularizer \Vert U(f(x_u))\Vert

\frac{1}{2}\Vert\omega\Vert_{2}^{2}+C_{l}\sum_{l\in\mathcal{L}}H(y_{l},f(x_{l},b))+C_u\Vert U(f(x_u))\Vert

where H is the standard SVM Hinge Loss.

The regularizer \Vert U(f(x_u))\Vert can also be defined through a symmetric Hinge loss, making the problem convex.  The UniverSVM code also contains a non-convex variant using a ramp-loss approach.  We leave the details to the academic papers.
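For intuition only, here is a numpy sketch of such an objective, using an eps-insensitive penalty that pushes Universum points toward f(x)=0 (this is a simplification of what the UniverSVM solver actually implements; names are illustrative):

import numpy as np

def usvm_objective(w, X_l, y_l, X_u, C_l=1.0, C_u=1.0, eps=0.1):
    # standard hinge loss on the labeled examples
    hinge = np.maximum(0.0, 1.0 - y_l * (X_l @ w)).sum()
    # Universum points should sit near f(x) = 0, where every hyperplane
    # in the equivalence class produces contradictions
    univ = np.maximum(0.0, np.abs(X_u @ w) - eps).sum()
    return 0.5 * (w @ w) + C_l * hinge + C_u * univ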

Laplacian SVMs and the The Maximum Volume Principle

About the same time Vapnik introduced the UniverSVM, Niyogi (at the University of Chicago) introduced both the TSVM SvmLin code, as well as a new approach to Semi-Supervised learning, the Laplacian SVM (or LapSVM).

The TSVM maximizes the margin for the labeled and unlabeled data.  The USVM maximizes the contradictions on the unlabeled (UniverSVM) set.  In fact, both approximate a more general form of capacity–the maximum volume between the convex hulls of the data sets.  And this can be approximated using Laplacian regularization!  Let’s see how this all ties together.

The LapSVM optimization is

\frac{1}{2}\Vert\omega\Vert_{2}^{2}+C_{l}\sum_{l\in\mathcal{L}}H(y_{l},f(x_{l},b))+C_{u}\Vert\mathcal{L}\Vert

We see it is very similar to the USVM, but with a different regularizer–the norm of the Graph Laplacian \Vert\mathcal{L}\Vert .  There are several \mathcal{L} to choose from, but LapSVM uses the one corresponding to the Laplace-Beltrami operator on the data manifold.

The LapSVM has recently been applied to text classification, although we don’t have a C or Python version of the code to test like we do with SvmLin and the UniverSVM.

LapSVMs and related Manifold Learning approaches have motivated some very recent advances in Semi-Supervised Deep Learning methods, such as the Manifold Tangent Classifier.  This classifier learns the underlying manifold using a contractive auto-encoder, rather than using a simple Laplacian–and this seems to work very well for images.

(We note that the popular SciKit Learn package includes Manifold Learning, but these are only the Unsupervised variants.)

We can relate the LapSVM approach to the VC-SLT, through the Principle of Maximum Volume:

the Norm of the Graph Laplacian is measure of the VC Entropy

Let us again assume we have to select the best equivalence class \Gamma^{*} on our labeled data \mathcal{L} from a set of hyperplanes \mathbf{h}\in H_{\Gamma^{*}} .  Lacking a specific prior, we can assume a uniform distribution P(\mathbf{h})=1 .

We then simply need to approximate the volume

V_{r}(\mathbf{h})=\int_{\Gamma_{r}}d\mathbf{h}

We need a way to compute V, so we assume there exists an operator \mathbf{Q} such that

V_{r}(\mathbf{h})\equiv\dfrac{\mathbf{h^{\dagger}Qh}}{\mathbf{h^{\dagger}h}}

To this end, Vapnik et al. introduce “a family of transductive algorithms which implement the maximum volume principle”, called Approximate Volume Regularization (AVR) algorithm(s).  They take the form

\min Loss_{l}(\mathbf{h})+C_{u}\dfrac{\mathbf{h^{\dagger}Qh}}{\mathbf{h^{\dagger}h}}

For many problems, \mathbf{Q} can just be the Graph Laplacian

\mathbf{Q}=\mathcal{L} ,

with the specific form specified by the particular AVR.

If we write the general form of \mathcal{L} as

\mathcal{L}=I-D^{-\frac{1}{2}}WD^{-\frac{1}{2}}

W can be defined using the Gaussian Similarity

W_{i,j}=\exp\left(\dfrac{-\Vert x_{i}-x_{j}\Vert^{2}}{2\sigma^{2}}\right)

which has a single adjustable width parameter \sigma .

A more robust Laplacian uses the Local-Scaling Similarity

W_{i,j}=\exp\left(\dfrac{-\Vert x_{i}-x_{j}\Vert^{2}}{2\sigma_{i}\sigma_{j}}\right)

which has N adjustable parameters \sigma_{i} .
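A numpy sketch of this Laplacian, taking sigma_i as the distance to the k-th nearest neighbor (a common local-scaling choice; the value of k is illustrative):

import numpy as np
from scipy.spatial.distance import cdist

def local_scaling_laplacian(X, k=7):
    D2 = cdist(X, X, "sqeuclidean")                  # pairwise ||x_i - x_j||^2
    sigma = np.sqrt(np.sort(D2, axis=1)[:, k])       # sigma_i = dist to k-th nbr
    W = np.exp(-D2 / (2.0 * np.outer(sigma, sigma))) # local-scaling similarity
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # L = I - D^{-1/2} W D^{-1/2}
    return np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]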

the Maximum Volume Principle may be a better measure of VC capacity

This Principle of Maximum Volume has been applied in a number of ways, such as in recent papers on Maximum Volume Clustering, and this one for Outlier Detection.

Most importantly, a very recent paper proposes a MultiClass Approximate volume Regularization Algorithm (MAVR).

We have now connected two seemingly different approaches to machine learning–traditional VC theory and operator-theoretic manifold learning.  In a future post, we will look at some practical examples and compare how well the available open source TSVM and USVM codes perform on text data.

Further Reading

Practical Conditions for Effectiveness of the Universum Learning

Practical Analysis of the Universum SVM Learning

Advances in Universum Learning

Thesis:  Analysis and extensions of Universum learning (2014)

Machine Learning with Missing Labels: Transductive SVMs

SVMs are great for building text classifiers–if you have a set of very high quality, labeled documents.

Frequently, we just don’t have enough labelled data!  What can we do?

  • We can crawl the web for more labelled data. For this purpose, here at Calculation Consulting we built a production quality crawler
  • Pay Mechanical Turks to label the documents.  This is actually harder than it sounds.
  • We can use what labels we have, and try to guess the labels of the unlabeled documents.

Guessing labels is called Transductive Learning and is the topic of this post.

SVMs and Self-Training

We have a huge collection of documents, perhaps with just a few labels.  The obvious, or perhaps naive, thing to do is build a classifier using what we have, and then try to scale it out incrementally by self-training.  That is, we select documents that are labeled with high confidence by the smaller model and use them to build a training data set for a larger SVM model:


So this seems pretty simple–what could go wrong?  Since the initial model has very little labeled data, it is going to make lots of mistakes, even in the high confidence regime.  We are adding in some data that is mis-labeled.  As we incrementally self-train, we amplify these errors, thereby reducing overall accuracy.  So what can be done?
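To make the failure mode concrete, here is a minimal scikit-learn sketch of naive self-training (our own illustration, not production code; the confidence threshold and number of rounds are arbitrary choices, and binary 0/1 labels are assumed):

import numpy as np
from sklearn.svm import LinearSVC

def self_train(X_lab, y_lab, X_unlab, threshold=1.0, rounds=5):
    # y_lab assumed to be 0/1; threshold defines the 'high confidence' regime
    X, y = X_lab, y_lab
    clf = LinearSVC().fit(X, y)
    for _ in range(rounds):
        scores = clf.decision_function(X_unlab)
        conf = np.abs(scores) > threshold
        if not conf.any():
            break
        # add the self-labeled documents to the training set; any mistakes
        # made here are amplified on the next round
        X = np.vstack([X, X_unlab[conf]])
        y = np.concatenate([y, (scores[conf] > 0).astype(int)])
        X_unlab = X_unlab[~conf]
        clf = LinearSVC().fit(X, y)
    return clf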

Transductive Learning

Suppose that, as we add the new data in, instead of just naively using our smaller model to label the new data, we could somehow switch the labels on these new documents automatically, and then pick the best model from all these possibilities.  That is, we run an SVM where we retain the labels on the labeled (blue) documents, but we let the labels on the test (purple) documents float.  We constrain the problem so that a fraction r of the unlabeled documents are labeled positive.  We then pick the model with the largest margin. This leads to the following non-convex optimization problem for the Transductive SVM (TSVM)

\min\dfrac{\lambda}{2}\|w\|^{2}+\dfrac{1}{2L}\sum\limits_{i=1}^{L}l(y_{i},w^{T}x_{i})+\dfrac{\lambda'}{2U}\sum\limits_{j=1}^{U}l(y_{j},w^{T}x_{j})

subject to the constraint \dfrac{1}{U}\sum\limits_{j=1}^{U}\max[0,\mathrm{sign}(w^{T}x_{j})]=r

where l is the loss function (such as the SVM hinge loss or, as here, the L2-SVM loss).  There are L labeled documents x_{i} and U unlabeled documents x_{j} , and \lambda,\lambda' are adjustable parameters.  The constraint fixes the fraction r of positively labeled documents; this is the constraint I worry about the most, because it is the one I use to tune the TSVM parameters–but let’s just assume we can estimate it and move forward.
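For concreteness, here is a small numpy sketch of this objective with the L2-SVM loss l(y,t)=\max(0,1-yt)^{2} (our own illustration, not the SvmLin code):

import numpy as np

def l2_svm_loss(y, t):
    # L2-SVM loss: max(0, 1 - y*t)^2, with y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * t) ** 2

def tsvm_objective(w, X_lab, y_lab, X_unlab, y_guess, lam, lam_u):
    # the TSVM objective above, for a fixed guess y_guess of the test labels
    L, U = len(y_lab), len(y_guess)
    reg = 0.5 * lam * (w @ w)
    labeled = l2_svm_loss(y_lab, X_lab @ w).sum() / (2.0 * L)
    unlabeled = lam_u * l2_svm_loss(y_guess, X_unlab @ w).sum() / (2.0 * U)
    return reg + labeled + unlabeled

def positive_fraction(w, X_unlab):
    # left-hand side of the balance constraint; the TSVM fixes this to r
    return np.mean((X_unlab @ w) > 0)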

Why does Adding Test Data to the Training Set Improve the Accuracy?

When we don’t have enough training data, the SVM optimization problem is still convex, but the optimal solution may be spurious.  Imagine that we are trying to train a binary text classifier with only say 100 documents in each class.  The classifier will learn how to separate the documents, but it will not be able to generalize to new documents because it simply has not seen enough examples, or really enough words, to build a usable feature space.

Indeed, there may exist many margins that separate the data, and the largest may not be the one that generalizes best.  That is, without enough training data, the SVM may mis-learn and think that noise words are driving the classification.

What we want is a classifier that generalizes to unseen test documents; 

if we don’t have test labels, perhaps we can guess them.

So we add unlabeled documents, or test data (circles), to the classifier training data.  If we can guess the labels of the circles correctly, we can find the optimal solution because it is less ambiguous.


Notice that since we are seeking a maximum margin solution, we are implicitly assuming that the documents form clusters in feature space, because the margin lives in the low density region between document clusters.  We are also increasing the size of the feature space, thereby greatly improving the generalizability of the model.

TSVM Algorithm: Deterministic Annealing

Unlike a standard (inductive) SVM, the TSVM optimization problem is highly non-convex and quite difficult to solve exactly.  Clearly we can not try every labeling of the new documents; if we add in N documents, then we have 2^{N} label combinations to evaluate.  So how do we solve it?

There are two well known approaches to solving the TSVM, both of which are readily available in the open source package SvmLin:

  • Label Switching Heuristics (-A 2), and
  • Deterministic Annealing (-A 3)

We will discuss the Deterministic Annealing (DA) algorithm because it is a very nice example of how to combine the ideas of SVMs with the techniques of Statistical Mechanics discussed in our last post.   Other, more modern Transductive learning algos may use alternate methods such as convex-concave programming (CCCP, available in Universvm [11]), and even quasi-Newton gradient descent (QN-S3VM) [10].

In the original paper [3], TSVM  refers to option  -A 2, and DA to -A 3.  Here, we use the term TSVM to mean any implementation (TSVM, DA, CCCP, QN-S3VM).   Because we are applying the TSVM to text, however, we do not advocate using label propagation or other graph based methods.

Combining SVMs with Statistical Mechanics

We previously discussed basic Statistical Mechanics and Simulated Annealing, where we derived the concept of Entropy.

To solve the TSVM / DA optimization problem, we need to relax the constraint that the labels be integers and, instead, let them represent the probability of being in the (+) or (-) class.  We then introduce a temperature T .   First, let’s get some labels that range from 0 to 1:

\mu_{j}=\dfrac{1+y_{j}}{2} so that

\mu_{j}=1 when y_{j}=1 ,  and

\mu_{j}=0 when y_{j}=-1

This lets us re-write the TSVM optimization problem as

\min\dfrac{\lambda}{2}\|w\|^{2}+\dfrac{1}{2L}\sum\limits_{i=1}^{L}l(y_{i},w^{T}x_{i})+\dfrac{\lambda'}{2U}\sum\limits_{j=1}^{U}\left(\mu_{j}l_{2}(w^{T}x_{j})+(1-\mu_{j})l_{2}(-w^{T}x_{j})\right)

where we write l_{2} to specifically indicate that the loss function is the L2-SVM loss; the two terms weight the (+) and (-) labelings of each unlabeled instance.

We now take a trick from the Ising model in Statistical Mechanics.  We want to transform the labels

\mu_{j}\rightarrow p_{j}

into probabilities that range from 0 to 1:  p_{j}\in[0,1]

and are normalized so that the expected fraction of positively labeled instances is r : \dfrac{1}{U}\sum_{j=1}^{U}p_{j}=r

How do we incorporate probabilities into our SVM optimization problem?  We are already selecting the maximum margin of the training data; we now need to also select the optimal probabilities of the labels for the unlabeled data.

We extend the TSVM max-margin problem, adding a term that maximizes the total Entropy  S of the possible configurations of the unlabeled test data

S=-\sum_{j=1}^{U}\left[p_{j}\ln(p_{j})+(1-p_{j})\ln(1-p_{j})\right]

Since the overall problem is a minimization (we minimize \|w\|^{2} ), we actually minimize the negative Entropy -S .  This leads to a new TSVM / DA optimization problem

\min\dfrac{\lambda}{2}\|w\|^{2}+\dfrac{1}{2L}\sum\limits_{i=1}^{L}l(y_{i},w^{T}x_{i})+\dfrac{\lambda'}{2U}\sum\limits_{j=1}^{U}\left(\mu_{j}l_{2}(w^{T}x_{j})+(1-\mu_{j})l_{2}(-w^{T}x_{j})\right)+\dfrac{T}{2U}\sum\limits_{j=1}^{U}\left[p_{j}\ln(p_{j})+(1-p_{j})\ln(1-p_{j})\right]

where the additional parameter T is a Lagrange multiplier for the Entropy S .  We also have to add in the normalization (and class balancing) constraint

\dfrac{1}{U}\sum_{j=1}^{U}p_{j}=r

Handling this constraint is a bit more complicated, and we refer to [3];  we do note that the constraint can be re-written into what looks like a Fermi energy, which is quite interesting.
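To sketch what that means (our own paraphrase of the step in [3], not their exact algebra): with the margins held fixed, optimizing over the p_{j} subject to the balance constraint gives probabilities of the Fermi-Dirac form p_{j}=1/(1+e^{(o_{j}-\nu)/T}) , where o_{j} stands for a per-instance loss difference and the ‘Fermi level’ \nu is the Lagrange multiplier, which can be found by bisection so that the mean of the p_{j} equals r :

import numpy as np

def fermi_probabilities(o, T, r, tol=1e-8):
    # o : per-instance loss differences (an assumption on our part; see [3])
    # find nu by bisection so that mean(p) = r; mean(p) increases with nu
    lo, hi = o.min() - 20.0 * T, o.max() + 20.0 * T
    while hi - lo > tol:
        nu = 0.5 * (lo + hi)
        p = 1.0 / (1.0 + np.exp((o - nu) / T))
        if p.mean() < r:
            lo = nu        # raise the Fermi level to label more positives
        else:
            hi = nu
    return 1.0 / (1.0 + np.exp((o - nu) / T))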

We maximize the Margin on the training (and test) sets, 

and minimize the Entropy of the guessed labels on the test set.

(More precisely: at high temperature T the Entropy term dominates and smooths out the non-convexity; as T is annealed toward zero, the label probabilities harden and the Entropy is driven down to its minimum.)

Solving this optimization problem requires alternating between solving the labeled SVM optimization problem (maximizing the margin w ) and finding the optimal probabilities for the unlabeled instances (optimizing p_{j} ).  The results are labels (with confidences) for our previously unlabeled data and, if we want to use it, a new, inductive SVM model.
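At a high level, the alternation looks roughly like this (a sketch under our own naming, not SvmLin’s actual code; fit_weighted_svm and loss_difference are hypothetical helpers standing in for the weighted SVM solve and the loss-difference computation of [3], and fermi_probabilities is from the sketch above):

import numpy as np

def deterministic_annealing(X_lab, y_lab, X_unlab, r,
                            T0=10.0, cooling=0.5, T_min=1e-2):
    p = np.full(len(X_unlab), r)    # start the label probabilities at the prior
    T = T0
    while T > T_min:
        # fit w on the labeled data plus the softly labeled test data
        w = fit_weighted_svm(X_lab, y_lab, X_unlab, p)   # hypothetical helper
        # update the label probabilities at fixed w and temperature T
        o = loss_difference(w, X_unlab)                  # hypothetical helper
        p = fermi_probabilities(o, T, r)
        T *= cooling    # anneal: cool the system toward hard labels
    return w, p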

We also need to be careful not to overtrain.  So I will select out the newly labeled data with the highest confidence labels and then use this as training data for the production SVM (i.e. LibLinear).  This allows the data scientist to take a label set of, say, 100 documents per class, and automatically extend it to, say, 10,000 labeled documents per class without the use of mechanical Turks or crawling the web.

Despite the wonderful advantages promised by Transductive (and Semi-Supervised) Learning, it is quite hard to apply in practice and has not achieved even close to the popularity of its Supervised and Unsupervised counterparts.  TSVMs need both proper tuning [8] and well designed data sets.  We need to understand its practical implementation much better to use it.

Practical Issues: Selecting the best labeled data

An initial training data set may be naively separable if it is small and has a large number of features:

naively separable:

N_{instances}\ll N_{features}

As a data set grows, we may have far more instances than features:

potentially unseparable:

N_{instances}\gg N_{features}

which makes it far more probable that some instances will lie in the slack region between classes.
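A quick synthetic sanity check of this point (our own toy example, with random Gaussian features and random labels; the sizes and C value are illustrative choices):

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

def training_accuracy(n_instances, n_features):
    X = rng.randn(n_instances, n_features)
    y = rng.randint(0, 2, n_instances)    # random labels: no real signal
    return LinearSVC(C=100.0, max_iter=10000).fit(X, y).score(X, y)

print(training_accuracy(100, 10000))   # typically ~1.0: naively separable
print(training_accuracy(10000, 100))   # typically well below 1.0: slack region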

We usually run a tool like LibLinear, which implements either a soft-margin SVM or L2-regularized Logistic Regression.  Indeed, in practice it is usually impossible to tell the difference between these 2 models.  The additional slack terms can be thought of as an extended regularizer or as modifying the SVM loss function.  Indeed, even Vapnik himself has argued that it can be very beneficial to run an SVM with a large number of adjustable slack variables (i.e. in master learning and/or his Universum model [5]). The SvmLin implementation of the TSVM is a hard margin (binary, no slack variables) SVM, but it does use the L2-SVM loss function (and not the standard hinge loss).

It is suggested to run the TSVM slowly and repeatedly to relabel documents, and, on each iteration, to only take new labels that have moderately high confidence (and that therefore, hopefully, lie outside the slack region).

Of course, there is always the risk of biasing the data set, and it may be better to augment the selection process by selecting documents that are both high confidence and close to the existing labeled data. For example, in the S4VM approach, they select high confidence documents that are also close to the labeled documents, as determined through a hierarchical clustering method.

In between each stage, it would be useful to have mechanical Turks check the results to ensure the newly labeled data is not bananas.

Practical Issues: Designing multiple 1-vs-1 data sets

Additionally, SvmLin only implements a binary SVM, whereas many practical applications require a multi-class SVM.  More generally, one might try to extend the TSVM to a DA-type algorithm using, say, a Potts model–a classic model in physics that generalizes the Ising model. MultiClass Transductive Learning is a topic of current research.  For now we have to use SvmLin or a related package (Universvm, QN-S3VM) to get practical solutions.

In a normal SVM, balancing the class data is important; in a Transductive SVM, it is absolutely critical.  For example, this recent study on TSVMs discusses the issue.  This can be quite challenging in real world, production environments, and if you want to apply this powerful technique, extra care should be taken to design the TSVM (training + test) sets.

To apply SvmLin, it is necessary to carefully construct a collection of 1-vs-1 binary SVMs;  current TSVMs don’t work for 1-vs-all data sets [4].

For example, suppose we have a 4-class model, but we only have a few labeled instances for each class.  If we try to extend the set of instances for class 1, we need to carefully construct three 1-vs-1 binary TSVMs (1-vs-2, 1-vs-3, and 1-vs-4), and not one large data set for a 1-vs-all binary TSVM.  Each individual binary TSVM classifier should be designed carefully so that each class (+/-) consists of a single kind of document; otherwise the solver will get trapped in a local solution and the TSVM will perform worse.
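A sketch of the bookkeeping (with hypothetical docs and labels lists; this only prepares the data sets, it is not SvmLin itself):

def one_vs_one_sets(docs, labels, target=1, classes=(1, 2, 3, 4)):
    # build one careful (+/-) pair per opposing class, instead of a single
    # 1-vs-all set; each side should be one coherent kind of document
    sets = {}
    for other in classes:
        if other == target:
            continue
        pos = [d for d, y in zip(docs, labels) if y == target]
        neg = [d for d, y in zip(docs, labels) if y == other]
        sets[(target, other)] = (pos, neg)
    return sets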


Furthermore, the fraction r is a normalization constraint, and keeping it fixed means the negative set must be carefully constructed to ensure that the final result has r=1/2 positively labeled instances that represent the actual positive class well.  This is not easy to ensure, and it may be necessary to sample the unlabeled data using mechanical Turks or more advanced techniques to estimate the true fraction r in the unlabeled data.
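One hedged way to do this: hand-label (e.g., via mechanical Turk) a small random sample of the unlabeled pool and use the binomial estimate of r ; here hand_label is a hypothetical callback and the sample size is an illustrative choice:

import numpy as np

def estimate_r(unlabeled_docs, hand_label, n_sample=200, seed=0):
    # hand_label(doc) -> 0 or 1 is a hypothetical (human) labeling function
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(unlabeled_docs), size=n_sample, replace=False)
    y = np.array([hand_label(unlabeled_docs[i]) for i in idx])
    r_hat = y.mean()
    stderr = np.sqrt(r_hat * (1.0 - r_hat) / n_sample)   # binomial std error
    return r_hat, stderr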

The TSVM can help add labeled data to a well designed model; it can not be used to repair a poorly designed classification model.

Consider a binary SVM trained on a small training set of cat and dog videos.  We then apply the model to some unseen test videos, and select a large number of videos that were labeled either cats or dogs.  The model will certainly make mistakes, but we can hope that the errors are balanced across both classes.  So when we apply the TSVM to the entire training + test set, half of the data will still actually be dog videos.


On the other hand, let’s consider a poorly designed classifier.  Here, we have 3 classes:  dogs, cats, and animals.  The third class is trained on animal videos, and may contain dogs, cats, horses, mice, etc.  This is a bad design, and we would see this, say, in the confusion matrix for the entire model.  Even if we create three 1-vs-1 TSVMs, we are not going to get anywhere.  When we create the increment set to train the TSVM, we have no idea what fraction of the training + test set are actually dog videos.  So we can not really fix the normalization constraint r , and the resulting TSVM model may behave screwy.  We would need a way to estimate r .

Furthermore, the animals category will likely consist of a number of document clusters (dog, cat, horse, mouse, ferret, etc.), and the TSVM solver will not converge properly for, say, the dog/animals binary TSVM.  The multiple clusters may confuse the solver (although this may be an artifact of the DA algorithm getting trapped in local minima, and not an inherent problem with the theory).


Ideally, each (+/-) class must effectively capture a single, simple text cluster.

Epilogue

Transductive Learning was first proposed more than thirty years ago by Vapnik and Chervonenkis [6], and it is a critical idea in the development of the famous VC Theory of Statistical Learning [1].  In a future post, we will look in detail at the role of Transductive Inference in VC theory and at some other formulations, such as the UniverSVM and Instance Weighted SVMs [6].

Practically, there are only a few useful Transductive and Semi-Supervised ML algos available.  Also note:

  • The famous scikit-learn package has one semi-supervised method, a Label Propagation algorithm, but this is not suitable for text.
  • It appears the enterprise tool RapidMiner does implement a commercial version of a TSVM, although I personally have not tried it.

In our next post,  we will run some experiments with the SvmLin code and see how well it works on some real world datasets.

At some point in the future, we will discuss [12] and what (we think) is going on at Google Deep Mind.

To get involved, go to    https://github.com/CalculatedContent/tsvm

References

[1]  Vladimir Vapnik. Statistical learning theory. Wiley, 1998. ISBN 978-0-471-03003-4.

[2] T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, ICML 1999

[3] V. Sindhwani, S. Sathiya Keerthi, Large Scale Semi-supervised Linear SVMs, 2006

[4] O. Chapelle, B. Scholkopf, A. Zien, Semi-Supervised Learning, 2006

[5]  http://www.cs.princeton.edu/courses/archive/spring13/cos511/handouts/vapnik-slides.pdf

[6] Vapnik, V., & Chervonenkis, A.  The Theory of Pattern Recognition, Moscow 1974

[7]  Ran El-Yaniv, Dmitry Pechyony, Transductive Rademacher Complexity and its Applications 

[8] J. Wang,  X. Shen,  W. Pan On Transductive Support Vector Machines

[9] C. Yu , Transductive Learning of Structural SVMs via Prior Knowledge Constraints

[10] Fabian Gieseke, Antti Airola, Tapio Pahikkala, and Oliver Kramer, Fast and Simple Gradient-Based Optimization for Semi-Supervised Support Vector Machines, Neurocomputing (ICPRAM 2012 Special Issue), Elsevier; Sparse Quasi-Newton Optimization for Semi-Supervised Support Vector Machines, Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods (ICPRAM 2012), pages 45-54, 2012

[11] Ronan Collobert, Fabian Sinz, Jason Weston, Leon Bottou, Large Scale Transductive SVMs, Journal of Machine Learning Research 7 (2006)

[12] Semi-supervised Learning with  Deep Generative Models  at Google Deep Mind