on Cheap Learning: Partition Functions and RBMs

Why does deep and cheap learning work so well?

This is the question posed by a recent article.  Deep Learning seems to require knowing the Partition Function–at least in old fashioned Restricted Boltzmann Machines (RBMs).

Here, I will discuss some aspects of this paper,  in the context of RBMs.


We can use RBMs for unsupervised learning, as a clustering algorithm, for pretraining larger nets, and for generating sample data.   Mostly, however, RBMs are an older, toy model useful for understanding unsupervised Deep Learning.


We define an RBM with an Energy Function


and it’s associated Partition Function

Z(\mathbf{v},\mathbf{h})=\sum\limits_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h}})

The joint probability is then


and the probability of  the visible units is computed by marginalizing over the hidden units


Note we also mean the probability of observing the data X, given the weights W.


The Likelihood is just the log of the probability


We can break this into 2 parts:


 F(\mathbf{v,h}) is just the standard Free Energy


We call F^{c}(\mathbf{v}) the clamped Free Energy


because it is like a Free Energy, but with the visible units clamped to the data X.

The clamped F^{c}(\mathbf{v}) is easy to evaluate in the RBM formalism, whereas F(\mathbf{v,h}) is computationally intractable.

Knowing the Z(\mathbf{v,h}) is ‘like’ knowing the equilibrium distribution function, and methods like RBMs appear to approximate \ln\;Z(\mathbf{v,h}) in some form or another.

RBM Training

Training an RBM proceeds iteratively by approximating the Free Energies at each step, and then updating W with a gradient step


RBMs are usually trained via Contrastive Divergence (CD or PCD).  The Energy function, being quadratic, lets us readily factor Z using a mean field approximation, leading to simple expressions for the conditional probabilities



and the weight update rule

d\mathbf{w}=\langle \mathbf{v}^{T}\mathbf{h} \rangle_{0}-\langle\mathbf{v}^{T}\mathbf{h}\rangle_{\infty}

positive and negative phases

RBM codes may use the terminology of positive and negative phases:

positive\;\;\langle\mathbf{v,h}\rangle^{+}\rightarrow negative\;\;\langle\mathbf{v,h}\rangle^{-}

\langle\mathbf{v,h}\rangle^{+} :  The expectation \langle\mathbf{v,h}\rangle_{0} is evaluated, or clamped, on the data.

\langle\mathbf{v,h}\rangle^{-} :   The expectation \langle\mathbf{v,h}\rangle_{\infty} is to be evaluated on the prior distribution p(\mathbf{X}|\mathbf{W}) .  We also say \langle\dot\rangle_{\infty} is evaluated in the limit of infinite sampling, at the so-called equilibrium distribution.  But we don’t take the infinite limit.

CD approximates \langle\mathbf{v}^{T}\mathbf{h}\rangle_{\infty} –effectively evaluating the (mean field) Free Energy   by running only 1 (or more) steps of Gibbs Sampling.

So we may see

d\mathbf{w}=\langle \mathbf{v}^{T}\mathbf{h} \rangle_{0}-\langle\mathbf{v}^{T}\mathbf{h}\rangle_{1}

or, more generally, and in some code bases, something effectively like

d\mathbf{w}=\langle \mathbf{v}^{T}\mathbf{h} \rangle^{+}-\langle\mathbf{v}^{T}\mathbf{h}\rangle^{-}

Pseudocode is:

Initialize the positive  (\mathbf{v,h})^{+} and negative (\mathbf{v,h})^{-}

Run N iterations of:

Run 1 Step of Gibbs Sampling to get the negative  (\mathbf{v,h})^{-} :

sample the hiddens given the (current) visibles: p(h^{-}_{i}=1|\mathbf{v})

sample the visibles given the hiddens (above): p(v^{-}_{i}=1|\mathbf{h})

Calculate the weight gradient:

dw_{i,j}=\langle \mathbf{v}^{T}\mathbf{h} \rangle^{+}-\langle\mathbf{v}^{T}\mathbf{h}\rangle^{-}

Apply Weight decay or other regularization (optional):


Apply a momentum (optional):


Update the Weights:


Energy Renormalizations

What is Cheap about learning ?  A technical proof in the Appendix notes that

knowing the Partition function Z(\mathbf{v,h}) is not the same as knowing the underlying distribution p(\mathbf{v}) .

This is because the Energy E(\mathbf{v,h})  can be rescaled, or renormalized, in many different ways, without changing Z(\mathbf{v,h}) .

This is a also key idea in Statistical Mechanics.

The Partition function is a generating function; we can write all the macroscopic, observable thermodynamic quantities as partial derivatives of \ln\;Z .  And we can do this without knowing the exact distribution functions or energies–just their renormalized forms.

Of course, our W update rule is a derivative of \ln\;Z(\mathbf{v,h})

The proof is technically straightforward, albeit a bit odd at first.

Matching Z(y) does not imply matching p(y)

Let’s start with the visible units \mathbf{v}= .  Write



We now introduce the hidden units, \mathbf{h} , into the model, so that we have a new, joint probability distribution


and a new, Renormalized , partition function


Where RG means Renormalization Group.  We have already discussed that the general RBM approach resembles the Kadanoff Variational Renormalization Group (VRG) method, circa 1975. This new paper points out a small but important technical oversight made in the ML literature, namely that

having Z^{RG}(\mathbf{v,h})=Z(\mathbf{v}) does not imply \tilde{p}(\mathbf{v})=\sum\limits_{\mathbf{h}}\tilde{p}(\mathbf{v})=p(\mathbf{v})

That is, just because we can estimate the Partition function well does not mean we know the probability distributions.

Why?   Define an arbitrary non-constant function K(\mathbf{v}) and write

\tilde{Z}(\mathbf{v})=\sum\limits_{\mathbf{v}}e^{-[E(\mathbf{v})+K(\mathbf{V})]} .

K is for Kadanoff RG Transform, and ln K is the normalization.

We can now write an joint Energy E(\mathbf{v,h}) with the same Partition function as our RBM E(\mathbf{h}) , but with completely different joint probability distributions.  Let


Notice what we are actually doing.  We use the K matrix to define the RBM joint Energy function.  In RBM theory, we restrict E(\mathbf{v,h}) to a quadratic form, and use variational procedure to learn the weights , thereby learning K.

In a VRG approach, we have the additional constraint that we restrict the form of K to satisfy constraints on it’s partition function, or, really, how the Energy function is normalized.  Hence the name ‘Renormalization.‘   This is similar, in spirit, but not necessarily in form, to how the RBM training regularizes the weights (above).

Write the total, or renormalized, Z as


Expanding the Energy function explicitly, we have


where the Kadanoff normalization factor \tilde{Z} appears now the denominator.

We can can break the double sum into sums over and h


Identify \tilde{Z} in the numerator


which factors out, giving a very simple expression in h


In the technical proof in the paper, the idea is that since h is just a dummy variable, we can replace h with v.  We have to be careful here since this seems to only applies to the case where we have the same number of hidden and visible units–a rare case.  In an earlier post on VRG, I explain more clearly how to construct an RG transform for RBMs.  Still,  the paper is presenting a counterargument for arguments sake, so, following the argument in the paper,  let’s say

\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})} =\sum\limits_{\mathbf{v}}e^{-E(\mathbf{v})}

This is like saying we constrain the Free Energy at each layer to be the same.


This is also another kind of Layer Normalization–a very popular method for modern Deep Learning methods these days.

So, by construction, the renormalized and data Partition functions are identical

Z^{RG}(\mathbf{v,h}) =Z(\mathbf{v})

The goal of Renormalization Group theory is to redefine the Energy function on a difference scale, while retaining the macroscopic observables.

But , and apparently this has been misstated in some ML papers and books, the marginalized probabilities can be different !

To get the marginals, let’s integrate out only the h variables


Looking above, we can write this in terms of and its normalization \tilde{Z}


which implies

\tilde{p}(\mathbf{v})\ne p(\mathbf{v})  

So what is cheap about deep learning ?

RBMs let us represent data using a smaller set of hidden features.  This is, effectively, Variational Renormalization Group algorithm,  in which we approximate the Partition function, at each step in the RBM learning procedure, without having to learn the underlying joining probability distribution.  And this is easier.  Cheaper.

In other words, Deep Learning is not Statistics.  It is more like Statistical Mechanics.  

And the hope is that we can learn from this old scientific field — which is foundational to chemistry and physics — to improve our deep learning models.

Post-Post Comments

Shortly after this paper came out, Comment on “Why does deep and cheap learning work so well?”that the proof in the Appended is indeed wrong–as I suspected and pointed out above.

It is noted that the point of the RG theory is to preserve the Free Energy form one layer to another, and, in VRG, this is expressed as a trace condition on the Transfer operator


where -T(\mathbf{v},\mathbf{h})=E(\mathbf{v})+K(\mathbf{v})+ln\tilde{Z}

It is, however, technically possible to preserve the Free Energy and not preserve the trace condition.  Indeed, because K(\mathbf{y}) is not-constant, thereby violating the trace condition.

From this bloggers perspective, the idea of preserving Free Energy, via either a trace condition, or, say, by layer normalization, is the import point.  And this may mean to only approximately satisfy the trace condition.

In Quantum Chemistry, there is a similar requirement, referred to as a Size-Consistency and/ or Size-Extensivity condition.  And these requirements proven essential to obtaining highly accurate, ab initio solutions of the molecular electronic Schrodinger equation–whether implemented exactly or approximately.

And, I suspect, a similar argument, at least in spirit if not in proof, is at play in Deep Learning.

Please chime in our my YouTube Community Channel

see also: https://m.reddit.com/r/MachineLearning/comments/4zbr2k/what_is_your_opinion_why_is_the_concept_of/)

Tensorflow Reproductions: Big Deep Simple MNIST

I am starting a new project to try and reproduce some core deep learning papers in TensorFlow from some of the big names.

The motivation:   to understand how to build very deep networks and why they do (or don’t) work.

There are several papers that caught my eye, starting with

These papers set the foundation for looking at much larger, deeper networks such as

FractalNet’s are particularly interesting since they suggest that very deep networks do not need student-teacher learning, and, instead, can be self similar.  (which is related to very recent work on the Statistical Physics of Deep Learning, and the Renormalization Group analogy).

IMHO,  it is not enough just to implement the code; the results have to be excellent as well. I am not impressed with the results I have seen so far, and I would like to flush out what is really going on.

Big Deep Simple Nets

The 2010 paper still appears to be 1 of the top 10 results on MNIST:


The idea is simple. They claim to get state-of-the-art accuracy on MNIST using a 5-layer MLP, but running a large number of epochs with just SGD, a decaying learning rate, and an augmented data set.

The key idea is that the augmented data set can provide, in practice, an infinite amount of training data.  And having infinite data means that we never have to worry about overtraining because we have too many adjustable parameters, and therefore any reasonable size network will do the trick if we just run it long enough.

In other words, there is no convolution gap,  no need for early stopping, or really no regularization at all.

This sounds dubious to me, but I wanted to see for myself.  Also, perhaps I am missing some subtle detail.  Did they clip gradients somewhere ?   Is the activation function central ?  Do we need to tune the learning rate decay ?

I  have initial notebooks on github,  and would welcome feedback and contributions, plus ideas for other papers to reproduce.

I am trying to repeat this experiment using Tensorflow and 2 kinds of augmented data sets:

  • InfiMNIST (2006) – provides nearly 1B deformations of MNIST
  • AlignMNIST (2016) – provides 75-150 epochs of deformed MNIST

(and let me say a special personal thanks to Søren Hauberg for providing this recent data set)

I would like to try other methods, such as the Keras Data Augmentation library (see below), or even the recent data generation library coming out of OpenAI.

Current results are up for

The initial results indicate that AlignMNIST is much better that InfiMNIST for this simple MLP, although I still do not see the extremely high, top-10 accuracy reported.

Furthermore, the 5-Layer InfiMNIST actually diverges after ~100 epochs.  So we still need early stopping, even with an infinite amount of data.

It may be interesting try using the Keras ImageDataGenerator class, described in this related blog on “building powerful image classification models using very little data

Also note that the OpenAI group as released a new paper and code for creating data used in generative adversarial networks (GANs).

I will periodically update this blog as new data comes in, and I have the time to implement these newer techniques.

Next, we will check in the log files and discuss the tensorboard results.

Comments, criticisms, and contributions are very welcome.

(chat on gitter )


When Regularization Fails

Another Holiday Blog:  Feedback and Questions are very welcome.

I have a client who is using Naive Bayes pretty successfully, and the subject has come up as to ‘why do we need fancy machine learning methods?’   A typical question I am asked is:

Why do we need Regularization ?

What happens if we just turn the regularizer off ?  

Can we just set it to a very small, default value ?

After all, we can just measure what customers are doing exactly.  Why not just reshow the most popular items, ads, etc to them. Can we just run a Regression?  Or Naive Bayes ?  Or just use the raw empirical estimates?

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

Sure, if you are simply estimating historical frequencies,  n-gram statistics, etc, then a simple method like Naive Bayes is fine. But to predict the future..

the Reason we need Regularization is for Generalization.  

After all, we don’t just want to make the same recommendations over and over.

chuckUnless you are just in love with ad re-targeting (sic).   We want to predict on things we have never seen.

And we need to correct for presentation bias when collecting data on things we have seen.

Most importantly, we need methods that don’t break down unexpectedly.

In this post, we will see why Regularization is subtle and non-trivial even in seemingly simple linear models like the classic Tikhonov Regularization.   And something that is almost never discussed in modern classes.

We demonstrate that ‘strong’  overtraining is accompanied by a Phase Transition–and optimal solutions lies just below this.

The example is actually motivated by something I saw recently in my consulting practice–and it was not obvious at first.


Still,  please realize–we are about to discuss a pathological case hidden in the dark corners of Tikhonov Regularization–a curiosity for mathematicians.  

See this Quora post on SVD in NLP for some additional motivations.

Let’s begin when Vapnik [1], Smale [2], and the other great minds of the field begin:

(Regularized) Linear Regression:

The most basic method in all statistics is linear regression

\mathbf{X}\mathbf{w}=\mathbf{y} .


It is attributed to Gauss, one of the greatest mathematicians ever.

One solution is to assume Gaussian noise and minimize the error.

The solution can also be constructed using the Moore-Penrose PseudoInverse.

If we multiply by  \mathbf{X}^{\dagger} to the left on both sides, we obtain


The formal solution is


which is only valid when we can actually invert the data covariance matrix \mathbf{X}^{\dagger}\mathbf{X} .

We say the problem is well-posed when (\mathbf{X}^{\dagger}\mathbf{X})^{-1}

  1.     exists,  
  2.   is unique, and
  3. is stable. 

Otherwise we say it is singular, or ill-posed [1].

In nearly all practical methods, we would not compute the inverse directly, but use some iterative technique to solve for \mathbf{\hat{w}}. In nearly all practical cases, even when the matrix seems invertible, or even well posed, it may have numerical instabilities, either due to the data or the numerical solver.  Vapnik refers to these as stochastically ill posed problems [4]

Frequently XTX can not be inverted..or should not be inverted..directly.

Tikhonov Regularization

A  trivial solution is to simply ‘regularize’ the matrix \mathbf{X}^{\dagger}\mathbf{X} by adding a small, non-zero value to the diagonal:

(\mathbf{X}^{\dagger}\mathbf{X})^{-1}\rightarrow(\mathbf{X}^{\dagger}\mathbf{X}-\lambda\mathbf{I})^{-1} .

Naively, this seems like it would dampen instabilities and allow for a robust numerical solution.  And it does…in most cases.

If you want to sound like you are doing some fancy math, give it a Russian name; we call this Tikhonov Regularization [4]:


crossvalidationThis is (also) called Ridge Regression in many common packages such as scikit learn.

The parameter \lambda is the regularization parameter values, usually found with Cross Validation (CV).  While a data science best practice, CV can fail !

Grid searching \lambda , we can obtain the infamous L-curve:

Screen Shot 2015-12-27 at 10.02.54 PM

The L-curve is a log-log plot of the the norm of the regularized solution vs. the norm of the residual.  The best \lambda  lies somewhere along the bottom of the L, but we can’t really tell where (even with cross validation).

This challenge of regularization is absent from basic machine learning classes?

The optimal \lambda is actually hard to find.   Cross Validation (CV) only works in ideal cases.  And we need a CV metric, like R^{2} , which also only works in ideal cases.

We might naively imagine \lambda can just be very small–in effect, turning off the regularizer.  This is seen in the curve above, where the solution norm diverges.

The regularization fails in these 2 notable cases, when

  1.  the model errors are correlated,  which fools simple cross validation
  2.  \lambda\rightarrow 0 and the number of features ~ the number of training instances, which leads to a phase transition. 

Today we will examine this curious, spurious behavior in case 2 (and look briefly at 1)

Regimes of Applications


As von Neumann said once, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk”

To understand where the regularization can fail, we need to distinguish between the cases in which Ridge Regression is commonly used.

Say we have N_{f} features, and N_{I} instances.  We distinguish between the High Bias and High Variance regimes.


High Variance: N_{f}\gg N_{I}

This regime is overdetermined, complicated models that are subject to overtraining.   We find multiple solutions which satisfy our constraints.

Most big data machine learning problems lie here.  We might have 100M documents and 100K words.  Or 1M images and simply all pixels as the (base) features.

Regularization lets us pick the best of solution which is the most smooth (L2 norm), or the most sparse (L1 norm), or maybe even the one with the least presentation bias (i.e. using Variance regularization to implement Counterfactural Risk Minimization

High Bias regime: N_{f}\ll N_{I}

This is the Underdetermined regime, and any solution we find, regularized or not, will most likely  generalize poorly.

When we have more features than instances, there is no solution at all (let alone a unique one)


Still, we can pick a regularizer, and the effect is similar to dimensional reduction. Tikhonov Regularization is similar truncated SVD (explained below)

Any solution we find may work, but the predictions will be strongly biased towards the training data.

But it is not only just sampling variability that can lead to poor predictions.

In between these 2 limits is an seemingly harmless case, however, this is really a

Danger Zone:  N_{f}\approx N_{I}

danger-will-robinsonThis is a rare case where the number of features = the number of instances.

This does come up in practical problems, such as classifying a large number of small text phrases.  (Something I have been working on with one of my larger clients)


The number of phrases ~ the number of unique words.

This is a dangerous case … not only because it seems so simple … but because the general theory breaks down.  Fortunately, it is only pathologicial in

The limit of zero regularization

We examine how these different regimes behave for small values of \lambda

\underset{\lambda\rightarrow 0}{\lim}\;(\mathbf{X}^{\dagger}\mathbf{X}-\lambda\mathbf{I})^{-1}\mathbf{X}^{\dagger}\mathbf{y}

Let’s formalize and consider the how the predicted accuracy behaves.

The Setup:  An Analytical Formulation

We analyze our regression under Gaussian noise \mathbf{\xi} .  Let


where \mathbf{\xi} is a Gaussian random variable with unit mean and variance \sigma^{2}


This simple model lets us analyze predicted Accuracy analytically.


\mathbf{\hat{w}} be the optimal Estimate, and

\mathbf{w^{*}} be the Ground Truth, and

\mathbf{\bar{w}}=\mathbb{E}_{\xi}[\mathbf{\hat{w}}]  be expected (mean).

We would like define, and the decompose, the Generalization Accuracy into Bias and Variance contributions.

Second, to derive the Generalization Error, we need to work out how the estimator behaves as a random variable. Frequently, in Machine Learning research, one examines the worst case scenario.  Alternatively, we use the average case.

Define the Estimation Error


Notice that by Generalization Error, we usually we want to know how our model performs on a hold set or test point \mathbf{x} :


In the Appendix, we work out the generalization bounds for the worst case and the average case.  From here on, however, when we refer to Generalization Error, we mean the average case Estimation Error (above).

Bias-Variance Tradeoff

For Ridge Regression, the mean estimator is


and its variation is


We can now decompose the Estimation Error into 2 terms:

\mathbb{E}_{\xi}=\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}+\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}\;\;,  where

\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert is the Bias, and

\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}  is the Variance,

and examine each regime in the limit \lambda\rightarrow 0

The Bias is just the error of the average estimator:


and the Variance is the trace of the Covariance of the estimator


We can examine the limiting behavior of these statistics by looking at the leading terms in a series expansion for each. Consider the Singular Value Decomposition (SVD) of X

 \mathbf{X}=\mathbf{USV} .

Let \{s_{1},s_{2},\cdots,s_{m}\} be the positive singular values, and \{\mathbf{v}_{i}\} be the right singular vectors.

We can write the Bias as


With no regularization, and the number of features exceeds the number of instances, there is a strong bias, determined by the ‘extra’ singular values

\lim\limits_{\lambda\rightarrow 0}\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert=\sum\limits_{N_{i}+1}^{N_{f}}(\mathbf{v}_{i}^{\dagger}\mathbf{w}^{*})^{2}\;\;when\;\;N_{f}\gg N_{I}

Otherwise, the Bias vanishes.

\lim\limits_{\lambda\rightarrow 0}\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert=0\;\;when\;\;N_{f}\le N_{I}

So the Bias appears to only matter in the underdetermined case.

(Actually, this is misleading; in the overdetermined case, we can introduce a bias when tuning the regularization parameter too high)

In contrast, the Variance


\lim\limits_{\lambda\rightarrow 0}\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}=\sum\limits_{1}^{m}s_{i}^{-2}

can diverge if the minimal singular value is small:

\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}\rightarrow\infty \;\;when\;\; s_{0}\sim 0

That is,  if the singular values decrease too fast, the variance can explode.

And this can happen in the danger zone

When N_{f}\sim N_{i} the variance can be infinite! 

Which also means the Central Limit Theorem (CLT) breaks down…at least how we normally use it.

Traditionally, the (classical) CLT says that the infinite sum of i.i.d. random variables \{x_{i}\}  converges to a Gaussian distribution–when the variance of the {x_{i}} is finite.  We can, however, generalize the CLT to show that, when the variance is infinite, the sum converges to a Levy or \alpha -stable, power-law distribution.

These power-law distributions are quite interesting–and arise frequently in chemistry and physics during a phase transition. But first, let’s see the divergent behavior actually arise.

When More is Less:  Singularities

Ryota Tomioka [3] has worked some excellent examples using standard data sets

Screen Shot 2015-12-17 at 4.01.56 PM

[I will work up some ipython notebook examples soon]

The Spambase dataset is a great example.  It has N_{f}=57 features, and N_{i}=4601 instances. We run a Ridge Regression, with N_{i}\in(0,4601] , with \lambda=10^{-6} .

As the data set size N_{f} increases, the accuracy sharps to zero at N_{f}=N_{f}=57 , and then increases sharply again, saturating at $latex N_{f}\sim 10^{3} $.

Many other datasets show this anomalous behavior at N_{f}=N_{f} , as predicted by our analysis above.

Ridge Regression vs Logistic Regression: Getting out of the Hole

Where Ridge Regression fails, Logistic Regression bounces back.  Below we show some examples of RR vs LR in the N_{f}=N_{f} danger zone


Screen Shot 2015-12-17 at 4.40.13 PM

Logistic Regression still has problems, but it no where near as pathological as RR.

The reason for this is subtle.

  • In Regression, the variance goes to infinity
  • In Classification, the norm \Vert\mathbf{\hat{w}}\Vert goes to infinity

Traditional ML Theory:  Isn’t the Variance Bounded?

Traditional theory (very losely) bounds for quantities like the generalization error and (therefore) the variance.

We work out some of examples in the Appendix.

Clearly these bounds don’t work in all cases.  So what do they really tell us?  What’s the point?

These theories gives us some confidence that we can actually apply a learning algorithm–but only when the regularizer is not too small and the noise is not too large.

They essentially try to get us far away from the pathological phase transition that arises when the variance diverges.

Cross Validation and Correlated Errors

In practice, we would never just cross our fingers set \lambda=0.00000001 .

We pick a hold out set and run Generalized Cross Validation (GCV) . Yet this can fail when

  1.  the model errors \xi are not i.i.d. but are strongly correlated
  2.  we don’t know what accuracy metric to use?

Hansen [5, 7] has discussed these issue in detail…and he provides an alternative method, the L-curve, which attempt to balance the size of the regularized solution and the residuals.

Screen Shot 2015-12-27 at 11.16.54 PM

For example, we can sketch the relative errors \dfrac{\Vert w-w(\lambda)\Vert}{\Vert w\Vert_{2}} for the L-curve and GVC method (see 7).  Above we see that the L-curve relative errors are more Gaussian than GCV.

In other words, while most packages provide a small choice of regression metrics; as data scientists, the defaults like R^{2} may not be good enough.  Even using the 2-norm may not represent the errors well. As data scientists, we may need to develop our own metrics to get a good \lambda for a regression.  (or maybe just use an SVM regression).

Hansen claims that,”experiments confirm that whenever GCV finds a good regularization parameter, the corresponding solution is located at the corner of the L-curve.” [7]

This means that the optimal regularized solution lives ‘just below’ the phase transition, where the norm diverges.

Phase Transitions in Regularized Regression


What do we mean by a phase transition?

The simplest thing is to we plot a graph and look a steep cliff or sharp change.

Here, generalization accuracies drops off suddenly as \dfrac{N_{f}}{N_{i}}\rightarrow 1\;\;and\;\;\alpha\rightarrow 0

In a physical phase transition,the fluctuations (i.e. variances and norms) in the system approach infinity.

Imagine watching a pot boil


We see bubbles of all sizes, small and large.  The variation in bubble size is all over the map.  This is characteristic of a phase transition: i.e. water to gas.

When we cook, however, we frequently simmer our foods–we keep the water just below the boiling point.  This is as hot as the water can get before it changes to gas.

Likewise, it appears that in Ridge Regression, we frequently operate at a point just below the phase transition–where the solution norm explodes.  And this is quite interesting.  And, I suspect, may be important generally.

In chemical physics, we need special techniques to treat this regime, such as the Renormalization Group. Amazingly, Deep Learning / Restricted Boltzmann Machines  look very similar to the Variational Renormalization Group method.

In our next post, we will examine this idea further, by looking at the phase diagram of the traditional Hopfield Neural Networ, and the idea of Replica Symmetry Breaking in the Statistical Mechanics of Generalization. Stay tuned !


1. Vapnik and Izmailov, V -Matrix Method of Solving Statistical Inference Problems, JMLR 2015

2. Smale, On the Mathematical Foundations of Learning, 2002

3. https://github.com/ryotat

4. http://ttic.uchicago.edu/~gregory/courses/LargeScaleLearning/lectures/proj_learn1.pdf

5.  The L-curve and its use in the numerical treatment of inverse problems

6.  Discrete Ill-Posed and Rank­ Deficient Problems

7.  The use of the L-curve in the regularization of discrete ill-posed problems, 1993

Appendix:  some math

Bias-Variance Decomposition

\mathbf{y}=\mathbf{X}{w}+\mathbf{\xi} ,

when \mathbb{E}[\mathbf{\xi}]=0 ,

then Cov(\mathbf{\xi})=\sigma^{2}\mathbf{I}



be the data covariance matrix.  And , for convenience, let

\lambda_{n}=\dfrac{\lambda}{n} .

The Expected value of the optimal estimator, assuming Gaussian noise, is given using the Penrose Pseudo-Inverse


The Covariance is


The regularized Covariance matrix arises so frequently that we will assign it a symbol


We can now write down the mean and covariance


[more work needed here]

Average and Worst Case Generalization Bounds

Consider the Generalization Error for a test point \mathbf{x}

\mathbb{E}_{\mathbf{\xi}}[(\mathbf{x^{\dagger}}(\hat{\mathbf{w}}-\mathbf{w^{*}}))^{2}] =\lambda^{2}_{n}[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w}]^{2}+\dfrac{\sigma^{2}_{n}}{n}\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{\Sigma}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w}

We can now obtain some very simple bounds on the Generalization Accuracy (just  like a bona-a-fide ML researcher)

The worst case bounds


The average case bounds

\mathbb{E}_{\mathbf{w^{*}}}[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w}]^{2}\le\alpha^{-1}\Vert\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{x}\Vert^{2} ,

where we assume the ‘covariance*’ is


Discrete Picard Condition

We can use the SVD approach to make stronger statements about our ability to solve the inversion problem

\mathbf{Xw}=\mathbf{y} .

I will sketch an idea called the “Discrete Picard Condition” and provide some intuition

Write \mathbf{X} as a sum over singular vectors

\mathbf{X}=\mathbf{U\Sigma V^{\dagger}}=\sum\limits_{i=1}^{N}\mathbf{u_{i}}\sigma_{i}\mathbf{v_{i}^{\dagger}} .

Introduce an abstract PseudoInverse,  defined by


Express the formal solution as


Imagine the SVD of the PseudoInverse is

\mathbf{X}^{-\dagger}=\mathbf{U^{\dagger}\Sigma^{-1}V}=\sum\limits_{i=1}^{N}\mathbf{u_{i}^{\dagger}}\sigma^{-1}_{i}\mathbf{v_{i}} .

and use it to obtain

\mathbf{w}=\sum\limits_{i=1}^{N}\dfrac{\mathbf{u^{\dagger}_{i}y}}{\sigma_{i}}\mathbf{v_{i}} .

For this expression to be meaningful, the SVD coefficients \mathbf{u^{\dagger}_{i}b} must decay, on average, faster than the corresponding singular values \sigma{i} .

Consequently, the only coefficients that carry any information are larger than the noise level.  The small coefficients have small singular values and are dominated by noise.    As the problem becomes more difficult to solve uniquely, the singular values decay more rapidly.

If we discard the k+1,N singular components,  we obtain a regularized, unsupervised solution called

Truncated SVD

\mathbf{w}=\sum\limits_{i=1}^{k}\dfrac{\mathbf{u^{\dagger}_{i}y}}{\sigma_{i}}\mathbf{v_{i}} .

which is similar in spirit to spectral clustering.


Why Deep Learning Works II: the Renormalization Group

Deep Learning is amazing.  But why is Deep Learning so successful?  Is Deep Learning just old-school Neural Networks on modern hardware?  Is it just that we have so much data now the methods work better?  Is Deep Learning just a really good at finding features. Researchers are working hard to sort this out.

Recently it has been shown that [1]

Unsupervised Deep Learning implements the Kadanoff Real Space Variational Renormalization Group (1975)

This means the success of Deep Learning is intimately related to some very deep and subtle ideas from Theoretical Physics.  In this post we examine this.

Unsupervised Deep Learning: AutoEncoder Flow Map

An AutoEncoder is a Unsupervised Deep Learning algorithm that learns how to represent an complex image or other data structure X .   There are several kinds of AutoEncoders; we care about so-called Neural Encoders–those using Deep Learning techniques to reconstruct the data:


The simplest Neural Encoder is a Restricted Boltzman Machine (RBM).  An RBM is non-linear, recursive, lossy function f(X) that maps the data X from visible nodes {v}  into hidden nodes {h} :


The RBM is learned by selecting the optimal parameters {b_{v}},{c_{h}},{w_{v,h}} that minimize some measure of the reconstruction error (see Training RBMs, below)

\min |\Vert f(X)-X\Vert

RBMs and other Deep Learning algos are formulated using classical Statistical Mechanics.  And that is where it gets interesting!

Multi Scale Feature Learning

In machine learning (ML), we map (visible) data into (hidden) features


The hidden units discover features at a coarser grain level of scale

unsupervised filters

With RBMs, when features are complex, we may stack them into a Deep Belief Network (DBM), so that we can learn at different levels of scale


and leads to multi-scale features in each layer


Deep Belief Networks are a Theory of Unsupervised MultiScale Feature Learning

Fixed Points and Flow Maps

We call f(X)  a flow map

f(X)\rightarrow X

eIf we apply the flow map to the data repeatedly, (we hope) it converges to a fixed point

\lim\limits_{n}f^{n}(f^{n-1}(\cdots(f^{1}(f^{0}(X))))\rightarrow f_{\infty}(X)

Notice that we usually expect to apply the same map each time f^{n}(x)=f(x) , however, for a computational theory we may need more flexibility.

Example: Linear Flow Map

The simplest example of a flow map is the simple linear map

X\rightarrow CX

so that

f(X)\sim CX

where C is a non-negative, low rank matrix

\mid C\mid\,\ll\,\mid X\mid

We have seen this before: this leads to a Convex form of NonNegative Matrix Factorization NMF


Convex NMF applies when we can specify the feature space and where the data naturally clusters.  Here, there are a few instances that are archetypes that define the convex hull of the data.

\{\mathbf{x}_{c}\}\in X\,,c=1,\cdots,\mid C\mid

Amazingly, many clustering problems are provably convex–but that’s a story for another post.

Example: Manifold Learning

Near a fixed point, we commonly approximate the flow map by a linear operator

f_{\infty}(X) \sim\mathbf{L}(X)

This lets us capture the structure of the true data manifold, and is usually described by the low lying eigen-spectra of


In the same spirit,  Semi & Unsupervised Manifold Learning, we model the data using a Laplacian operator \mathbf{L}(\sigma) , usually parameterized by a single scale parameter \sigma .


These methods include Spectral Clustering, Manifold Regularization , Laplacian SVM, etc.  Note that manifold learning methods, like the Manifold Tanget Classifier,  employ Contractive Auto Encoders and use several scale parameters to capture the local structure of the data manifold.

The Renormalization Group

In chemistry and physics, we frequently encounter problems that require a multi-scale description.   We need this for critical points and phase transitions, for natural crashes like earthquakes and avalanches,  for polymers and other macromolecules, for strongly correlated electronic systems, for quantum field theory, and, now, for Deep Learning.

A unifying idea across these systems is the Renormalization Group (RG) Theory.

Renormalization Group Theory is both a conceptual framework on how to think about physics on multiple scales as well as a technical & computational problem solving tool.

kenwilsonKen Wilson won the 1982 Nobel Prize in Physics for the development and application of his Momentum Space RG theory to phase transitions.

We used RG theory to model the recent BitCoin crash as a phase transition.

Wilson invented modern multi-scale modeling; the so-called Wilson Basis was an early form of Wavelets.  Wilson was also a big advocate of using supercomputers for solving problems.  Being a Nobel Laureate, he had great success promoting scientific computing.  It was thanks to him I had access to a Cray Y-MP when I was in high school because he was a professor at my undergrad, The Ohio State University.

 Here is the idea.  Consider a feature map  which transforms the data X to a different, more coarse grain scale


The RG theory requires that the Free Energy F(x) is rescaled, to reflect that

the Free Energy is both Size-Extensive and Scale Invariant near a Critical Point

This is not obvious — but it is essential to both having a conceptual understanding of complex, multi scale phenomena, and it is necessary to obtain very highly accurate numerical calculations.  In fact, being size extensive and/or size consistent is absolutely necessary for highly accurate quantum chemistry calculations of strongly correlated systems.  So it is pretty amazing but perhaps not surprising that this is necessary for large scale deep learning calculations also!

The Fundamental Renormalization Group Equation (RGE)


If we (can) apply the same map, F(x) , repeatedly, we obtain a RG recursion relation, which is the starting point for most analytic work in theoretical physics.   It is usually difficult to obtain an exact solution to the RGE (although it is illuminating when possible [20]).

Many RG formulations both approximate the exact RGE and/or only include relevant variables. To describe a multiscale system, it is essential to distinguish between these relevant and irrelevant variables.

Example: Linear Rescaling

Let’s say the feature map is a simple linear rescaling

\phi(x)=\lambda x

We can obtain a very elegant, approximate RG solution where F(x) obeys a complex (or log-periodic) power law.

\mathcal{F}(x)\sim x^{-(\alpha+i\beta)}

This behavior is thought to characterize Per-Bak style Self-Organized Criticality (SOC), which appears in many natural systems–and perhaps even in the brain itself.   Which leads to the argument that perhaps Deep Learning and Real Learning work so well because they operate like a system just near a phase transition–also known as the Sand Pile Model--operating at a state between order and chaos.

the Kadanoff Variational Renormalization Group (1975)

kadanoffLeo Kadanoff, now at the University of Chicago, invented some of the early ideas in Renormalization Group.  He is most famous for the Real Space formulation of RG, sometimes called the Block Spin approach.  He also developed an alternative approach, called the Variational Renormalization Group (VRG, 1975), which is, remarkably, what Unsupervised RBMs are implementing!

Let’s consider a traditional Neural Network–a Hopfield Associative Memory (HAM).  This is also known as an Ising model or a Spin Glass in statistical physics.

An HAM consists of only visible units; it stores memories explicitly and directly in them:


We specify the Energy — called the Hamiltonian \mathcal{H} — for the nodes.  Note that all the nodes are visible.  We write


The Hopfield model has only single B_{i} and pair-wise J_{i,j} interactions.

A general Hamiltonian might have many-body, multi-scale interactions:


The Partition Function is given as


And the Free Energy is


The idea was to mimic how our neurons were thought to store memories–although perhaps our neurons do not even do this.

Either way, Hopfield Neural Networks have many problems; most notably they may learn spurious patterns that never appeared in the training set. So they are pretty bad memories.

Hinton created the modern RBM to overcome the problems of the Hopfield model.  He used hidden units to represent the features in the data–not to memorize the data examples directly.


An RBM is specified Energy function for both the visible and hidden units


This also defines joint probability of simultaenously observing a configuration of hidden and visible spins


which is learned variationally, by minimizing the reconstruction error…or the cross entropy (KL divergence), plus some regularization (Dropout), using Greedy layer-wise unsupervised training, with the Contrastive Divergence (CD or PCD) algo, …

The specific details of an RBM Energy are not addressed by these general concepts; these details do not affect these arguments–although clearly they matter in practice !

It turns out that

Introducing Hidden Units in a Neural Network is a Scale Renormalization.  

When changing scale, we obtain an Effective Hamiltonian \tilde{\mathcal{H}}  that acts on a the new feature space (i.e the hidden units)


or, in operator form


This Effective Hamiltonian is not specified explicitly, but we know it can take the general form (of a spin funnel, actually)


The RG transform preservers the Free Energy (when properly rescaled):





Critical Trajectories and Renormalized Manifolds

The RG theory provides a way to iteratively update, or renormalize, the system Hamiltonian.  Each time we add a layer of hidden units (h1, h2, …), we have




We imagine that the flow map is attracted to a Critical Trajectory which naturally leads the algorithm to the fixed point.  At each step, when we apply another RG transform, we obtain a new, Renormalized Manifold, each one closer to the optimal data manifold.


Conceptually, the RG flow map is most useful when applied to critical phenomena–physical systems and/or simple models that undergo a phase transition.  And, as importantly, the small changes in the data should ‘wash away’ as noise and not affect the macroscopic / critical phenomena. Many systems–but not all–display this.

Where Hopfield Nets fail to be useful here, RBMs and Deep Learning systems shine.

We now show that these RG transformations are achieved by stacking RBMs and solving the RBM inference problem!

Kadanoff’s Variational Renormalization Group

As in many physics problems, we break the modeling problem into two parts:  one we know how to solve, and one we need to guess.

  1. we know the Hamiltonian at the most fine grained level of scale  \mathcal{H}(v)
  2. we seek the correlation \mathbf{V}(v,h) that couples to the next level scale

The joint Hamiltonian, or Energy function, is then given by


The Correlation V(v,h) is defined so that the partition function \mathcal{Z}  is not changed


This gives us


(Sometimes the Correlation V is called a Transfer Operator T, where V(v,h)=-T(v,h) )

We may now define a renormalized effective Hamilonian \tilde{\mathcal{H}(h)} that acts only on the hidden nodes


so that we may write


Because the partition function does not change, the Exact RGE preserves the Free Energy (up to a scale change, we we subsume into \mathcal{Z})


We generally can not solve the exact RGE–but we can try to minimize this Free Energy difference.

What Kadanoff showed, way back in 1975, is that we can accurately approximate the Exact Renormalization Group Equation by finding a lower bound using this formalism

Deep learning appears to be a real-space variational RG technique, specifically applicable to very complex, inhomogenous systems where the detailed scale transformations have to be learned from the data

RBMs expressed using Variational RG

We will now show how to express RBMs using the VRG formalism and provide some intuition

In an RBM, we simply want to learn the Energy function directly; we don’t specify the Hamiltonian for the visible or hidden units explicitly, like we would in physics.  The RBM Energy is just


We identify the Hamiltonian for the hidden units as the Renormalized Effective Hamiltonian from the VRG theory


RBM Hamiltonians / Marginal Probabilities

To obtain RBM Hamiltonians for just the visible \mathcal{H}^{RBM}(v) or hidden \mathcal{H}^{RBM}(h) nodes, we need to integrate out the other nodes; that is, we need to find the marginal probabilities.

P(v)=\sum\limits_{h}P(v,h)=\dfrac{e^{-\mathcal{H}^{RBM}(v) }}{\mathcal{Z}}=\dfrac{1}{\mathcal{Z}} \sum\limits_{h}e^{-\mathbf{E(v,h)}}



P(h)=\sum\limits_{v}P(v,h)=\dfrac{e^{-\mathcal{H}^{RBM}(h) }}{\mathcal{Z}}=\sum\limits_{v}\dfrac{e^{-\mathbf{E(v,h)}}}{\mathcal{Z}}


Training RBMs

To train an RBM, we apply Contrastive Divergence (CD), or, perhaps today, Persistent Contrastive Divergence (PCD).  We can kindof think of this as slowly approximating


In practice, however, RBM training minimizes the associated Free Energy difference \Delta\mathbf{F} … or something akin to this…to avoid overfitting.

In the “Practical Guide to Training Restricted Boltzmann Machines”, Hinton explains how to train an RBM (circa 2011).  Section 6 addresses “Monitoring the overfitting”

“it is possible to directly monitor the overfitting by comparing the free energies of training data and held out validation data…If the model is not overfitting at all, the average free energy should be about the same on training and validation data”

Other Objective Functions

Modern variants of Real Space VRG are not  “‘forced’ to minimize the global free energy” and have attempted other approaches such as Tensor-SVD Renormalization.  Likeswise, some RBM / DBM approaches do likewise may minimize a different objective.

In some methods, we minimize the KL Divergence; this has a very natural analog in VRG language [1].

Why Deep Learning Works: Lessons from Theoretical Physics

The Renormalization Group Theory provides new insights as to why Deep Learning works so amazingly well.  It is not, however, a complete theory. Rather, it is framework for beginning to understand what is an incredibly powerful, modern, applied tool.  Enjoy!


[1] An exact mapping between the Variational Renormalization Group and Deep Learning, 2014

[2] Variational Approximations for Renormalization Group Transformations, 1976

[3]  A Common Logic to Seeing Cats and Cosmos


[5] A Fundamental Theory to Model the Mind

[6] A Practical Guide to Training Restricted Boltzmann Machines, 2010

[7] On the importance of initialization and momentum in deep learning, 2013

[8] Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014

[9] Ken Wilson, A Scientific Appreciation

[10] http://www-math.unice.fr/~patras/CargeseConference/ACQFT09_JZinnJustin.pdf

[11] Training Products of Experts by Minimizing Contrastive Divergence, 2002

[12] Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient, 2008

[13] http://www.nonlin-processes-geophys.net/3/102/1996/npg-3-102-1996.pdf


[15] Scaling, universality, and renormalization: Three pillars of modern critical phenomena

[16] The Potential Energy of an Autoencoder, 2014

[17]  Contractive Auto-Encoders: Explicit Invariance During Feature Extraction, 2011

[18] Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, 2010

[19] Quora:  What is Renormalization group theory?

[20] Renormalization group and critical localization, 1977