# Improving RBMs with physical chemistry

Restricted Boltzmann Machines (RBMs) are like the H atom of Deep Learning.

They are basically a solved problem, and while of academic interest, not really used in complex modeling problems.  They were, up to 10 years ago, used for pretraining deep supervised nets.  Today, we can train very deep, supervised nets directly.

RBMs are the foundation of unsupervised deep learning–

an unsolved problem.

RBMs appear to outperform Variational Auto Encoders (VAEs) on simple data sets like the Omniglot set–a data set developed for one shot learning, and used in deep learning research.

RBM research continues in areas like semi-supervised learning with deep hybrid architectures, Temperature dependence,  infinitely deep RBMs, etc.

Many of the basic concepts of Deep Learning are found in RBMs.

Sometimes clients ask, “how is Physical Chemistry related to Deep Learning?”

In this post, I am going to discuss a recent advance in RBM theory based on ideas from theoretical condensed matter physics and physical chemistry,

the Extended Mean Field Restricted Boltzmann Machine: EMF_RBM

(see: Training Restricted Boltzmann Machines via the Thouless-Anderson-Palmer Free Energy)

[Along the way, we will encounter several Nobel Laureates, including the physicists David J Thouless (2016) and Philip W. Anderson (1977), and the physical chemist Lars Onsager (1968).]

RBMs are pretty simple, and easily implemented from scratch.  The original EMF_RBM is in Julia; I have ported EMF_RBM to python, in the style of the scikit-learn BernoulliRBM package.

### Show me the code

https://github.com/charlesmartin14/emf-rbm/blob/master/EMF_RBM_Test.ipynb

### Theory

We examined RBMs in the last post on Cheap Learning: Partition Functions and RBMs. I will build upon that here, within the context of statistical mechanics.

RBMs are defined by the Energy function

$E(\mathbf{v},\mathbf{h})=-\mathbf{v}^{T}\mathbf{W}\mathbf{h}-\mathbf{v}^{T}\mathbf{a}-\mathbf{b}^{T}\mathbf{h}$

To train an RBM, we maximize the log likelihood,

$\mathcal{L}=F^{c}(\mathbf{v})-F(\mathbf{v,h})$

the difference of the clamped and (actual) Free Energies, where

$F^{c}(\mathbf{v})=\ln\;\underset{\mathbf{h}}{\sum}\;e^{-E(\mathbf{v},\mathbf{h})}$

and

$F=\ln\;Z(\mathbf{v,h})=\ln\;\underset{\mathbf{v},\mathbf{h}}{\sum}\;e^{-E(\mathbf{v},\mathbf{h})}\;\;,$

The sums range over a space of  $\mathcal{O}({2^{N}})\;\;$,

$\mathbf{v}\in\{0,1\}^{N_v},\;\mathbf{h}\in\{0,1\}^{N_h}$

which is intractable in most cases.

Training an RBM requires computing the log Free Energy; this is hard.
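To see why this is hard, here is a brute-force evaluation of $\ln Z$ for a toy RBM.  This is an illustrative sketch of my own (the function name is mine, not part of any package); the sum has $2^{N_v+N_h}$ terms, which is why it is intractable at realistic sizes:

```python
# Illustrative only: evaluate ln Z for a tiny RBM by brute force.
import itertools
import numpy as np

def log_Z_brute_force(W, a, b):
    """Sum exp(-E(v,h)) over all 2^(Nv+Nh) binary configurations."""
    Nv, Nh = W.shape
    total = 0.0
    for v_bits in itertools.product([0, 1], repeat=Nv):
        v = np.array(v_bits, dtype=float)
        for h_bits in itertools.product([0, 1], repeat=Nh):
            h = np.array(h_bits, dtype=float)
            E = -(v @ W @ h) - v @ a - b @ h
            total += np.exp(-E)
    return np.log(total)

# sanity check: with W = a = b = 0, every configuration has E = 0,
# so Z = 2^(Nv+Nh) and ln Z = (Nv + Nh) ln 2
lnZ = log_Z_brute_force(np.zeros((4, 3)), np.zeros(4), np.zeros(3))
```

Doubling $N_v+N_h$ squares the number of terms, so even $N_v+N_h\approx 60$ is already out of reach.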

When training an RBM,  we

1. take a few ‘steps toward equilibration’, to approximate F
2. take a gradient step, $\dfrac{\partial F}{\partial w_{i,j}}$, to get W, a, b
3. regularize W (i.e. by weight decay)
4. update W, a, b
5. repeat until some stopping criterion is met
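The five steps above can be sketched in a few lines of numpy.  This is an illustrative CD-1-style loop of my own, not the EMF_RBM code; all names are mine:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_sketch(X, n_hidden=8, n_epochs=10, lr=0.05, decay=1e-4, seed=0):
    """Schematic RBM training loop following steps 1-5 above."""
    rng = np.random.default_rng(seed)
    n_vis = X.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    a, b = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(n_epochs):                   # 5. repeat until stopping criterion
        h0 = sigmoid(X @ W + b)                 # 1. a step 'toward equilibration'
        v1 = sigmoid(h0 @ W.T + a)
        h1 = sigmoid(v1 @ W + b)
        dW = (X.T @ h0 - v1.T @ h1) / len(X)    # 2. gradient step dF/dw_ij
        dW -= decay * W                         # 3. regularize W (L2 weight decay)
        W += lr * dW                            # 4. update W, a, b
        a += lr * (X - v1).mean(axis=0)
        b += lr * (h0 - h1).mean(axis=0)
    return W, a, b
```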

We don’t include label information, although a trained RBM can provide features for a down-stream classifier.

#### Extended Mean Field Theory:

The Extended Mean Field (EMF)  RBM is a straightforward application of known statistical mechanics theories.

There are, literally, thousands of papers on spin glasses.

The EMF RBM is a great example of how to operationalize spin glass theory.

#### Mean Field Theory

The Restricted Boltzmann Machine has a very simple Energy function, which makes it very easy to factorize the partition function Z, as explained by Hugo Larochelle, to obtain the conditional probabilities

$p(h_{i}=1|\mathbf{v})=\sigma(b_{i}+\mathbf{W}_{i}\mathbf{v})$

$p(v_{i}=1|\mathbf{h})=\sigma(a_{i}+\mathbf{W}_{i}\mathbf{h})$

The conditional probabilities let us apply Gibbs Sampling, which is simply

• hold $\mathbf{v}$ fixed, sample $p(\mathbf{h}|\mathbf{v})$
• hold $\mathbf{h}$ fixed, sample $p(\mathbf{v}|\mathbf{h})$
• repeat for 1 or more equilibration steps
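A minimal numpy sketch of this block Gibbs sampling (my own illustrative code; W is assumed to have shape (n_visible, n_hidden)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_steps(v, W, a, b, n_steps=3, rng=None):
    """Block Gibbs sampling for a Bernoulli RBM (illustrative sketch)."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_steps):
        # hold v fixed, sample h ~ p(h|v)
        h = (rng.random(W.shape[1]) < sigmoid(b + v @ W)).astype(float)
        # hold h fixed, sample v ~ p(v|h)
        v = (rng.random(W.shape[0]) < sigmoid(a + W @ h)).astype(float)
    return v, h
```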

In statistical mechanics, this is called a mean field theory.  This means that the Free Energy (in $\mathbf{v}$) can be written as a simple linear average over the hidden units

$F^{RBM}=\mathbf{a}^{T}\mathbf{v}+\mathbf{b}^{T}\mathbf{h}+\mathbf{v}^{T}(\mathbf{Wh})$.

where $\mathbf{Wh}$ is the mean field of the hidden units.

At high Temp., for a spin glass, a mean field model seems very sensible because the spins (i.e. activations) become uncorrelated.

$\langle s_{i}\cdots s_{j}\rangle\underset{T\rightarrow\infty}{\rightarrow}\langle s_{i}\rangle\cdots\langle s_{j}\rangle$

This is non-trivial to prove.

Theoreticians use mean field models like the p-spin spherical spin glass to study deep learning because of their simplicity.  Computationally, we frequently need more.

How can we go beyond mean field theory?

#### The Onsager Correction

Onsager was awarded the 1968 Nobel Prize in Chemistry for the development of the Onsager Reciprocal Relations, sometimes called the ‘4th law of Thermodynamics’.

The Onsager relations provide the theory to treat thermodynamic systems that are in a quasi-stationary, local equilibrium.

Onsager was the first to show how to relate the correlations in the fluctuations to the linear response.  And by tying a sequence of quasi-stationary systems together, we can describe an irreversible process…

..like learning.  And this is exactly what we need to train an RBM.

In an RBM, the fluctuations are variations in the hidden and visible nodes.

In a BernoulliRBM, the activations can be 0 or 1, so the fluctuation vectors are

$(\mathbf{h}-\mathbf{0}), (\mathbf{h}-\mathbf{1}) \;\; and\;\; (\mathbf{v}-\mathbf{0}),(\mathbf{v}-\mathbf{1})$

The simplest correction to the mean field Free Energy, at each step in training, is the correlation in these fluctuations:

$\left[(\mathbf{h}-\mathbf{0})^{T}(\mathbf{h}-\mathbf{1})^{T}\mathbf{W}^{2} (\mathbf{v}-\mathbf{0})(\mathbf{v}-\mathbf{1})\right]$

where W is the Energy weight matrix.

The corrections make sense under the stationarity constraints, that the Extended Mean Field RBM Free Energy ($F^{EMF}$) is at a critical point

$\left[\dfrac{dF^{EMF}}{d\mathbf{v}},\dfrac{dF^{EMF}}{d\mathbf{h}}\right]=0$,

That is, small changes in the activations $(\mathbf{v},\mathbf{h})$ do not change the Free Energy.

We will show that we can write

$F^{EMF}=S+\beta\;F^{MF}+\frac{1}{2}\beta^{2}F^{Onsager}+\cdots$

as a Taylor series in $\beta$, the inverse Temperature, where $S$ is the Entropy

$S=\mathbf{v}\;\ln(\mathbf{v})+(\mathbf{1}-\mathbf{v})\;\ln(\mathbf{1}-\mathbf{v})+\mathbf{h}\;\ln(\mathbf{h})+(\mathbf{1}-\mathbf{h})\;\ln(\mathbf{1}-\mathbf{h})$,

$F^{MF}$ is the standard, mean field RBM Free Energy

$F^{MF}=\mathbf{a}^{T}\mathbf{v}+\mathbf{b}^{T}\mathbf{h}+\mathbf{v}^{T}\mathbf{W}\mathbf{h}$,

and $F^{Onsager}$ is the Onsager correction

$F^{Onsager}=\left[(\mathbf{h}-\mathbf{h}^{2})^{T}\mathbf{W}^{2} (\mathbf{v}-\mathbf{v}^{2})\right]$.

Given the expressions for the Free Energy, we must now evaluate it.
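As a concrete check, the three terms can be sketched in numpy directly from the formulas above (my own illustrative code; v and h are mean activations that must lie strictly inside (0,1) for the entropy):

```python
import numpy as np

def emf_free_energy_terms(v, h, W, a, b):
    """Entropy S, mean field F^MF, and the Onsager correction,
    written directly from the formulas above (illustrative sketch)."""
    S = np.sum(v * np.log(v) + (1 - v) * np.log(1 - v)) \
      + np.sum(h * np.log(h) + (1 - h) * np.log(1 - h))
    F_mf = a @ v + b @ h + v @ W @ h
    F_onsager = (v - v**2) @ (W**2) @ (h - h**2)
    return S, F_mf, F_onsager
```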

#### Thouless-Anderson-Palmer TAP Theory

The Taylor series above is a result of the TAP theory — the Thouless-Anderson-Palmer approach developed for spin glasses.

The TAP theory is outlined in the Appendix; here we simply use its results.

#### Temperature Dependence and Weight Decay

Being a series in inverse Temperature $\beta$, the theory applies at low $\beta$, or high Temperature.   For fixed $\beta$, this also corresponds to small weights W.

Specifically, the expansion applies at Temperatures above the glass transition–a concept which I describe in a recent video blog.

Here, to implement the EMF_RBM, we set

$\beta=1\;\;$,

and, instead, apply weight decay to keep the weights W from exploding

$\mathbf{dW}\rightarrow\mathbf{dW}+\delta\Vert\mathbf{W}\Vert$

where $\Vert\mathbf{W}\Vert$ may be an L1 or L2 norm.
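A sketch of how the decay term enters the gradient (illustrative only; the sign convention depends on the update rule, and with $\mathbf{W}\rightarrow\mathbf{W}+\lambda\mathbf{dW}$, the penalty enters with a minus sign):

```python
import numpy as np

def decay_gradient(dW, W, delta=1e-4, norm="L2"):
    """Add weight decay to the gradient (illustrative sketch; with the
    update W <- W + lr*dW, the penalty enters with a minus sign)."""
    if norm == "L1":
        return dW - delta * np.sign(W)   # gradient of delta*|W|_1
    return dW - delta * W                # gradient of (delta/2)*|W|_2^2
```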

Weight Decay acts to keep the Temperature high.

#### RBMs with explicit Temperature

Early RBM computational models were formulated in the language of statistical mechanics (see the Appendix), and so included a Temperature parameter, and were solved using techniques like simulated annealing and the (mean field) TAP equations (described below).

Adding Temperature allowed the system to ‘jump’ out of spurious local minima.  So any usable model required a non-zero Temp, and/or some scheme to avoid local minima that generalized poorly.  (See: Learning Deep Architectures for AI, by Bengio)

These older approaches did not work well then, so Hinton proposed the Contrastive Divergence (CD) algorithm.  Note that researchers struggled for some years to ‘explain’ what optimization problem CD actually solves.

More recent work on Temperature-based RBMs also suggests that higher T solutions perform better, and that

“temperature is an essential parameter controlling the selectivity of the firing neurons in the hidden layer.”

#### Training without Sampling:  Fixed Point equations

Standard RBM training approximates the (unconstrained) Free Energy, F=ln Z, in the mean field approximation, using (one or more steps of) Gibbs Sampling.  This is usually implemented as Contrastive Divergence (CD), or Persistent Contrastive Divergence (PCD).

Using techniques of statistical mechanics, however, it is possible to train an RBM directly, without sampling, by solving a set of deterministic fixed point equations.

Indeed, this approach clarifies how to view an RBM as solving a (deterministic) fixed point equation of the form

$f(\mathbf{X})\rightarrow\mathbf{X}$

Consider each step, at fixed ($\mathbf{W}, \mathbf{a}, \mathbf{b}$), as a Quasi-Stationary system: close to equilibrium, but without needing to evaluate $\ln\;Z(\mathbf{v,h})$ exactly.

We can use the stationary conditions to derive a pair of coupled, non-linear equations

$v_{i}\simeq\sigma\left[a_{i}+\underset{j}{\sum}w_{i,j}h_{j}-(v_{i}-\frac{1}{2})w^{2}_{i,j}(h_{j}-(h_{j})^{2})+\cdots\right]$

$h_{i}\simeq\sigma\left[b_{i}+\underset{j}{\sum}w_{i,j}v_{j}-(h_{i}-\frac{1}{2})w^{2}_{i,j}(v_{j}-(v_{j})^{2})+\cdots\right]$

They extend the standard sigmoid activations $\sigma$ with additional, non-linear, inter-layer interactions.

They differ significantly from (simple) Deep Learning activation functions because the activation for each layer explicitly includes information from the other layer.

This extension couples the mean ($v_{i}-\frac{1}{2}$) and total fluctuations ($v_{j}-(v_{j})^{2}$) between layers.  Higher order correlations could also be included, even to infinite order, using techniques from field theory.

We cannot satisfy both equations simultaneously, but we can satisfy each condition $\left[\dfrac{dF^{EMF}}{d\mathbf{v}},\dfrac{dF^{EMF}}{d\mathbf{h}}\right]=0$ individually, letting us write a set of recursion relations

$v_{i}[t+1]\leftarrow\sigma\left[a_{i}+\underset{j}{\sum}w_{i,j}h_{j}[t]-(v_{i}[t]-\frac{1}{2})w^{2}_{i,j}(h_{j}[t]-(h_{j}[t])^{2})\right]$

$h_{i}[t+1]\leftarrow\sigma\left[b_{i}+\underset{j}{\sum}w_{i,j}v_{j}[t+1]-(h_{i}[t+1]-\frac{1}{2})w^{2}_{i,j}(v_{j}[t+1]-(v_{j}[t+1])^{2})\right]$

These fixed point equations converge to the stationary solution, leading to a local equilibrium.  Like Gibbs Sampling, we only need a few iterations (say t = 3 to 5).  Unlike Sampling, however, the EMF RBM is deterministic.
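The recursion relations can be sketched in a few lines of numpy (my own illustrative code, not the EMF_RBM package itself; W is assumed to have shape (n_visible, n_hidden)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def emf_fixed_point(v, h, W, a, b, n_iter=3):
    """Deterministic EMF iterations of the recursion relations above
    (illustrative sketch; v, h are mean activations in (0,1))."""
    W2 = W ** 2
    for _ in range(n_iter):
        # v update: mean field term plus the Onsager correction
        v = sigmoid(a + W @ h - (v - 0.5) * (W2 @ (h - h**2)))
        # h update: uses the freshly updated v
        h = sigmoid(b + W.T @ v - (h - 0.5) * (W2.T @ (v - v**2)))
    return v, h
```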

#### Implementation

https://github.com/charlesmartin14/emf-rbm/

If there is enough interest, I can do a pull request on sklearn to include it.

The next blog post will demonstrate the python code in action.

### Appendix

#### the Statistical Mechanics of Inference

##### Early work on Hopfield Associative Memories

Most older physicists will remember the Hopfield model.  Interest peaked around 1986, although interesting work continued into the late 90s (when I was a postdoc).

Originally, Boltzmann machines were introduced as a way to avoid spurious local minima while including ‘hidden’ features into Hopfield Associative Memories (HAM).

Hopfield himself was a theoretical chemist, and his simple model HAMs were of  great interest to theoretical chemists and physicists.

Hinton explains Hopfield nets in his on-line lectures on Deep Learning.

The Hopfield Model is a kind of spin glass, which acts like a ‘memory’ that can recognize ‘stored patterns’.  It was originally developed as a quasi-stationary solution of more complex, dynamical models of neuronal firing patterns (see the Wilson-Cowan model).

Early theoretical work on HAMs studied analytic approximations to $\ln\;Z$ to compute their capacity ($p/N$), and their phase diagram.  The capacity is simply the number of patterns $p$ a network of size N can memorize without getting confused.

The Hopfield model was traditionally run at T=0.

Looking at the T=0 line, at extremely low capacity, the system has stable mixed states that correspond to ‘frozen’ memories.  But this is very low capacity, and generally unusable.  Also, when the capacity is too large, $\dfrac{p}{N}\ge 0.138$ (which is really not that large), the system abruptly breaks down completely.

There is a small window of capacity, $\dfrac{p}{N}\lesssim 0.05$, with stable pattern equilibria, dominated by frozen out, spin glass states.  The problem is, for any realistic system, with correlated data, the system is dominated by spurious local minima which look like low energy spin glass states.

So the Hopfield model suggested that

glass states can be useful minima, but

we want to avoid low energy  (spurious) glassy states.

One can try to derive a direct mapping between Hopfield Nets and RBMs (under reasonable assumptions).  Then the RBM capacity is proportional to the number of Hidden nodes.  After that, the analogies stop.

The intuition about RBMs is different since (effectively) they operate at a non-zero Temperature.  Additionally, it is unclear to this blogger if the proper description of deep learning is a mean field spin glass, with many useful local minima, or a strongly correlated system, which may behave very differently, and more like a funneled energy landscape.

##### Thouless-Anderson-Palmer Theory of Spin Glasses

The TAP theory is one of the classic analytic tools used to study spin glasses and even the Hopfield Model.

We will derive the EMF RBM method following the Thouless-Anderson-Palmer (TAP) approach to spin glasses.

On a side note, the TAP method introduces us to 2 more Nobel Laureates:

• David J Thouless, 2016 Nobel Prize in Physics
• Philip W. Anderson, 1977 Nobel Prize in Physics

The TAP theory, published in 1977, presented a formal approach to study the thermodynamics of mean field spin glasses.  In particular, the TAP theory provides an expression for the average spin

$\langle s_{i}\rangle=\tanh(C+\beta\;MeanField-\beta^{2}\;Onsager)$

where $tanh()$ is like an activation function, C is a constant, and the MeanField and Onsager terms are like our terms above.

In 1977, they argued that the TAP approach would hold for all Temperatures (and external fields), although it was only proven some 25 years later, by Talagrand.  It is these relatively new, rigorous approaches that are cited by Deep Learning researchers like LeCun, Chaudhari, etc.  But many of the cited results had been suggested earlier using the TAP approach.  In particular, the structure of the Energy Landscape has been understood by looking at the stationary points of the TAP free energy.

More importantly, the TAP approach can be operationalized, as a new RBM solver.

#### The TAP Equations

We start with the RBM Free Energy $F$.

Introduce an ‘external field’ $\mathbf{q}=\{q_{i}\}$ which couples to the spins, adding a linear term to the Free Energy

$\beta F[\mathbf{q}]=\ln\;\sum\limits_{\mathbf{s}}\exp(-\beta[ E(\mathbf{s})-\mathbf{q}^{T}\mathbf{s}] )$

Physically, $\mathbf{q}$ would be an external magnetic field which drives the system out-of-equilibrium.

As is standard in statistical mechanics, we take the Legendre Transform, in terms of a set of conjugate variables $\mathbf{m}=\{m_{i}\}$.  These are the magnetizations of each spin under the applied field, and describe how the spins behave outside of equilibrium

$-\beta\;\Gamma[\mathbf{m}]=-\beta\;\underset{\mathbf{q}}{\max}\left[F[\mathbf{q}]+\mathbf{m}^{T}\mathbf{q}\right]$

The transform effectively defines a new interaction ensemble $\Gamma[\mathbf{m}]$.  We now set $\mathbf{q}=0$, and note

$\beta\;F=\beta\;F[\mathbf{q}=0]=\beta\;\underset{\mathbf{m}}{\min}\left[\Gamma[\mathbf{m}]\right]=\beta\;\Gamma[\mathbf{m}^{*}]$

Define an interaction Free Energy to describe the interaction ensemble

$A(\beta,\mathbf{m})=\beta\;\Gamma[\mathbf{m}]$

which equals the original Free Energy when $\mathbf{q}=0$.

Note that because we have visible and hidden spins (or nodes), we will identify magnetizations for each

$\mathbf{m}=[\mathbf{m^{v},m^{h}}]$

Now, recall we want to avoid the glassy phase; this means we keep the Temperature high.   Or $\beta$ low.

We form a low order Taylor series expansion in $\beta$ in the new ensemble

$A(\beta,\mathbf{m})=A(0,\mathbf{m})+\beta\dfrac{\partial A(\beta,\mathbf{m})}{\partial\beta}\Big\vert_{\beta=0}+\dfrac{\beta^{2}}{2}\dfrac{\partial^{2} A(\beta,\mathbf{m})}{\partial\beta^{2}}\Big\vert_{\beta=0}+\cdots$

which, at low order in $\beta$, we expect to be reasonably accurate even away from equilibrium, at least at high Temp.

This leads to an order-by-order expansion for the Free Energy $A(\beta,\mathbf{m})$.  The first order ($\mathcal{O}(\beta)$) correction is the mean field term.  The second order term ($\mathcal{O}(\beta^{2})$) is the Onsager correction.

##### Second Order Free Energy

Up to $\mathcal{O}(\beta^{2})$, we have

$A(0,\mathbf{m^{v}},\mathbf{m^{h}})=S(\mathbf{m^{v}})+S(\mathbf{m^{h}})$

$\dfrac{\partial}{\partial\beta}A(\beta,\mathbf{m^{v}},\mathbf{m^{h}})=\mathbf{a^{T}m^{v}}+\mathbf{b^{T}m^{h}}+\mathbf{(m^{v})^{T}Wm^{h}}$

$\dfrac{\partial^{2}}{\partial\beta^{2}}A(\beta,\mathbf{m^{v}},\mathbf{m^{h}})=\frac{1}{2}\mathbf{(m^{v}-(m^{v})^{2})^{T}W^{2}(m^{h}-(m^{h})^{2})}$

or

$A(0,\mathbf{m^{v}})=\underset{i}{\sum}\left[m^{v}_{i}\;\ln\;m^{v}_{i}+(1-m^{v}_{i})\ln(1-m^{v}_{i})\right]$

$A(0,\mathbf{m^{h}})=\underset{i}{\sum}\left[m^{h}_{i}\;\ln\;m^{h}_{i}+(1-m^{h}_{i})\ln(1-m^{h}_{i})\right]$

$\dfrac{\partial}{\partial\beta}A(\beta,\mathbf{m^{v}},\mathbf{m^{h}})=\underset{i}{\sum}a_{i}m^{v}_{i}+\underset{i}{\sum}b_{i}m^{h}_{i}+\underset{i,j}{\sum}m^{v}_{i}w_{i,j}m^{h}_{j}$

$\dfrac{\partial^{2}}{\partial\beta^{2}}A(\beta,\mathbf{m^{v}},\mathbf{m^{h}})=\frac{1}{2}\underset{i,j}{\sum}(m^{v}_{i}-(m^{v})^{2}_{i})w^{2}_{i,j}(m^{h}_{j}-(m^{h})^{2}_{j})$

##### Stationary Condition

Rather than assume equilibrium, we assume that, at each step during inference, at fixed ($\mathbf{W}, \mathbf{a}, \mathbf{b}$), the system satisfies a quasi-stationary condition.  Each step reaches a local saddle point in phase space, s.t.

$\dfrac{dA}{d\mathbf{m}}=\left[\dfrac{dA}{d\mathbf{m^{v}}},\dfrac{dA}{d\mathbf{m^{h}}}\right]=0$

where, for each layer,

$\dfrac{dA}{d\mathbf{m^{v}}}=0\;\;and\;\;\dfrac{dA}{d\mathbf{m^{h}}}=0$

##### Stationary Magnetizations

Applying the stationary conditions lets us write coupled equations for the individual magnetizations that effectively define the (second order), high Temp, quasi-equilibrium states

$m^{v}_{i}\simeq\sigma\left[a_{i}+\underset{j}{\sum}w_{i,j}m^{h}_{j}-(m^{v}_{i}-\frac{1}{2})w^{2}_{i,j}(m^{h}_{j}-(m^{h}_{j})^{2})\right]$

$m^{h}_{i}\simeq\sigma\left[b_{i}+\underset{j}{\sum}w_{i,j}m^{v}_{j}-(m^{h}_{i}-\frac{1}{2})w^{2}_{i,j}(m^{v}_{j}-(m^{v}_{j})^{2})\right]$

Notice that the $m^{v}_{i},m^{h}_{i}$ resemble the RBM conditional probabilities; in fact, at first order in $\beta$, they are the same.

##### Mean Field Theory

At first order in $\beta$, we recover the standard mean field theory, with

$P(h_{i}=1|\mathbf{v})=m^{h}_{i}$

$P(v_{i}=1|\mathbf{h})=m^{v}_{i}$

And in the late 90s, mean field TAP theory was applied, unsuccessfully, in attempts to create an RBM solver.

##### Fixed Point Equations

At second order, the magnetizations $m^{v}_{i},m^{h}_{i}$ are coupled through the Onsager corrections.  To solve them, we can write down the fixed point equations, shown above.

### Higher Order Corrections

We can include higher order corrections to the Free Energy $A$ by including more terms in the Taylor Series.  This is called a Plefka expansion.  The terms can be represented using diagrams.

Plefka derived these terms in 1982, although it appears he only published up to the Onsager correction; a recent paper shows how to obtain all higher order terms.

The Diagrammatic expansion appears not to have been fully worked out, and is only sketched above.

I can think of at least 3 ways to include these higher terms:

• Include all terms at a higher order in $\beta$.  The Sphinx Boltzmann.jl package includes all terms up to $\mathcal{O}(\beta^{3})$.  This changes both the Free Energy and the associated fixed point equations for the magnetizations.
• Resum all diagrams of a certain kind.  This is an old-school renormalization procedure, and it would include an $\infty$ number of terms.  This requires working out the analytic expressions for some class of diagrams, such as all circle (RPA-like) diagrams.  This is similar, in some sense, to the infinite RBM by Larochelle, which uses a resummation trick to include an infinite number of Hidden nodes.
• Use a low order Pade Approximant to improve the Taylor Series.  The idea is to include higher order terms implicitly by expanding the Free Energy as a ratio of polynomial functions P, Q

$A^{P}(\beta,\mathbf{m})=\dfrac{P(\beta,\mathbf{m})}{Q(\beta,\mathbf{m})}$,

where $A(\beta,\mathbf{m})-A^{P}(\beta,\mathbf{m})=0$ to the order of the expansion.

Obviously there are lots of interesting things to try.

### Beyond Binary Spins: Real-Valued RBMs

The current python EMF_RBM only treats binary data, just like the scikit-learn BernoulliRBM.  So for, say, MNIST, we have to use binarized MNIST.  There is some advantage to using Binarized Neural Networks on GPUs.  Still, a non-binary RBM may be useful.  Tramel et. al. have suggested how to use real-valued data in the EMF_RBM, in the context of Compressed Sensing using Generalized Boltzmann Machines.

# MMDS 2016 Video Presentation

Why Deep Learning Works: Perspectives from Theoretical Chemistry:

# On Cheap Learning: Partition Functions and RBMs

This is the question posed by a recent article.  Deep Learning seems to require knowing the Partition Function–at least in old fashioned Restricted Boltzmann Machines (RBMs).

Here, I will discuss some aspects of this paper, in the context of RBMs.
### Preliminaries

We can use RBMs for unsupervised learning, as a clustering algorithm, for pretraining larger nets, and for generating sample data.  Mostly, however, RBMs are an older, toy model useful for understanding unsupervised Deep Learning.

#### RBMs

We define an RBM with an Energy Function

$E(\mathbf{v},\mathbf{h})=-\mathbf{v}^{T}\mathbf{W}\mathbf{h}-\mathbf{v}^{T}\mathbf{a}-\mathbf{b}^{T}\mathbf{h}$

and its associated Partition Function

$Z(\mathbf{v},\mathbf{h})=\sum\limits_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}$

The joint probability is then

$P(\mathbf{v},\mathbf{h})=\dfrac{1}{Z(\mathbf{v},\mathbf{h})}e^{-E(\mathbf{v},\mathbf{h})}$

and the probability of the visible units is computed by marginalizing over the hidden units

$p(\mathbf{v})=\dfrac{1}{Z(\mathbf{v},\mathbf{h})}\sum\limits_{\mathbf{h}}e^{-E(\mathbf{v},\mathbf{h})}$

By this we also mean the probability of observing the data X={v}, given the weights W

$p(\mathbf{v})=p(\mathbf{v}|\mathbf{W,a,b})$

The Likelihood is just the log of the probability

$\mathcal{L}=\ln\;p(\mathbf{v})=\ln\;\sum\limits_{\mathbf{h}}e^{-E(\mathbf{v,h})}-\ln\;Z(\mathbf{v,h})$

We can break this into 2 parts:

$\mathcal{L}=F^{c}(\mathbf{v})-F(\mathbf{v,h})$

$F(\mathbf{v,h})$ is just the standard Free Energy

$F=\ln\;Z(\mathbf{v,h})$

We call $F^{c}(\mathbf{v})$ the clamped Free Energy

$F^{c}(\mathbf{v})=\ln\;\sum\limits_{\mathbf{h}}e^{-E(\mathbf{v,h})}$

because it is like a Free Energy, but with the visible units clamped to the data X.

The clamped $F^{c}(\mathbf{v})$ is easy to evaluate in the RBM formalism, whereas $F(\mathbf{v,h})$ is computationally intractable.

Knowing $Z(\mathbf{v,h})$ is ‘like’ knowing the equilibrium distribution function, and methods like RBMs appear to approximate $\ln\;Z(\mathbf{v,h})$ in some form or another.
#### RBM Training

Training an RBM proceeds iteratively by approximating the Free Energies at each step,

$F(\mathbf{v,h})=\cdots$

and then updating W with a gradient step

$\mathbf{W}\rightarrow\mathbf{W}+\lambda\mathbf{dw}$

RBMs are usually trained via Contrastive Divergence (CD or PCD).

The Energy function, being quadratic, lets us readily factor Z using a mean field approximation, leading to simple expressions for the conditional probabilities

$p(h_{i}=1|\mathbf{v})=\sigma(b_{i}+\mathbf{W}_{i}\mathbf{v})$

$p(v_{i}=1|\mathbf{h})=\sigma(a_{i}+\mathbf{W}_{i}\mathbf{h})$

and the weight update rule

$d\mathbf{w}=\langle \mathbf{v}^{T}\mathbf{h} \rangle_{0}-\langle\mathbf{v}^{T}\mathbf{h}\rangle_{\infty}$

##### positive and negative phases

RBM codes may use the terminology of positive and negative phases:

$positive\;\;\langle\mathbf{v,h}\rangle^{+}\rightarrow negative\;\;\langle\mathbf{v,h}\rangle^{-}$

$\langle\mathbf{v,h}\rangle^{+}$: The expectation $\langle\mathbf{v,h}\rangle_{0}$ is evaluated, or clamped, on the data.

$\langle\mathbf{v,h}\rangle^{-}$: The expectation $\langle\mathbf{v,h}\rangle_{\infty}$ is to be evaluated on the prior distribution $p(\mathbf{X}|\mathbf{W})$.

We also say $\langle\cdot\rangle_{\infty}$ is evaluated in the limit of infinite sampling, at the so-called equilibrium distribution.  But we don’t take the infinite limit.  CD approximates $\langle\mathbf{v}^{T}\mathbf{h}\rangle_{\infty}$–effectively evaluating the (mean field) Free Energy–by running only 1 (or more) steps of Gibbs Sampling.
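A minimal numpy sketch of this CD-1 update for a single data vector (my own illustrative code, not from any particular code base):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradient(v0, W, a, b, rng=None):
    """One CD step: dW = <v h>_0 - <v h>_1, with the negative phase
    from a single Gibbs step (illustrative sketch)."""
    rng = rng or np.random.default_rng(0)
    h0 = sigmoid(b + v0 @ W)                     # positive phase, clamped on data
    h_samp = (rng.random(h0.shape) < h0).astype(float)
    v1 = sigmoid(a + W @ h_samp)                 # one Gibbs step for the
    h1 = sigmoid(b + v1 @ W)                     # negative phase
    return np.outer(v0, h0) - np.outer(v1, h1)
```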
So we may see

$d\mathbf{w}=\langle \mathbf{v}^{T}\mathbf{h} \rangle_{0}-\langle\mathbf{v}^{T}\mathbf{h}\rangle_{1}$

or, more generally, and in some code bases, something effectively like

$d\mathbf{w}=\langle \mathbf{v}^{T}\mathbf{h} \rangle^{+}-\langle\mathbf{v}^{T}\mathbf{h}\rangle^{-}$

#### Pseudocode

Initialize the positive $(\mathbf{v,h})^{+}$ and negative $(\mathbf{v,h})^{-}$, then run N iterations of:

1. Run 1 step of Gibbs Sampling to get the negative $(\mathbf{v,h})^{-}$:
   - sample the hiddens given the (current) visibles: $p(h^{-}_{i}=1|\mathbf{v})$
   - sample the visibles given the hiddens (above): $p(v^{-}_{i}=1|\mathbf{h})$
2. Calculate the weight gradient: $dw_{i,j}=\langle \mathbf{v}^{T}\mathbf{h} \rangle^{+}-\langle\mathbf{v}^{T}\mathbf{h}\rangle^{-}$
3. Apply weight decay or other regularization (optional): $\mathbf{dw}\rightarrow\mathbf{dw}+\delta\Vert\mathbf{W}\Vert$
4. Apply momentum (optional): $\mathbf{dw}\rightarrow\mathbf{dw}+\mu\;\mathbf{dw}_{prev}$
5. Update the weights: $\mathbf{W}\rightarrow\mathbf{W}+\lambda\mathbf{dw}$

### Energy Renormalizations

What is cheap about learning?

A technical proof in the Appendix notes that knowing the Partition function $Z(\mathbf{v,h})$ is not the same as knowing the underlying distribution $p(\mathbf{v})$.  This is because the Energy $E(\mathbf{v,h})$ can be rescaled, or renormalized, in many different ways, without changing $Z(\mathbf{v,h})$.

This is also a key idea in Statistical Mechanics.  The Partition function is a generating function; we can write all the macroscopic, observable thermodynamic quantities as partial derivatives of $\ln\;Z$.  And we can do this without knowing the exact distribution functions or energies–just their renormalized forms.

Of course, our W update rule is a derivative of $\ln\;Z(\mathbf{v,h})$.

The proof is technically straightforward, albeit a bit odd at first.

#### Matching Z(v) does not imply matching p(v)

Let’s start with the visible units $\mathbf{v}$.
Write

$p(\mathbf{v})=\dfrac{1}{Z(\mathbf{v})}e^{-E(\mathbf{v})}$

$Z(\mathbf{v})=\sum\limits_{\mathbf{v}}e^{-E(\mathbf{v})}$

We now introduce the hidden units, $\mathbf{h}$, into the model, so that we have a new, joint probability distribution

$p(\mathbf{v,h})=\dfrac{1}{Z^{RG}(\mathbf{v,h})}e^{-E(\mathbf{v,h})}$

and a new, renormalized, partition function

$Z^{RG}(\mathbf{v,h})=\sum\limits_{\mathbf{v,h}}e^{-E(\mathbf{v,h})}$

where RG means Renormalization Group.  We have already discussed how the general RBM approach resembles the Kadanoff Variational Renormalization Group (VRG) method, circa 1975.

This new paper points out a small but important technical oversight made in the ML literature, namely that having

$Z^{RG}(\mathbf{v,h})=Z(\mathbf{v})$

does not imply

$\tilde{p}(\mathbf{v})=\sum\limits_{\mathbf{h}}\tilde{p}(\mathbf{v,h})=p(\mathbf{v})$

That is, just because we can estimate the Partition function well does not mean we know the probability distributions.

Why?

Define an arbitrary non-constant function $K(\mathbf{v})$, and write

$\tilde{Z}(\mathbf{v})=\sum\limits_{\mathbf{v}}e^{-[E(\mathbf{v})+K(\mathbf{v})]}$

K is for the Kadanoff RG Transform, and $\ln\tilde{Z}$ is the normalization.

We can now write a joint Energy $E(\mathbf{v,h})$ with the same Partition function as our RBM, but with completely different joint probability distributions.  Let

$E(\mathbf{v,h})=E(\mathbf{v})+E(\mathbf{h})+K(\mathbf{v})+\ln\tilde{Z}$

Notice what we are actually doing.  We use K to define the RBM joint Energy function.  In RBM theory, we restrict $E(\mathbf{v,h})$ to a quadratic form, and use a variational procedure to learn the weights, thereby learning K.  In a VRG approach, we have the additional constraint that we restrict the form of K to satisfy constraints on its partition function–or, really, on how the Energy function is normalized.  Hence the name ‘Renormalization.’  This is similar, in spirit, but not necessarily in form, to how RBM training regularizes the weights (above).
Write the total, or renormalized, Z as

$Z^{RG}(\mathbf{v,h})=Z_{tot}(\mathbf{v,h})=\sum\limits_{\mathbf{v,h}}e^{-E(\mathbf{v,h})}$

Expanding the Energy function explicitly, we have

$=\dfrac{1}{\tilde{Z}(\mathbf{v})}\sum\limits_{\mathbf{v,h}}e^{-[E(\mathbf{v})+E(\mathbf{h})+K(\mathbf{v})]}$

where the Kadanoff normalization factor $\tilde{Z}$ now appears in the denominator.  We can break the double sum into sums over v and h

$=\dfrac{1}{\tilde{Z}(\mathbf{v})}\sum\limits_{\mathbf{v}}e^{-[E(\mathbf{v})+K(\mathbf{v})]}\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})}$

Identify $\tilde{Z}$ in the numerator

$=\dfrac{1}{\tilde{Z}(\mathbf{v})}\tilde{Z}(\mathbf{v})\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})}$

which factors out, giving a very simple expression in h

$=\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})}$

In the technical proof in the paper, the idea is that since h is just a dummy variable, we can replace h with v.  We have to be careful here, since this seems to apply only when we have the same number of hidden and visible units–a rare case.  In an earlier post on VRG, I explain more clearly how to construct an RG transform for RBMs.

Still, the paper is presenting a counterargument for argument’s sake, so, following the paper, let’s say

$\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})}=\sum\limits_{\mathbf{v}}e^{-E(\mathbf{v})}$

This is like saying we constrain the Free Energy at each layer to be the same:

$\ln\;\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})}\approx\ln\;\sum\limits_{\mathbf{v}}e^{-E(\mathbf{v})}$

This is also another kind of Layer Normalization–a very popular method in modern Deep Learning.

So, by construction, the renormalized and data Partition functions are identical

$Z^{RG}(\mathbf{v,h})=Z(\mathbf{v})$

The goal of Renormalization Group theory is to redefine the Energy function on a different scale, while retaining the macroscopic observables.
But, and apparently this has been misstated in some ML papers and books, the marginalized probabilities can be different!

To get the marginals, let’s integrate out only the h variables

$\tilde{p}(\mathbf{v})=\dfrac{1}{Z^{RG}(\mathbf{v,h})}\sum\limits_{\mathbf{h}}e^{-E(\mathbf{v,h})}$

Looking above, we can write this in terms of K and its normalization $\tilde{Z}$

$\tilde{p}(\mathbf{v})=\dfrac{1}{\tilde{Z}(\mathbf{v})}e^{-[E(\mathbf{v})+K(\mathbf{v})]}$

which implies

$\tilde{p}(\mathbf{v})\ne p(\mathbf{v})$

#### So what is cheap about deep learning?

RBMs let us represent data using a smaller set of hidden features.  This is, effectively, a Variational Renormalization Group algorithm, in which we approximate the Partition function, at each step of the RBM learning procedure, without having to learn the underlying joint probability distribution.  And this is easier.  Cheaper.

In other words, Deep Learning is not Statistics.  It is more like Statistical Mechanics.  And the hope is that we can learn from this old scientific field–which is foundational to chemistry and physics–to improve our deep learning models.

### Post-Post Comments

Shortly after this paper came out, a Comment on “Why does deep and cheap learning work so well?” appeared, noting that the proof in the Appendix is indeed wrong–as I suspected and pointed out above.

It is noted that the point of the RG theory is to preserve the Free Energy from one layer to another, and, in VRG, this is expressed as a trace condition on the Transfer operator

$\sum\limits_{\mathbf{h}}e^{-T(\mathbf{v},\mathbf{h})}=1$

where

$-T(\mathbf{v},\mathbf{h})=E(\mathbf{v})+K(\mathbf{v})+\ln\tilde{Z}$

It is, however, technically possible to preserve the Free Energy and not preserve the trace condition.  Indeed, here $K(\mathbf{v})$ is non-constant, thereby violating the trace condition.

From this blogger’s perspective, the idea of preserving the Free Energy, via either a trace condition or, say, layer normalization, is the important point.
And this may mean only approximately satisfying the trace condition.

In Quantum Chemistry, there is a similar requirement, referred to as a Size-Consistency and/or Size-Extensivity condition. These requirements have proven essential to obtaining highly accurate, ab initio solutions of the molecular electronic Schrodinger equation–whether implemented exactly or approximately.

And, I suspect, a similar argument, at least in spirit if not in proof, is at play in Deep Learning.

Please chime in on my YouTube Community Channel

see also: https://m.reddit.com/r/MachineLearning/comments/4zbr2k/what_is_your_opinion_why_is_the_concept_of/

# Why Deep Learning Works III: a preview

A preview of my upcoming talk at MMDS this week:

Stay tuned for a video link, and a blog post to accompany this.

Comments, questions, and bug fixes are welcome.

# Tensorflow Reproductions: Big Deep Simple MNIST

I am starting a new project to try and reproduce some core deep learning papers in TensorFlow from some of the big names.

The motivation: to understand how to build very deep networks and why they do (or don't) work.

There are several papers that caught my eye. These papers set the foundation for looking at much larger, deeper networks. FractalNets are particularly interesting since they suggest that very deep networks do not need student-teacher learning, and, instead, can be self similar. (This is related to very recent work on the Statistical Physics of Deep Learning, and the Renormalization Group analogy.)

IMHO, it is not enough just to implement the code; the results have to be excellent as well. I am not impressed with the results I have seen so far, and I would like to flesh out what is really going on.

### Big Deep Simple Nets

The 2010 paper still appears to be one of the top 10 results on MNIST:

http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

The idea is simple.
They claim to get state-of-the-art accuracy on MNIST using a 5-layer MLP, running a large number of epochs with just SGD, a decaying learning rate, and an augmented data set.

The key idea is that the augmented data set can provide, in practice, an infinite amount of training data. And having infinite data means that we never have to worry about overtraining because we have too many adjustable parameters; any reasonably sized network will do the trick if we just run it long enough.

In other words, there is no gap with convolutional nets, no need for early stopping, and really no regularization at all.

This sounds dubious to me, but I wanted to see for myself. Also, perhaps I am missing some subtle detail. Did they clip gradients somewhere? Is the activation function central? Do we need to tune the learning rate decay?

I have initial notebooks on github, and would welcome feedback and contributions, plus ideas for other papers to reproduce.

I am trying to repeat this experiment using Tensorflow and 2 kinds of augmented data sets:

• InfiMNIST (2006) – provides nearly 1B deformations of MNIST
• AlignMNIST (2016) – provides 75-150 epochs of deformed MNIST

(and let me say a special personal thanks to Søren Hauberg for providing this recent data set)

I would like to try other methods, such as the Keras Data Augmentation library (see below), or even the recent data generation library coming out of OpenAI.

Current results are up in the github notebooks. The initial results indicate that AlignMNIST is much better than InfiMNIST for this simple MLP, although I still do not see the extremely high, top-10 accuracy reported.

Furthermore, the 5-Layer InfiMNIST run actually diverges after ~100 epochs. So we still need early stopping, even with an infinite amount of data.
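For intuition, deformation-style augmentation can be sketched with plain numpy; the random-shift function below is a hypothetical, minimal stand-in for the elastic deformations that InfiMNIST and AlignMNIST actually use:

```python
import numpy as np

# Minimal augmentation sketch: random integer shifts of 28x28
# MNIST-style images, using np.roll as a crude stand-in for the
# elastic/affine deformations of real augmentation pipelines.
rng = np.random.default_rng(0)

def augment(images, max_shift=2):
    out = np.empty_like(images)
    for i, img in enumerate(images):
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        out[i] = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return out

batch = rng.random(size=(8, 28, 28))
aug = augment(batch)

# Shifts rearrange pixels without creating or destroying mass
assert aug.shape == batch.shape
assert np.allclose(aug.sum(), batch.sum())
```

Calling `augment` on each epoch's batch yields an effectively unbounded stream of deformed training examples, which is the spirit of the infinite-data argument above.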
It may be interesting to try the Keras ImageDataGenerator class, described in the related blog post on “building powerful image classification models using very little data.”

Also note that the OpenAI group has released a new paper and code for creating data used in generative adversarial networks (GANs).

I will periodically update this blog as new data comes in, and as I have the time to implement these newer techniques. Next, we will check in the log files and discuss the tensorboard results.

Comments, criticisms, and contributions are very welcome.

# Ask me anything … about machine learning

Today I am running an experiment and opening my blog to machine learning questions.

Please ask questions in the comments section. I will try to answer them in the blog over the weekend or the next week or so.

Thanks in advance.

# When Regularization Fails

Another Holiday Blog: Feedback and Questions are very welcome.

I have a client who is using Naive Bayes pretty successfully, and the subject has come up as to ‘why do we need fancy machine learning methods?’

A typical question I am asked is:

Why do we need Regularization? What happens if we just turn the regularizer off? Can we just set it to a very small, default value? After all, we can just measure what customers are doing exactly. Why not just reshow the most popular items, ads, etc. to them. Can we just run a Regression? Or Naive Bayes? Or just use the raw empirical estimates? Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

Sure, if you are simply estimating historical frequencies, n-gram statistics, etc., then a simple method like Naive Bayes is fine. But to predict the future…

The reason we need Regularization is for Generalization. After all, we don't just want to make the same recommendations over and over. Unless you are just in love with ad re-targeting (sic). We want to predict on things we have never seen.
And we need to correct for presentation bias when collecting data on things we have seen. Most importantly, we need methods that don't break down unexpectedly.

In this post, we will see why Regularization is subtle and non-trivial even in seemingly simple linear models like the classic Tikhonov Regularization–and we will see something that is almost never discussed in modern classes. We demonstrate that ‘strong’ overtraining is accompanied by a Phase Transition–and optimal solutions lie just below it.

The example is actually motivated by something I saw recently in my consulting practice–and it was not obvious at first. Still, please realize–we are about to discuss a pathological case hidden in the dark corners of Tikhonov Regularization–a curiosity for mathematicians.

See this Quora post on SVD in NLP for some additional motivations.

Let's begin where Vapnik [1], Smale [2], and the other great minds of the field begin:

(Regularized) Linear Regression: The most basic method in all statistics is linear regression

$\mathbf{X}\mathbf{w}=\mathbf{y}$.

It is attributed to Gauss, one of the greatest mathematicians ever. One solution is to assume Gaussian noise and minimize the error. The solution can also be constructed using the Moore-Penrose PseudoInverse. If we multiply both sides by $\mathbf{X}^{\dagger}$ on the left, we obtain

$\mathbf{X}^{\dagger}\mathbf{X}\mathbf{w}=\mathbf{X}^{\dagger}\mathbf{y}$

The formal solution is

$\mathbf{\hat{w}}=(\mathbf{X}^{\dagger}\mathbf{X})^{-1}\mathbf{X}^{\dagger}\mathbf{y}$

which is only valid when we can actually invert the data covariance matrix $\mathbf{X}^{\dagger}\mathbf{X}$. We say the problem is well-posed when $(\mathbf{X}^{\dagger}\mathbf{X})^{-1}$

1. exists,
2. is unique, and
3. is stable.

Otherwise we say it is singular, or ill-posed [1]. In nearly all practical methods, we would not compute the inverse directly, but use some iterative technique to solve for $\mathbf{\hat{w}}$.
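A minimal numerical sketch of the formal solution on synthetic, well-posed data (in practice one uses a stable solver rather than forming the inverse explicitly):

```python
import numpy as np

# Formal least-squares solution  w_hat = (X^T X)^{-1} X^T y
# on synthetic data.  X is tall and random, so the problem is
# well-posed here.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=100)

# Normal-equations solution (solve, rather than an explicit inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer from a stable SVD-based solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
```

When $\mathbf{X}^{\dagger}\mathbf{X}$ becomes near-singular, the two routes stop agreeing so politely, which is exactly where regularization enters.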
In nearly all practical cases, even when the matrix seems invertible, or even well-posed, it may have numerical instabilities, either due to the data or the numerical solver. Vapnik refers to these as stochastically ill-posed problems [4]. Frequently $\mathbf{X}^{\dagger}\mathbf{X}$ can not be inverted–or should not be inverted–directly.

#### Tikhonov Regularization

A trivial solution is to simply ‘regularize’ the matrix $\mathbf{X}^{\dagger}\mathbf{X}$ by adding a small, non-zero value to the diagonal:

$(\mathbf{X}^{\dagger}\mathbf{X})^{-1}\rightarrow(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{-1}$.

Naively, this seems like it would dampen instabilities and allow for a robust numerical solution. And it does…in most cases. If you want to sound like you are doing some fancy math, give it a Russian name; we call this Tikhonov Regularization [4]:

$\mathbf{\hat{w}}(\lambda)=(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^{\dagger}\mathbf{y}$

This is (also) called Ridge Regression in many common packages such as scikit-learn. The parameter $\lambda$ is the regularization parameter, usually found with Cross Validation (CV). While a data science best practice, CV can fail!

Grid searching $\lambda$, we can obtain the infamous L-curve: a log-log plot of the norm of the regularized solution vs. the norm of the residual. The best $\lambda$ lies somewhere along the bottom of the L, but we can't really tell where (even with cross validation).

This challenge of regularization is absent from basic machine learning classes. The optimal $\lambda$ is actually hard to find. Cross Validation (CV) only works in ideal cases. And we need a CV metric, like $R^{2}$, which also only works in ideal cases. We might naively imagine $\lambda$ can just be very small–in effect, turning off the regularizer. This is seen in the curve above, where the solution norm diverges.

The regularization fails in these 2 notable cases, when

1.
the model errors are correlated, which fools simple cross validation
2. $\lambda\rightarrow 0$ and the number of features ~ the number of training instances, which leads to a phase transition.

Today we will examine this curious, spurious behavior in case 2 (and look briefly at case 1).

#### Regimes of Applications

As von Neumann once said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

To understand where the regularization can fail, we need to distinguish between the cases in which Ridge Regression is commonly used. Say we have $N_{f}$ features, and $N_{I}$ instances. We distinguish between the High Variance and High Bias regimes.

##### High Variance: $N_{f}\gg N_{I}$

This is the underdetermined regime: complicated models that are subject to overtraining. We find multiple solutions which satisfy our constraints. Most big data machine learning problems lie here. We might have 100M documents and 100K words. Or 1M images and simply all pixels as the (base) features. Regularization lets us pick the best solution: the most smooth (L2 norm), or the most sparse (L1 norm), or maybe even the one with the least presentation bias (i.e. using Variance regularization to implement Counterfactual Risk Minimization).

##### High Bias regime: $N_{f}\ll N_{I}$

This is the overdetermined regime, and any solution we find, regularized or not, will most likely generalize poorly. When we have many more (noisy) instances than features, there is no exact solution at all (let alone a unique one). Still, we can pick a regularizer, and the effect is similar to dimensional reduction; Tikhonov Regularization is similar to truncated SVD (explained below). Any solution we find may work, but the predictions will be strongly biased towards the training data.

In between these 2 limits is a seemingly harmless case; however, this is really a

##### Danger Zone: $N_{f}\approx N_{I}$

This is a rare case where the number of features = the number of instances.
This does come up in practical problems, such as classifying a large number of small text phrases. (Something I have been working on with one of my larger clients.) The number of phrases ~ the number of unique words.

This is a dangerous case … not only because it seems so simple … but because the general theory breaks down. Fortunately, it is only pathological in the limit of zero regularization.

#### The limit of zero regularization

We examine how these different regimes behave for small values of $\lambda$

$\underset{\lambda\rightarrow 0}{\lim}\;(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^{\dagger}\mathbf{y}$

Let's formalize this, and consider how the predicted accuracy behaves.

##### The Setup: An Analytical Formulation

We analyze our regression under Gaussian noise $\mathbf{\xi}$. Let

$\mathbf{y}=\mathbf{X}\mathbf{w}^{*}+\mathbf{\xi}$

where $\mathbf{\xi}$ is a Gaussian random variable with zero mean and variance $\sigma^{2}$

$\mathbf{\xi}=N(0,\sigma^{2})$

This simple model lets us analyze the predicted Accuracy analytically.

Let $\mathbf{\hat{w}}$ be the optimal Estimate, $\mathbf{w^{*}}$ the Ground Truth, and $\mathbf{\bar{w}}=\mathbb{E}_{\xi}[\mathbf{\hat{w}}]$ the expected (mean) estimator. We would like to define, and then decompose, the Generalization Accuracy into Bias and Variance contributions. To derive the Generalization Error, we need to work out how the estimator behaves as a random variable. Frequently, in Machine Learning research, one examines the worst case scenario. Alternatively, we can use the average case.

Define the Estimation Error

$Err(\mathbf{\hat{w}})=\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{w^{*}}\Vert^{2}$

Notice that by Generalization Error, we usually mean how our model performs on a holdout set or test point $\mathbf{x}$:

$\mathbb{E}_{\mathbf{\xi}}[(\mathbf{x^{\dagger}}(\hat{\mathbf{w}}-\mathbf{w^{*}}))^{2}]$

In the Appendix, we work out the generalization bounds for the worst case and the average case.
From here on, however, when we refer to Generalization Error, we mean the average case Estimation Error (above).

##### Bias-Variance Tradeoff

For Ridge Regression, the mean estimator is

$\mathbf{\bar{w}}=(\mathbf{X^{\dagger}X}+\lambda\mathbf{I})^{-1}\mathbf{X^{\dagger}X}\mathbf{w^{*}}$

and its variation is

$\mathbf{\hat{w}}-\mathbf{\bar{w}}=(\mathbf{X^{\dagger}X}+\lambda\mathbf{I})^{-1}\mathbf{X^{\dagger}}\mathbf{\xi}$

We can now decompose the Estimation Error into 2 terms:

$\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{w^{*}}\Vert^{2}=\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}+\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}\;\;$,

where $\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}$ is the (squared) Bias, and $\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}$ is the Variance. We examine each regime in the limit $\lambda\rightarrow 0$.

The Bias is just the error of the average estimator:

$\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}=\lambda^{2}\Vert(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{w^{*}}\Vert^{2}$

and the Variance is the trace of the Covariance of the estimator

$\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}=\sigma^{2}\;Tr\bigg(\dfrac{\mathbf{X}^{\dagger}\mathbf{X}}{(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{2}}\bigg)$

We can examine the limiting behavior of these statistics by looking at the leading terms in a series expansion for each. Consider the Singular Value Decomposition (SVD) of $\mathbf{X}$

$\mathbf{X}=\mathbf{U}\mathbf{S}\mathbf{V}^{\dagger}$.

Let $\{s_{1},s_{2},\cdots,s_{m}\}$ be the positive singular values, and $\{\mathbf{v}_{i}\}$ be the right singular vectors.
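Before turning to the SVD analysis, the bias-variance decomposition above can be checked with a quick Monte-Carlo sketch on synthetic data (the sizes and noise level below are arbitrary choices for illustration):

```python
import numpy as np

# Monte-Carlo check of the decomposition
#   E||w_hat - w*||^2  =  variance  +  bias^2
# for ridge regression on synthetic Gaussian-noise data.
rng = np.random.default_rng(0)
n_i, n_f, lam, sigma = 200, 10, 0.5, 0.3
X = rng.normal(size=(n_i, n_f))
w_star = rng.normal(size=n_f)
A = np.linalg.solve(X.T @ X + lam * np.eye(n_f), X.T)  # ridge operator

w_hats = []
for _ in range(2000):                      # resample the noise xi
    y = X @ w_star + sigma * rng.normal(size=n_i)
    w_hats.append(A @ y)
w_hats = np.array(w_hats)

w_bar = w_hats.mean(axis=0)                # mean estimator
err = np.mean(np.sum((w_hats - w_star) ** 2, axis=1))
bias2 = np.sum((w_bar - w_star) ** 2)
var = np.mean(np.sum((w_hats - w_bar) ** 2, axis=1))
assert np.isclose(err, bias2 + var, rtol=1e-5)
```

With the sample mean standing in for $\mathbf{\bar{w}}$, the decomposition is an exact algebraic identity, so the assert holds to floating-point precision.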
We can write the (squared) Bias as

$\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}=\sum\limits_{1}^{N_{f}}\big(\dfrac{\lambda\,\mathbf{v}_{i}^{\dagger}\mathbf{w}^{*}}{s_{i}^{2}+\lambda}\big)^{2}$

When there is no regularization and the number of features exceeds the number of instances, there is a strong bias, determined by the ‘extra’ singular directions

$\lim\limits_{\lambda\rightarrow 0}\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}=\sum\limits_{N_{I}+1}^{N_{f}}(\mathbf{v}_{i}^{\dagger}\mathbf{w}^{*})^{2}\;\;when\;\;N_{f}\gg N_{I}$

Otherwise, the Bias vanishes:

$\lim\limits_{\lambda\rightarrow 0}\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}=0\;\;when\;\;N_{f}\le N_{I}$

So the Bias appears to only matter in the underdetermined case. (Actually, this is misleading; in the overdetermined case, we can introduce a bias by tuning the regularization parameter too high.)

In contrast, the Variance

$\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}=\sigma^{2}\sum\limits_{1}^{m}\dfrac{s_{i}^{2}}{(s^{2}_{i}+\lambda)^{2}}$

$\lim\limits_{\lambda\rightarrow 0}\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}=\sigma^{2}\sum\limits_{1}^{m}s_{i}^{-2}$

can diverge if the minimal singular value is small:

$\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}\rightarrow\infty\;\;when\;\;s_{min}\sim 0$

That is, if the singular values decrease too fast, the variance can explode. And this can happen in the danger zone: when $N_{f}\sim N_{i}$, the variance can be infinite!

Which also means the Central Limit Theorem (CLT) breaks down…at least as we normally use it. Traditionally, the (classical) CLT says that the infinite sum of i.i.d. random variables $\{x_{i}\}$ converges to a Gaussian distribution–when the variance of the $\{x_{i}\}$ is finite. We can, however, generalize the CLT to show that, when the variance is infinite, the sum converges to a Levy, or $\alpha$-stable, power-law distribution.
These power-law distributions are quite interesting–and arise frequently in chemistry and physics during a phase transition. But first, let's see the divergent behavior actually arise.

#### When More is Less: Singularities

Ryota Tomioka [3] has worked out some excellent examples using standard data sets. [I will work up some ipython notebook examples soon]

The Spambase dataset is a great example. It has $N_{f}=57$ features, and $N_{i}=4601$ instances. We run a Ridge Regression, with $N_{i}\in(0,4601]$, with $\lambda=10^{-6}$. As the number of training instances $N_{i}$ increases, the accuracy drops sharply to zero at $N_{i}=N_{f}=57$, and then increases sharply again, saturating at $N_{i}\sim 10^{3}$.
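The anomaly at $N_{f}\approx N_{i}$ is easy to reproduce with random Gaussian design matrices (a sketch, not the Spambase experiment itself): the $\lambda\rightarrow 0$ variance sum $\sigma^{2}\sum s_{i}^{-2}$ explodes when the matrix is nearly square.

```python
import numpy as np

# Compare the lambda -> 0 variance sum  sum_i 1/s_i^2  for a tall
# design matrix (N_i >> N_f) vs a square one (N_i ~ N_f).
rng = np.random.default_rng(0)

def variance_sum(n_i, n_f):
    # Normalize by sqrt(n_i) so the tall case has singular values O(1)
    s = np.linalg.svd(rng.normal(size=(n_i, n_f)) / np.sqrt(n_i),
                      compute_uv=False)
    return np.sum(1.0 / s ** 2)

tall = variance_sum(1000, 50)    # well-separated: smallest s_i is O(1)
square = variance_sum(50, 50)    # danger zone: smallest s_i can be ~0

# In the danger zone the variance sum is far larger
assert square > tall
```

In the tall case the singular values concentrate away from zero, so the sum stays modest; in the square case the smallest singular value of a random matrix is typically tiny, and its inverse square dominates everything.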

Many other datasets show this anomalous behavior at $N_{f}=N_{i}$, as predicted by our analysis above.

#### Ridge Regression vs Logistic Regression: Getting out of the Hole

Where Ridge Regression fails, Logistic Regression bounces back.  Below we show some examples of RR vs LR in the $N_{f}=N_{i}$ danger zone.

Logistic Regression still has problems, but it is nowhere near as pathological as RR.

The reason for this is subtle.

• In Regression, the variance goes to infinity
• In Classification, the norm $\Vert\mathbf{\hat{w}}\Vert$ goes to infinity

#### Traditional ML Theory:  Isn’t the Variance Bounded?

Traditional theory (very loosely) bounds quantities like the generalization error and (therefore) the variance.

We work out some examples in the Appendix.

Clearly these bounds don’t work in all cases.  So what do they really tell us?  What’s the point?

These theories give us some confidence that we can actually apply a learning algorithm–but only when the regularizer is not too small and the noise is not too large.

They essentially try to get us far away from the pathological phase transition that arises when the variance diverges.

#### Cross Validation and Correlated Errors

In practice, we would never just cross our fingers and set $\lambda=0.00000001$.

We pick a holdout set and run Generalized Cross Validation (GCV). Yet this can fail when

1.  the model errors $\xi$ are not i.i.d. but are strongly correlated
2.  we don’t know what accuracy metric to use?

Hansen [5, 7] has discussed these issues in detail…and he provides an alternative method, the L-curve, which attempts to balance the size of the regularized solution and the residuals.

For example, we can sketch the relative errors $\dfrac{\Vert w-w(\lambda)\Vert}{\Vert w\Vert_{2}}$ for the L-curve and GCV methods (see [7]).  We see that the L-curve relative errors are more Gaussian than those of GCV.

In other words, while most packages provide only a small choice of regression metrics, defaults like $R^{2}$ may not be good enough.  Even using the 2-norm may not represent the errors well. As data scientists, we may need to develop our own metrics to get a good $\lambda$ for a regression (or maybe just use an SVM regression).

Hansen claims that “experiments confirm that whenever GCV finds a good regularization parameter, the corresponding solution is located at the corner of the L-curve.” [7]

This means that the optimal regularized solution lives ‘just below’ the phase transition, where the norm diverges.
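A minimal sketch of computing an L-curve with numpy (synthetic, near-singular data; the $\lambda$ grid is arbitrary):

```python
import numpy as np

# Sketch of the L-curve: solution norm vs residual norm over a grid
# of regularization parameters lambda (plotted log-log in practice).
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, n))              # square: the danger zone
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=n)

lambdas = np.logspace(-8, 2, 50)
sol_norms, res_norms = [], []
for lam in lambdas:
    w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
    sol_norms.append(np.linalg.norm(w))          # ||w(lambda)||
    res_norms.append(np.linalg.norm(X @ w - y))  # ||Xw - y||

# As lambda -> 0 the solution norm blows up while the residual
# barely improves: the vertical arm of the L.
assert sol_norms[0] > sol_norms[-1]
```

Plotting `res_norms` against `sol_norms` on log-log axes traces out the L; the corner, sitting just below the norm blow-up, is the candidate optimal $\lambda$.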

#### Phase Transitions in Regularized Regression

What do we mean by a phase transition?

The simplest thing is to plot a graph and look for a steep cliff or sharp change.

Here, the generalization accuracy drops off suddenly as $\dfrac{N_{f}}{N_{i}}\rightarrow 1\;\;and\;\;\lambda\rightarrow 0$

In a physical phase transition, the fluctuations (i.e. variances and norms) in the system approach infinity.

Imagine watching a pot boil

We see bubbles of all sizes, small and large.  The variation in bubble size is all over the map.  This is characteristic of a phase transition: i.e. water to gas.

When we cook, however, we frequently simmer our foods–we keep the water just below the boiling point.  This is as hot as the water can get before it changes to gas.

Likewise, it appears that in Ridge Regression, we frequently operate at a point just below the phase transition–where the solution norm explodes.  And this is quite interesting.  And, I suspect, may be important generally.

In chemical physics, we need special techniques to treat this regime, such as the Renormalization Group. Amazingly, Deep Learning / Restricted Boltzmann Machines  look very similar to the Variational Renormalization Group method.

In our next post, we will examine this idea further, by looking at the phase diagram of the traditional Hopfield Neural Network, and the idea of Replica Symmetry Breaking in the Statistical Mechanics of Generalization. Stay tuned!

#### References

1. Vapnik and Izmailov, V -Matrix Method of Solving Statistical Inference Problems, JMLR 2015

2. Smale, On the Mathematical Foundations of Learning, 2002

3. https://github.com/ryotat

# Appendix:  some math

#### Bias-Variance Decomposition

$\mathbf{y}=\mathbf{X}\mathbf{w}+\mathbf{\xi}$,

where $\mathbb{E}[\mathbf{\xi}]=0$

and $Cov(\mathbf{\xi})=\sigma^{2}\mathbf{I}$

Let

$\mathbf{\Sigma}=\dfrac{1}{N_{i}}\mathbf{X^{\dagger}X}$

be the data covariance matrix.  And, for convenience, let

$\lambda_{n}=\dfrac{\lambda}{n}$.

The Expected value of the optimal estimator, assuming Gaussian noise, is given using the Penrose Pseudo-Inverse

$\mathbb{E}[\mathbf{\hat{w}}]=(\mathbf{\Sigma}+\lambda_{n}\mathbf{I})^{-1}\mathbf{\Sigma}\mathbf{w^{*}}$

The Covariance is

$Cov(\mathbf{\hat{w}})=\dfrac{\sigma^{2}}{n}(\mathbf{\Sigma}+\lambda_{n}\mathbf{I})^{-1}\mathbf{\Sigma}(\mathbf{\Sigma}+\lambda_{n}\mathbf{I})^{-1}$

The regularized Covariance matrix arises so frequently that we will assign it a symbol

$\mathbf{\Sigma}_{\lambda_{n}}=\mathbf{\Sigma}+\lambda_{n}\mathbf{I}$

We can now write down the mean and covariance

$\mathbb{E}_{\xi}\Vert\hat{w}-\bar{w}\Vert^{2}=Tr(Cov(\mathbf{\hat{w}}))$

[more work needed here]

#### Average and Worst Case Generalization Bounds

Consider the Generalization Error for a test point $\mathbf{x}$

$\mathbb{E}_{\mathbf{\xi}}[(\mathbf{x^{\dagger}}(\hat{\mathbf{w}}-\mathbf{w^{*}}))^{2}] =\lambda^{2}_{n}[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w^{*}}]^{2}+\dfrac{\sigma^{2}}{n}\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{\Sigma}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{x}$

We can now obtain some very simple bounds on the Generalization Accuracy (just like a bona-fide ML researcher)

##### The worst case bounds

$[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w^{*}}]^{2}\le\Vert\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{x}\Vert^{2}\Vert\mathbf{w^{*}}\Vert^{2}$

##### The average case bounds

$\mathbb{E}_{\mathbf{w^{*}}}[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w^{*}}]^{2}\le\alpha^{-1}\Vert\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{x}\Vert^{2}$,

where we assume the ‘covariance*’ is

$\mathbb{E}_{\mathbf{w^{*}}}[\mathbf{w^{*}w^{*\dagger}}]=\alpha^{-1}\mathbf{I}$

#### Discrete Picard Condition

We can use the SVD approach to make stronger statements about our ability to solve the inversion problem

$\mathbf{Xw}=\mathbf{y}$.

I will sketch an idea called the “Discrete Picard Condition” and provide some intuition

Write $\mathbf{X}$ as a sum over singular vectors

$\mathbf{X}=\mathbf{U\Sigma V^{\dagger}}=\sum\limits_{i=1}^{N}\mathbf{u_{i}}\sigma_{i}\mathbf{v_{i}^{\dagger}}$.

Introduce an abstract PseudoInverse,  defined by

$\mathbf{X}^{-\dagger}\mathbf{X}=\mathbf{I}$

Express the formal solution as

$\mathbf{w}=\mathbf{X}^{-\dagger}\mathbf{y}$

Imagine the SVD of the PseudoInverse is

$\mathbf{X}^{-\dagger}=\mathbf{V\Sigma^{-1}U^{\dagger}}=\sum\limits_{i=1}^{N}\mathbf{v_{i}}\sigma^{-1}_{i}\mathbf{u_{i}^{\dagger}}$.

and use it to obtain

$\mathbf{w}=\sum\limits_{i=1}^{N}\dfrac{\mathbf{u^{\dagger}_{i}y}}{\sigma_{i}}\mathbf{v_{i}}$.

For this expression to be meaningful, the SVD coefficients $\mathbf{u^{\dagger}_{i}y}$ must decay, on average, faster than the corresponding singular values $\sigma_{i}$.

Consequently, the only coefficients that carry any information are those larger than the noise level.  The small coefficients have small singular values and are dominated by noise.  As the problem becomes more difficult to solve uniquely, the singular values decay more rapidly.
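A small numerical sketch of the Discrete Picard Condition: build a matrix with a rapidly decaying spectrum and compare the coefficients $|\mathbf{u}_{i}^{\dagger}\mathbf{y}|$ above and below the noise floor (the spectrum and noise level are arbitrary illustrative choices).

```python
import numpy as np

# Construct X = U diag(s) V^T with a rapidly decaying spectrum,
# using random orthogonal U, V from QR factorizations.
rng = np.random.default_rng(0)
n = 40
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
s = np.logspace(0, -6, n)                # sigma_i from 1 down to 1e-6
X = U @ np.diag(s) @ V.T

w_true = rng.normal(size=n)
y = X @ w_true + 1e-4 * rng.normal(size=n)

# Picard plot data: |u_i^T y| vs sigma_i.  Where sigma_i is small,
# the coefficients flatten out at the noise level instead of
# decaying along with sigma_i -- the Picard condition fails there.
coeffs = np.abs(U.T @ y)
noise_floor = coeffs[s < 1e-5].mean()
signal = coeffs[s > 1e-1].mean()
assert noise_floor < signal
```

Plotting `coeffs` and `s` together on a log scale makes the crossover visible: only the components above the noise floor carry usable information.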

If we discard the $k+1,\dots,N$ singular components, we obtain a regularized, unsupervised solution called

##### Truncated SVD

$\mathbf{w}=\sum\limits_{i=1}^{k}\dfrac{\mathbf{u^{\dagger}_{i}y}}{\sigma_{i}}\mathbf{v_{i}}$.

which is similar in spirit to spectral clustering.
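A minimal sketch of the truncated SVD solution, showing how truncation tames the norm of the near-singular full solution (the truncation level `k` is chosen arbitrarily here):

```python
import numpy as np

# Truncated SVD regularization: keep only the k leading singular
# components of X when forming the solution.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, n))              # near-singular square matrix
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 30  # in practice, chosen from the Picard plot / noise level
w_tsvd = Vt[:k].T @ ((U.T @ y)[:k] / s[:k])

# The full pseudo-inverse solution divides by the tiny trailing
# singular values and blows up; truncation avoids this.
w_full = Vt.T @ ((U.T @ y) / s)
assert np.linalg.norm(w_tsvd) < np.linalg.norm(w_full)
```

The trailing terms divide noise by near-zero $\sigma_{i}$, which is exactly the variance divergence discussed earlier; dropping them plays the same role as a finite $\lambda$ in Tikhonov Regularization.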