Deep Learning is amazing. But why is Deep Learning so successful? Is Deep Learning just old-school Neural Networks on modern hardware? Is it just that we now have so much data that the methods work better? Is Deep Learning just really good at finding features? Researchers are working hard to sort this out.
Recently it has been shown that 
Unsupervised Deep Learning implements the Kadanoff Real Space Variational Renormalization Group (1975)
This means the success of Deep Learning is intimately related to some very deep and subtle ideas from Theoretical Physics. In this post we examine this.
Unsupervised Deep Learning: AutoEncoder Flow Map
An AutoEncoder is an Unsupervised Deep Learning algorithm that learns how to represent a complex image or other data structure. There are several kinds of AutoEncoders; we care about so-called Neural Encoders, those using Deep Learning techniques to reconstruct the data:
The simplest Neural Encoder is a Restricted Boltzmann Machine (RBM). An RBM is a non-linear, recursive, lossy function that maps the data from the visible nodes v into the hidden nodes h:
The RBM is learned by selecting the optimal parameters that minimize the reconstruction error
RBMs and other Deep Learning algos are formulated using classical Statistical Mechanics. And that is where it gets interesting!
Multi Scale Feature Learning
In old-school ML, we map (visible) data into (hidden) features
The hidden units discover features at a coarser grain level of scale
With RBMs, when features are complex, we may stack them into a Deep Belief Network (DBN), so that we can learn at different levels of scale. This leads to multi-scale features in each layer.
Deep Belief Networks are a Theory of Unsupervised MultiScale Feature Learning
Fixed Points and Flow Maps
We call the map from data to features a flow map
If we apply the flow map to the data repeatedly, (we hope) it converges to a fixed point
Notice that we usually expect to apply the same map each time; however, for a computational theory we may need more flexibility.
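To make the fixed point idea concrete, here is a minimal sketch that applies one map over and over until the output stops changing. The helper `iterate_flow_map` and the toy map `contraction` are hypothetical names chosen for illustration; they are not from any RBM library, and a real feature map would be learned rather than written by hand.

```python
import numpy as np

def iterate_flow_map(f, x0, tol=1e-10, max_iter=1000):
    """Apply the same map f repeatedly until successive iterates agree within tol."""
    x = np.asarray(x0, dtype=float)
    for step in range(1, max_iter + 1):
        x_next = f(x)
        if np.linalg.norm(x_next - x) < tol:
            return x_next, step
        x = x_next
    return x, max_iter

def contraction(x):
    # Toy linear flow map (an assumption for illustration): its fixed point solves x = 0.5 * x + 1.
    return 0.5 * x + 1.0

x_star, n_steps = iterate_flow_map(contraction, [10.0, -3.0])
print(x_star, n_steps)
```

Because this toy map shrinks distances by a factor of two at every step, any starting point flows to the same fixed point. A learned, non-linear map carries no such guarantee, which is why the text above says "(we hope)".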
Example: Linear Flow Map
The simplest example of a flow map is the simple linear map
where C is a non-negative, low rank matrix
We have seen this before: this leads to a Convex form of NonNegative Matrix Factorization (NMF)
Convex NMF applies when we can specify the feature space and where the data naturally clusters. Here, there are a few instances that are archetypes that define the convex hull of the data.
Amazingly, many clustering problems are provably convex–but that’s a story for another post.
Example: Manifold Learning
Near a fixed point, we commonly approximate the flow map by a linear operator
This lets us capture the structure of the true data manifold, which is usually described by the low-lying eigenspectrum of that linear operator.
In the same spirit, in Semi-Supervised and Unsupervised Manifold Learning, we model the data using a Laplacian operator, usually parameterized by a single scale parameter.
These methods include Spectral Clustering, Manifold Regularization, Laplacian SVM, etc. Note that manifold learning methods, like the Manifold Tangent Classifier, employ Contractive Auto Encoders and use several scale parameters to capture the local structure of the data manifold.
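As a rough illustration of the single-scale Laplacian picture, the sketch below builds an unnormalized graph Laplacian from a Gaussian kernel and reads cluster structure off its low-lying eigenvectors. The function name `graph_laplacian`, the kernel choice, the toy data, and the value of `sigma` are all assumptions made for this example; the methods named above each have their own specific constructions.

```python
import numpy as np

def graph_laplacian(X, sigma):
    """Unnormalized graph Laplacian L = D - W with Gaussian weights at scale sigma."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)          # no self-loops
    D = np.diag(W.sum(axis=1))
    return D - W

# Toy data (assumed): two tight blobs in the plane.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
cluster_b = rng.normal(loc=2.0, scale=0.1, size=(20, 2))
X = np.vstack([cluster_a, cluster_b])

L = graph_laplacian(X, sigma=1.0)
eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order

# The second-lowest eigenvector (the Fiedler vector) changes sign between the blobs.
fiedler = eigvecs[:, 1]
labels = (fiedler > 0).astype(int)
print(eigvals[:3])
print(labels)
```

The scale parameter matters: if `sigma` is much smaller than the gap between the blobs, the graph is effectively disconnected and the two lowest eigenvectors become numerically degenerate, so the sign trick is no longer reliable.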
The Renormalization Group
In chemistry and physics, we frequently encounter problems that require a multi-scale description. We need this for critical points and phase transitions, for natural crashes like earthquakes and avalanches, for polymers and other macromolecules, for strongly correlated electronic systems, for quantum field theory, and, now, for Deep Learning.
A unifying idea across these systems is the Renormalization Group (RG) Theory.
Renormalization Group Theory is both a conceptual framework for thinking about physics on multiple scales and a technical & computational problem-solving tool.
Ken Wilson won the 1982 Nobel Prize in Physics for the development and application of his Momentum Space RG theory to phase transitions.
We used RG theory to model the recent BitCoin crash as a phase transition.
Wilson invented modern multi-scale modeling; the so-called Wilson Basis was an early form of Wavelets. Wilson was also a big advocate of using supercomputers for solving problems. Being a Nobel Laureate, he had great success promoting scientific computing. It was thanks to him I had access to a Cray Y-MP when I was in high school because he was a professor at my undergrad, The Ohio State University.
Here is the idea. Consider a feature map which transforms the data to a different, more coarse grain scale
The RG theory requires that the Free Energy is rescaled, to reflect that
the Free Energy is both Size-Extensive and Scale Invariant near a Critical Point
This is not obvious, but it is essential both for a conceptual understanding of complex, multi-scale phenomena and for obtaining highly accurate numerical calculations. In fact, being size extensive and/or size consistent is absolutely necessary for highly accurate quantum chemistry calculations of strongly correlated systems. So it is pretty amazing, but perhaps not surprising, that this is necessary for large scale deep learning calculations also!
The Fundamental Renormalization Group Equation (RGE)
If we (can) apply the same map repeatedly, we obtain an RG recursion relation, which is the starting point for most analytic work in theoretical physics. It is usually difficult to obtain an exact solution to the RGE.
Many RG formulations approximate the exact RGE and/or include only relevant variables. To describe a multiscale system, it is essential to distinguish between these relevant and irrelevant variables.
Example: Linear Rescaling
Let’s say the feature map is a simple linear rescaling
We can obtain a very elegant, approximate RG solution where F(x) obeys a complex (or log-periodic) power law.
This behavior is thought to characterize Per Bak style Self-Organized Criticality (SOC), also known as the Sand Pile Model, which appears in many natural systems and perhaps even in the brain itself. This leads to the argument that perhaps Deep Learning and Real Learning work so well because they operate like a system just near a phase transition, at a state between order and chaos.
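Here is a small numerical check of the log-periodic power law, under a simplifying assumption. Suppose the rescaling relation reduces to the homogeneous form F(lam * x) = mu * F(x), where `lam` and `mu` are hypothetical names for the length and free energy rescaling factors. Trying F(x) = x^alpha gives lam^alpha = mu, so alpha = ln(mu)/ln(lam) + 2*pi*i*n/ln(lam) for any integer n. The real part of such a solution is a power law decorated by oscillations that are periodic in ln(x).

```python
import numpy as np

lam = 2.0   # assumed rescaling factor of the linear feature map x -> lam * x
mu = 3.0    # assumed factor by which the free energy rescales
alpha = np.log(mu) / np.log(lam)   # real part of the complex exponent

def log_periodic_power_law(x, amplitude=0.3, phase=0.7):
    """Power law x**alpha times an oscillation that is periodic in ln(x) with period ln(lam)."""
    oscillation = 1.0 + amplitude * np.cos(2.0 * np.pi * np.log(x) / np.log(lam) + phase)
    return x ** alpha * oscillation

x = np.linspace(0.5, 10.0, 50)
lhs = log_periodic_power_law(lam * x)   # F(lam * x)
rhs = mu * log_periodic_power_law(x)    # mu * F(x)
print(np.max(np.abs(lhs - rhs)))
```

The oscillating factor is unchanged when x is multiplied by `lam`, so the whole function picks up exactly the factor `mu`. That is the discrete scale invariance behind the log-periodic behavior; the inhomogeneous term of a full RG equation is deliberately left out of this sketch.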
The Kadanoff Variational Renormalization Group (1975)
Leo Kadanoff, now at the University of Chicago, invented some of the early ideas in Renormalization Group. He is most famous for the Real Space formulation of RG, sometimes called the Block Spin approach. He also developed an alternative approach, called the Variational Renormalization Group (VRG, 1975), which is, remarkably, what Unsupervised DBNs are implementing!
Let’s consider a traditional Neural Network–a Hopfield Associative Memory (HAM). This is also known as the Ising model or a Spin Glass in statistical physics.
A HAM consists of only visible units; it stores memories explicitly and directly in them:
We specify the Energy — called the Hamiltonian — for the nodes. Note that all the nodes are visible. We write
The Hopfield model has only single and pair-wise interactions.
A general Hamiltonian might have many-body, multi-scale interactions:
The Partition Function is given as
And the Free Energy is
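For a handful of spins these quantities can be computed by brute force. The sketch below assumes the usual conventions, which are not spelled out in the text: spins take values +1 and -1, the temperature is set to one, the Hopfield energy is H(v) = -sum_i b_i v_i - sum_{i<j} J_ij v_i v_j, the partition function is Z = sum_v exp(-H(v)), and the free energy is F = -ln Z. The names `b` and `J` and the random couplings are illustrative only.

```python
import itertools
import numpy as np

def hopfield_energy(v, b, J):
    """Single-site and pairwise terms only; J is symmetric with a zero diagonal."""
    return -b @ v - 0.5 * v @ J @ v

def partition_function(b, J):
    """Sum exp(-H) over all 2**n spin configurations (only feasible for small n)."""
    n = len(b)
    states = itertools.product([-1.0, 1.0], repeat=n)
    return sum(np.exp(-hopfield_energy(np.array(s), b, J)) for s in states)

def free_energy(b, J):
    return -np.log(partition_function(b, J))

# Random couplings, assumed for illustration.
rng = np.random.default_rng(0)
n = 4
J = rng.normal(scale=0.5, size=(n, n))
J = 0.5 * (J + J.T)
np.fill_diagonal(J, 0.0)
b = rng.normal(scale=0.1, size=n)

F = free_energy(b, J)
print(F)
```

With all couplings and fields set to zero, every configuration has weight one, so Z = 2^n and F = -n ln 2. The free energy then grows linearly with the number of spins, which is the simplest instance of size extensivity.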
The idea was to mimic how our neurons were thought to store memories–although perhaps our neurons do not even do this.
Either way, Hopfield Neural Networks have many problems; most notably they may learn spurious patterns that never appeared in the training set. So they are pretty bad memories.
Hinton created the modern RBM to overcome the problems of the Hopfield model. He used hidden units to represent the features in the data–not to memorize the data examples directly.
An RBM is specified by an Energy function for both the visible and hidden units
This also defines the joint probability of simultaneously observing a configuration of hidden and visible spins
which is learned variationally, by minimizing the reconstruction error…or the cross entropy (KL divergence), plus some regularization (Dropout), using Greedy layer-wise unsupervised training, with the Contrastive Divergence (CD or PCD) algo, …
The specific details of an RBM Energy are not addressed by these general concepts; these details do not affect these arguments, although clearly they matter in practice!
It turns out that
Introducing Hidden Units in a Neural Network is a Scale Renormalization.
When changing scale, we obtain an Effective Hamiltonian that acts on the new feature space (i.e., the hidden units)
or, in operator form
This Effective Hamiltonian is not specified explicitly, but we know it can take the general form (of a spin funnel, actually)
The RG transform preserves the Free Energy (when properly rescaled):
Critical Trajectories and Renormalized Manifolds
The RG theory provides a way to iteratively update, or renormalize, the system Hamiltonian. Each time we add a layer of hidden units (h1, h2, …), we have
We imagine that the flow map is attracted to a Critical Trajectory which naturally leads the algorithm to the fixed point. At each step, when we apply another RG transform, we obtain a new, Renormalized Manifold, each one closer to the optimal data manifold.
Conceptually, the RG flow map is most useful when applied to critical phenomena–physical systems and/or simple models that undergo a phase transition. And, as importantly, the small changes in the data should ‘wash away’ as noise and not affect the macroscopic / critical phenomena. Many systems–but not all–display this.
Where Hopfield Nets fail to be useful here, RBMs and Deep Learning systems shine.
We now show that these RG transformations are achieved by stacking RBMs and solving the RBM inference problem!
Kadanoff’s Variational Renormalization Group
As in many physics problems, we break the modeling problem into two parts: one we know how to solve, and one we need to guess.
- we know the Hamiltonian at the most fine grained level of scale
- we seek the correlation that couples to the next level scale
The joint Hamiltonian, or Energy function, is then given by
The Correlation V(v,h) is defined so that the partition function is not changed
This gives us
(Sometimes the Correlation V is called a Transfer Operator T, where V(v,h)=-T(v,h) )
We may now define a renormalized effective Hamiltonian that acts only on the hidden nodes
so that we may write
Because the partition function does not change, the Exact RGE preserves the Free Energy (up to a scale change).
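The exact case can be checked numerically on a tiny system. The sketch below assumes the standard reading of the construction: the correlation satisfies sum_h exp(-V(v,h)) = 1 for every v, and the renormalized Hamiltonian is defined by exp(-H_RG(h)) = sum_v exp(-H(v) - V(v,h)). The pairwise visible Hamiltonian and the bilinear coupling matrix `W` are arbitrary choices made for illustration, not the form used by any particular model.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_visible, n_hidden = 3, 2
visible_states = [np.array(s) for s in itertools.product([-1.0, 1.0], repeat=n_visible)]
hidden_states = [np.array(s) for s in itertools.product([-1.0, 1.0], repeat=n_hidden)]

# Assumed fine-grained Hamiltonian: random symmetric pairwise couplings.
J = rng.normal(size=(n_visible, n_visible))
J = 0.5 * (J + J.T)
np.fill_diagonal(J, 0.0)
W = rng.normal(size=(n_visible, n_hidden))   # hypothetical visible-hidden coupling

def H_visible(v):
    return -0.5 * v @ J @ v

def V(v, h):
    """Correlation term, normalized so that sum_h exp(-V(v, h)) = 1 for every v."""
    log_norm = np.log(sum(np.exp(v @ W @ hp) for hp in hidden_states))
    return -(v @ W @ h) + log_norm

def H_renormalized(h):
    """Effective Hamiltonian on the hidden units, obtained by summing out v."""
    return -np.log(sum(np.exp(-H_visible(v) - V(v, h)) for v in visible_states))

Z_visible = sum(np.exp(-H_visible(v)) for v in visible_states)
Z_hidden = sum(np.exp(-H_renormalized(h)) for h in hidden_states)
print(Z_visible, Z_hidden)
```

The two partition functions agree to floating point precision, so the free energy -ln Z is the same before and after the change of scale. If the normalization term is dropped from `V`, the two values generally differ, and that gap is what the variational approach tries to shrink.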
We generally cannot solve the exact RGE, but we can try to minimize this Free Energy difference.
What Kadanoff showed, way back in 1975, is that we can accurately approximate the Exact Renormalization Group Equation by finding a lower bound using this formalism
Deep learning appears to be a real-space variational RG technique, specifically applicable to very complex, inhomogeneous systems where the detailed scale transformations have to be learned from the data.
RBMs expressed using Variational RG
We will now show how to express RBMs using the VRG formalism and provide some intuition.
In an RBM, we simply want to learn the Energy function directly; we don’t specify the Hamiltonian for the visible or hidden units explicitly, like we would in physics. The RBM Energy is just
We identify the Hamiltonian for the hidden units as the Renormalized Effective Hamiltonian from the VRG theory
RBM Hamiltonians / Marginal Probabilities
To obtain RBM Hamiltonians for just the visible or hidden nodes, we need to integrate out the other nodes; that is, we need to find the marginal probabilities.
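For a binary RBM the hidden units can be summed out in closed form. The sketch below assumes the common 0/1 convention with energy E(v,h) = -a.v - b.h - v.W.h, where `a`, `b`, and `W` are the visible biases, hidden biases, and weights (names chosen here, not taken from the text). Summing exp(-E) over all hidden configurations factorizes, which gives the visible Hamiltonian H(v) = -a.v - sum_j log(1 + exp(b_j + (vW)_j)). The code compares this formula against an explicit sum.

```python
import itertools
import numpy as np

def rbm_energy(v, h, a, b, W):
    """Joint energy of a binary (0/1) RBM."""
    return -a @ v - b @ h - v @ W @ h

def visible_hamiltonian_brute_force(v, a, b, W):
    """-log of the sum over every hidden configuration of exp(-E(v, h))."""
    hidden_states = itertools.product([0.0, 1.0], repeat=len(b))
    z_v = sum(np.exp(-rbm_energy(v, np.array(h), a, b, W)) for h in hidden_states)
    return -np.log(z_v)

def visible_hamiltonian(v, a, b, W):
    """Closed form: the sum over hidden units factorizes into one softplus per unit."""
    return -a @ v - np.sum(np.logaddexp(0.0, b + v @ W))

# Random parameters, assumed for illustration.
rng = np.random.default_rng(0)
n_visible, n_hidden = 5, 3
a = rng.normal(size=n_visible)
b = rng.normal(size=n_hidden)
W = rng.normal(size=(n_visible, n_hidden))
v = np.array([1.0, 0.0, 1.0, 1.0, 0.0])

print(visible_hamiltonian_brute_force(v, a, b, W))
print(visible_hamiltonian(v, a, b, W))
```

The same trick with the roles of v and h swapped gives the Hamiltonian for the hidden units alone, which is the quantity identified above with the Renormalized Effective Hamiltonian.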
To train an RBM, we apply Contrastive Divergence (CD), or, perhaps today, Persistent Contrastive Divergence (PCD). We can kind of think of this as slowly approximating
In practice, however, RBM training minimizes the associated Free Energy difference … or something akin to this…to avoid overfitting.
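For readers who have not seen it, here is a bare-bones sketch of one-step Contrastive Divergence (CD-1) for a binary RBM. It is a simplified variant written for illustration: full-batch updates, no momentum, no weight decay, and probabilities rather than samples in the negative phase. The toy data, learning rate, and step count are assumptions, not recommended settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V0, a, b, W, rng, lr=0.1):
    """One CD-1 update: data-driven statistics minus one-step reconstruction statistics."""
    ph0 = sigmoid(b + V0 @ W)                          # hidden probabilities given the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sampled binary hidden states
    pv1 = sigmoid(a + h0 @ W.T)                        # reconstructed visible probabilities
    ph1 = sigmoid(b + pv1 @ W)                         # hidden probabilities given the reconstruction
    n = V0.shape[0]
    W = W + lr * (V0.T @ ph0 - pv1.T @ ph1) / n
    a = a + lr * (V0 - pv1).mean(axis=0)
    b = b + lr * (ph0 - ph1).mean(axis=0)
    return a, b, W

def reconstruction_error(V, a, b, W):
    """Mean squared error of a deterministic (mean-field) reconstruction."""
    ph = sigmoid(b + V @ W)
    pv = sigmoid(a + ph @ W.T)
    return np.mean((V - pv) ** 2)

# Toy training set (assumed): two complementary binary patterns, repeated.
rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
V_train = np.repeat(patterns, 20, axis=0)

n_visible, n_hidden = 6, 2
a = np.zeros(n_visible)
b = np.zeros(n_hidden)
W = 0.01 * rng.normal(size=(n_visible, n_hidden))

error_before = reconstruction_error(V_train, a, b, W)
for _ in range(2000):
    a, b, W = cd1_step(V_train, a, b, W, rng)
error_after = reconstruction_error(V_train, a, b, W)
print(error_before, error_after)
```

Reconstruction error is only a convenient progress signal here. It is not the quantity CD actually optimizes, which is one reason the free energy is the more natural thing to watch.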
In the “Practical Guide to Training Restricted Boltzmann Machines”, Hinton explains how to train an RBM (circa 2011). Section 6 addresses “Monitoring the overfitting”
“it is possible to directly monitor the overfitting by comparing the free energies of training data and held out validation data…If the model is not overfitting at all, the average free energy should be about the same on training and validation data”
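The check described in this quote is easy to sketch. The code below assumes a binary RBM with visible biases `a`, hidden biases `b`, and weights `W`, and uses the closed-form free energy of a visible vector, F(v) = -a.v - sum_j log(1 + exp(b_j + (vW)_j)). The function names, the random data, and the untrained parameters are placeholders for illustration only.

```python
import numpy as np

def free_energy(V, a, b, W):
    """Free energy of each row of V under a binary RBM (hidden units summed out)."""
    return -V @ a - np.sum(np.logaddexp(0.0, b + V @ W), axis=1)

def free_energy_gap(train, valid, a, b, W):
    """Average validation free energy minus average training free energy."""
    return free_energy(valid, a, b, W).mean() - free_energy(train, a, b, W).mean()

# Placeholder parameters and data, assumed for illustration.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
a = rng.normal(scale=0.1, size=n_visible)
b = rng.normal(scale=0.1, size=n_hidden)
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
train = rng.integers(0, 2, size=(200, n_visible)).astype(float)
valid = rng.integers(0, 2, size=(200, n_visible)).astype(float)

gap = free_energy_gap(train, valid, a, b, W)
print(gap)
```

In a real training run one would track this gap across epochs: a gap that stays near zero suggests little overfitting, while a gap that keeps growing suggests the model is assigning much lower free energy to the training data than to held-out data.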
Other Objective Functions
Modern variants of Real Space VRG are not “‘forced’ to minimize the global free energy” and have attempted other approaches such as Tensor-SVD Renormalization. Likewise, some RBM / DBM approaches may minimize a different objective.
In some methods, we minimize the KL Divergence; this has a very natural analog in VRG language.
Why Deep Learning Works: Lessons from Theoretical Physics
The Renormalization Group Theory provides new insights as to why Deep Learning works so amazingly well. It is not, however, a complete theory. Rather, it is a framework for beginning to understand what is an incredibly powerful, modern, applied tool. Enjoy!
 THE RENORMALIZATION GROUP AND CRITICAL PHENOMENA, Ken Wilson Nobel Prize Lecture
 The Potential Energy of an Autoencoder, 2014