I am starting a new project to try to reproduce some core deep learning papers in TensorFlow from some of the big names in the field.
The motivation: to understand how to build very deep networks and why they do (or don’t) work.
There are several papers that caught my eye, starting with:
- Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition (2010)
  - I can find no implementation or data
- Do Deep Nets Really Need to be Deep? (2014), also called student-teacher learning
  - there is an implementation, but it is unclear what data was used
These papers set the foundation for looking at much larger, deeper networks such as:
- ResNet (Deep Residual Learning)
  - there are several TensorFlow implementations, but I don’t know which is best
- Highway Networks
  - see Jim Fleming’s post on a TensorFlow implementation
- FractalNet
  - an implementation is needed
FractalNets are particularly interesting, since they suggest that very deep networks do not need student-teacher learning and can instead be self-similar (which is related to very recent work on the statistical physics of deep learning and the renormalization group analogy).
IMHO, it is not enough just to implement the code; the results have to be excellent as well. I am not impressed with the results I have seen so far, and I would like to flesh out what is really going on.
Big Deep Simple Nets
The 2010 paper still appears to be one of the top 10 results on MNIST.
The idea is simple: they claim to get state-of-the-art accuracy on MNIST using a 5-layer MLP by running a large number of epochs with just SGD, a decaying learning rate, and an augmented data set.
The key idea is that the augmented data set can provide, in practice, an infinite amount of training data. And with infinite data, we never have to worry about overtraining from having too many adjustable parameters, so any reasonably sized network will do the trick if we just run it long enough.
In other words, there is no gap relative to convolutional nets, no need for early stopping, and really no need for regularization at all.
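As a concrete sketch of that recipe: the paper’s largest MLP has hidden layers of 2500, 2000, 1500, 1000, and 500 units, trained with plain SGD and a per-epoch exponential decay of the learning rate. The decay constants below are my own illustrative guesses, not the paper’s exact values:

```python
def lr_schedule(epoch, lr0=1e-3, decay=0.997, lr_min=1e-6):
    """Exponentially decaying learning rate, floored at lr_min.
    The constants are illustrative; the paper anneals the rate
    from roughly 1e-3 down toward 1e-6 over many epochs."""
    return max(lr0 * decay ** epoch, lr_min)

# Layer widths of the largest MLP reported in the 2010 paper:
# 784 inputs, hidden layers 2500-2000-1500-1000-500, 10 outputs.
layer_sizes = [784, 2500, 2000, 1500, 1000, 500, 10]
```

The schedule is the only thing resembling regularization in the setup: the rate simply shrinks each epoch until it hits the floor.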
This sounds dubious to me, but I wanted to see for myself. Also, perhaps I am missing some subtle detail. Did they clip gradients somewhere? Is the activation function central? Do we need to tune the learning-rate decay?
I have initial notebooks on GitHub, and would welcome feedback and contributions, plus ideas for other papers to reproduce.
I am trying to repeat this experiment using TensorFlow and two kinds of augmented data sets:
- InfiMNIST (2006) – provides nearly 1B deformations of MNIST
- AlignMNIST (2016) – provides 75-150 epochs of deformed MNIST
(and let me say a special personal thanks to Søren Hauberg for providing this recent data set)
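As a toy illustration of what deformation-based augmentation does (the real InfiMNIST and AlignMNIST pipelines use affine and elastic deformations, not this simple translation), a random pixel shift in NumPy looks like:

```python
import numpy as np

def random_shift(img, max_px=2, rng=None):
    """Translate a 2-D image by up to max_px pixels in each direction,
    zero-padding the exposed border. A toy stand-in for the affine and
    elastic deformations that InfiMNIST/AlignMNIST actually apply."""
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(-max_px, max_px + 1, size=2)
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return out
```

Drawing a fresh deformation every time an image is sampled is what makes the augmented set effectively infinite.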
Current results are up for:
- 2-Layer AlignMNIST, 75 epochs
- 5-Layer AlignMNIST, 75 epochs
- 2-Layer InfiMNIST, 500 epochs
- 5-Layer InfiMNIST, 500 epochs
The initial results indicate that AlignMNIST is much better than InfiMNIST for this simple MLP, although I still do not see the extremely high, top-10 accuracy reported.
Furthermore, the 5-Layer InfiMNIST run actually diverges after ~100 epochs. So we still need early stopping, even with an effectively infinite amount of data.
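Since the 5-Layer InfiMNIST run diverges, some form of early stopping is still needed in practice. A minimal patience-based check over per-epoch validation losses might look like this (the patience value is an arbitrary choice of mine):

```python
def should_stop(val_losses, patience=10):
    """Return True when the best validation loss occurred more than
    `patience` epochs ago. The patience value is illustrative."""
    if len(val_losses) <= patience:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return (len(val_losses) - 1) - best_epoch >= patience
```

Calling this after each epoch and halting when it returns True would have cut the diverging run off around epoch 100 instead of letting it continue to 500.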
It may be interesting to try the Keras ImageDataGenerator class, described in the related blog post “Building powerful image classification models using very little data”.
I will periodically update this blog as new data comes in and I have the time to implement these newer techniques.
Next, we will check in the log files and discuss the TensorBoard results.
Comments, criticisms, and contributions are very welcome.