I am starting a new project to try to reproduce some core deep learning papers in TensorFlow from some of the big names.
The motivation: to understand how to build very deep networks and why they do (or don’t) work.
There are several papers that caught my eye, starting with
 Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition (2010)
 I can find no implementation or data.

Unifying Distillation and Privileged Information (2015)
 Also called student-teacher learning.
 There is an implementation, but it is unclear what data was used.
These papers set the foundation for looking at much larger, deeper networks such as
 ResNet (Deep Residual Learning)
 There are several TensorFlow implementations; I don’t know which is best.
 Highway Networks
 See Jim Flemming’s post on a TensorFlow implementation.
 FractalNet
 An implementation is needed.
FractalNets are particularly interesting since they suggest that very deep networks do not need student-teacher learning and can instead be self-similar (which is related to very recent work on the Statistical Physics of Deep Learning and the Renormalization Group analogy).
IMHO, it is not enough just to implement the code; the results have to be excellent as well. I am not impressed with the results I have seen so far, and I would like to tease out what is really going on.
Big Deep Simple Nets
The 2010 paper still appears to be one of the top 10 results on MNIST:
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
The idea is simple. They claim to get state-of-the-art accuracy on MNIST using a 5-layer MLP by running a large number of epochs with just SGD, a decaying learning rate, and an augmented data set.
The key idea is that the augmented data set can provide, in practice, an infinite amount of training data. And with infinite data, we never have to worry about overtraining because we have too many adjustable parameters; any reasonably sized network will do the trick if we just run it long enough.
In other words, there is no generalization gap, no need for early stopping, and really no regularization at all.
This sounds dubious to me, but I wanted to see for myself. Also, perhaps I am missing some subtle detail. Did they clip gradients somewhere? Is the activation function central? Do we need to tune the learning rate decay?
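To make the recipe concrete, here is a minimal NumPy sketch of the basic setup: a plain MLP trained with vanilla SGD and a multiplicatively decaying learning rate. The layer sizes, decay factor, loss, and synthetic data are illustrative stand-ins, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic 2-class problem standing in for MNIST (16-dim inputs).
X = rng.normal(size=(200, 16))
true_w = rng.normal(size=(16,))
y = (X @ true_w > 0).astype(float)

# One hidden layer for brevity (the paper uses several, all fully connected).
W1 = rng.normal(scale=0.1, size=(16, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 1));  b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)                   # plain tanh here; the paper's exact activation may differ
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid output
    return h, p.ravel()

lr = 0.5       # initial learning rate (illustrative)
decay = 0.99   # multiplied in after every epoch -- the decaying-rate schedule

losses = []
for epoch in range(200):
    h, p = forward(X)
    losses.append(np.mean((p - y) ** 2))
    # Backprop for the squared loss (vanilla SGD, full batch for brevity).
    grad_out = (p - y)[:, None] * (p * (1 - p))[:, None]
    gW2 = h.T @ grad_out / len(X)
    gb2 = grad_out.mean(axis=0)
    grad_h = grad_out @ W2.T * (1 - h ** 2)
    gW1 = X.T @ grad_h / len(X)
    gb1 = grad_h.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
    lr *= decay    # the decaying learning rate is essentially the only "regularizer"
```

The striking part of the claim is that the initial learning rate and the per-epoch decay factor are essentially the only knobs.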
I have initial notebooks on github, and would welcome feedback and contributions, plus ideas for other papers to reproduce.
I am trying to repeat this experiment using TensorFlow and two kinds of augmented data sets:
 InfiMNIST (2006) – provides nearly 1B deformations of MNIST
 AlignMNIST (2016) – provides 75–150 epochs of deformed MNIST
(and let me say a special personal thanks to Søren Hauberg for providing this recent data set)
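For context, InfiMNIST-style augmentation is built on elastic deformations: smooth a random displacement field and resample the image through it. A rough SciPy sketch of the idea (the `alpha` and `sigma` values are illustrative, not the settings any of these datasets actually use):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=8.0, sigma=3.0, rng=None):
    """Return a randomly deformed copy of a 2-D grayscale image."""
    rng = rng or np.random.default_rng()
    # Random displacement field, smoothed so that nearby pixels move together.
    dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(image.shape[0]),
                         np.arange(image.shape[1]), indexing="ij")
    coords = np.array([ys + dy, xs + dx])
    # Bilinear resampling at the displaced coordinates.
    return map_coordinates(image, coords, order=1, mode="reflect")
```

Each call produces a fresh training example, which is how an "infinite" training set arises in practice.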
I would like to try other methods, such as the Keras Data Augmentation library (see below), or even the recent data generation library coming out of OpenAI.
Current results are up for:
 2-Layer AlignMNIST, 75 epochs
 5-Layer AlignMNIST, 75 epochs
 2-Layer InfiMNIST, 500 epochs
 5-Layer InfiMNIST, 500 epochs
The initial results indicate that AlignMNIST is much better than InfiMNIST for this simple MLP, although I still do not see the extremely high, top-10 accuracy reported.
Furthermore, the 5-layer InfiMNIST run actually diverges after ~100 epochs. So we still need early stopping, even with an effectively infinite amount of data.
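Early stopping itself needs nothing fancy: a patience-based check on validation loss is enough. A framework-agnostic sketch (the patience value is illustrative):

```python
class EarlyStopping:
    """Stop training after `patience` epochs with no improvement in val loss."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```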
It may be interesting to try the Keras ImageDataGenerator class, described in this related blog post on “building powerful image classification models using very little data”.
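ImageDataGenerator applies random affine jitter (rotations, shifts, zooms, flips) to each batch on the fly. To show the kind of transform involved, here is a hand-rolled SciPy version of just the rotation-plus-shift part; the ranges are illustrative, not Keras defaults:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def random_affine(image, max_rot=10.0, max_shift=2.0, rng=None):
    """Randomly rotate (degrees) and translate (pixels) a 2-D image."""
    rng = rng or np.random.default_rng()
    angle = rng.uniform(-max_rot, max_rot)
    # reshape=False keeps the output the same size as the input.
    out = rotate(image, angle, reshape=False, order=1, mode="constant")
    offsets = rng.uniform(-max_shift, max_shift, size=2)
    return shift(out, offsets, order=1, mode="constant")
```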
Also note that the OpenAI group has released a new paper and code for generating data with generative adversarial networks (GANs).
I will periodically update this blog as new data comes in and I have the time to implement these newer techniques.
Next, we will check in the log files and discuss the TensorBoard results.
Comments, criticisms, and contributions are very welcome.