Natural Neural Networks

Authors: Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, koray kavukcuoglu

NeurIPS 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We highlight the benefits of our method on both unsupervised and supervised learning tasks, and showcase its scalability by training on the large-scale ImageNet Challenge dataset." "We begin with a set of diagnostic experiments which highlight the effectiveness of our method at improving conditioning. We also illustrate the impact of the hyper-parameters T and ε, controlling the frequency of the reparametrization and the size of the trust region. Section 4.2 evaluates PRONG on unsupervised learning problems, where models are both deep and fully connected. Section 4.3 then moves onto large convolutional models for image classification."
Researcher Affiliation | Industry | Google DeepMind, London; {gdesjardins,simonyan,razp,korayk}@google.com
Pseudocode | Yes | "Algorithm 1 Projected Natural Gradient Descent" (a hedged sketch of the reparametrization step behind this algorithm follows the table)
Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of open-source code for the described methodology.
Open Datasets | Yes | "scaling our method from standard deep auto-encoders to large convolutional models on ImageNet [20], trained across multiple GPUs. This is to our knowledge the first time a (non-diagonal) natural gradient algorithm is scaled to problems of this magnitude." "We train a small 3-layer MLP with tanh non-linearities, on a downsampled version of MNIST (10x10) [11]." "Results are presented on CIFAR-10 [9] and the ImageNet Challenge (ILSVRC12) datasets [20]."
Dataset Splits | Yes | "Model selection was performed on a held-out validation set of 5k examples." "On CIFAR-10, PRONG achieves better test error and converges faster. On ImageNet, PRONG+ achieves comparable validation error while maintaining a faster convergence rate."
Hardware Specification | No | "trained across multiple GPUs" and "Eight GPUs were used for computing gradients and estimating model statistics." While GPUs are mentioned, no specific models or other hardware details (CPU, RAM) are provided.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | "Figures 3a-3b highlight the effect of the eigenvalue regularization term ε and the reparametrization interval T." "Note that these timing numbers reflect performance under the optimal choice of hyper-parameters, which in the case of batch normalization yielded a batch size of 256, compared to 128 for all other methods." "The model was trained on 24×24 random crops with random horizontal reflections." (a crop-and-flip sketch is shown after the table)
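To make the Pseudocode row more concrete, below is a minimal NumPy sketch of the whitening and re-projection step that Projected Natural Gradient Descent repeats every T updates, assuming each layer computes f(V U (x - c) + d) with U and c the layer's whitening matrix and input mean. The function names and the eigendecomposition-based inverse square root are illustrative assumptions, not code from the paper.

```python
import numpy as np

def estimate_whitening(acts, eps):
    """Whitening transform U = (Sigma + eps*I)^(-1/2) and mean c estimated
    from a batch of layer activations (one example per row). Names and the
    eigendecomposition route are assumed for illustration."""
    c = acts.mean(axis=0)
    centered = acts - c
    sigma = centered.T @ centered / acts.shape[0]
    w, Q = np.linalg.eigh(sigma + eps * np.eye(sigma.shape[0]))
    U = Q @ np.diag(w ** -0.5) @ Q.T
    return U, c

def reproject_layer(V, d, U_old, c_old, U_new, c_new):
    """Re-express the trainable parameters (V, d) in the new whitened basis so
    that V_new U_new (x - c_new) + d_new equals V U_old (x - c_old) + d for all x."""
    W = V @ U_old                        # canonical (unwhitened) weights
    V_new = W @ np.linalg.inv(U_new)
    d_new = d + W @ (c_new - c_old)
    return V_new, d_new

# Schematic amortized loop: every T SGD updates, re-estimate (U, c) from a
# batch of activations feeding each layer, then re-project that layer's (V, d).
```

Because the re-projection leaves the network function unchanged, only the coordinate system seen by SGD changes, and the eigendecompositions are needed just once every T updates; this amortization is what the quoted experiments on T and ε are probing.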
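The crop-and-flip augmentation quoted in the Experiment Setup row (24×24 random crops with random horizontal reflections) can be sketched as follows; this NumPy version, its function name, and its parameters are assumptions for illustration, not the authors' pipeline.

```python
import numpy as np

def random_crop_and_flip(image, crop=24, rng=np.random):
    """Random crop to crop x crop pixels plus a random horizontal flip.
    `image` is an H x W x C array; crop=24 matches the quoted setup."""
    h, w = image.shape[:2]
    top = rng.randint(0, h - crop + 1)
    left = rng.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.rand() < 0.5:
        patch = patch[:, ::-1]           # horizontal reflection
    return patch
```

For 32×32 CIFAR-10 images this yields 24×24 training patches; at evaluation time a fixed central crop would typically be used instead.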