Natural Neural Networks
Authors: Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, koray kavukcuoglu
NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We highlight the benefits of our method on both unsupervised and supervised learning tasks, and showcase its scalability by training on the large-scale ImageNet Challenge dataset. We begin with a set of diagnostic experiments which highlight the effectiveness of our method at improving conditioning. We also illustrate the impact of the hyper-parameters T and ε, controlling the frequency of the reparametrization and the size of the trust region. Section 4.2 evaluates PRONG on unsupervised learning problems, where models are both deep and fully connected. Section 4.3 then moves onto large convolutional models for image classification. |
| Researcher Affiliation | Industry | Google DeepMind, London {gdesjardins,simonyan,razp,korayk}@google.com |
| Pseudocode | Yes | Algorithm 1 Projected Natural Gradient Descent (see the sketch after this table) |
| Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of open-source code for the described methodology. |
| Open Datasets | Yes | scaling our method from standard deep auto-encoders to large convolutional models on ImageNet [20], trained across multiple GPUs. This is to our knowledge the first time a (non-diagonal) natural gradient algorithm is scaled to problems of this magnitude. We train a small 3-layer MLP with tanh non-linearities, on a downsampled version of MNIST (10x10) [11]. Results are presented on CIFAR-10 [9] and the ImageNet Challenge (ILSVRC12) datasets [20]. |
| Dataset Splits | Yes | Model selection was performed on a held-out validation set of 5k examples. On CIFAR-10, PRONG achieves better test error and converges faster. On ImageNet, PRONG+ achieves comparable validation error while maintaining a faster convergence rate. |
| Hardware Specification | No | Trained across multiple GPUs; eight GPUs were used for computing gradients and estimating model statistics. While GPUs are mentioned, no specific models or other hardware details (CPU, RAM) are provided. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | Figures 3a and 3b highlight the effect of the eigenvalue regularization term ε and the reparametrization interval T. Note that these timing numbers reflect performance under the optimal choice of hyper-parameters, which in the case of batch normalization yielded a batch size of 256, compared to 128 for all other methods. The model was trained on 24×24 random crops with random horizontal reflections. |
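
The Pseudocode row above refers to the paper's Algorithm 1 (PRONG), which periodically re-expresses each dense layer y = Wx + b as y = V(U(x − c)) + d, where c centers and U whitens the layer's input statistics, so that subsequent SGD steps on V and d approximate natural-gradient updates. The snippet below is a minimal NumPy sketch of that re-whitening step under those assumptions; the function names, the `eps` default, and the explicit matrix inverse are choices of this sketch, not code from the paper.

```python
# Hedged sketch of the layer re-whitening step described in the paper's
# Algorithm 1 (PRONG). Names and defaults here are illustrative assumptions.
import numpy as np

def whitening_params(X, eps=1e-2):
    """Estimate a centering vector c and whitening matrix U from a batch of
    layer inputs X (rows are samples). eps regularizes the eigenvalues and
    plays the role of the trust-region term in the paper."""
    c = X.mean(axis=0)
    cov = np.cov(X - c, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # U = (Sigma + eps * I)^(-1/2), built from the eigendecomposition.
    U = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return c, U

def reparametrize_layer(W, b, X, eps=1e-2):
    """Re-express (W, b) as (V, d, U, c) so that
    V @ (U @ (x - c)) + d == W @ x + b for every x,
    i.e. the network function is left unchanged."""
    c, U = whitening_params(X, eps)
    V = W @ np.linalg.inv(U)   # V U = W, so V = W U^{-1}
    d = b + W @ c              # absorb the centering shift into the bias
    return V, d, U, c
```

In this reading of the paper, plain SGD updates V and d between reparametrizations, and every T updates the statistics are re-estimated from a batch so each layer can be re-whitened; amortizing the whitening over T steps is what keeps the per-update cost close to that of standard SGD.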