Revisiting Natural Gradient for Deep Networks
Authors: Razvan Pascanu; Yoshua Bengio
ICLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally in section 10 we provide a benchmark of the algorithm discussed... We explore empirically these two hypotheses on the Toronto Face Dataset (TFD)... Fig. 1 shows the training and test error of a model trained on fold 4 of TFD... We carry out a benchmark on the Curves dataset, using the 6 layer deep auto-encoder from Martens (2010). |
| Researcher Affiliation | Academia | Razvan Pascanu, Université de Montréal, Montréal, QC H3C 3J7, Canada, r.pascanu@gmail.com; Yoshua Bengio, Université de Montréal, Montréal, QC H3C 3J7, Canada, yoshua.bengio@umontreal.ca |
| Pseudocode | Yes | The full pseudo-code of the algorithm (which is very similar to the one for Hessian-Free Optimization) is given below. Algorithm 2 Pseudocode for natural gradient descent algorithm (an illustrative sketch of such an update is given after this table) |
| Open Source Code | Yes | Both Minres-QLP as well as linear conjugate gradient can be found implemented in Theano at https://github.com/pascanur/theano_optimize. |
| Open Datasets | Yes | We explore empirically these two hypotheses on the Toronto Face Dataset (TFD) (Susskind et al., 2010)... We repeat the experiment from Erhan et al. (2010), using the NISTP dataset introduced in Bengio et al. (2011)... We carry out a benchmark on the Curves dataset, using the 6 layer deep auto-encoder from Martens (2010). |
| Dataset Splits | Yes | Hyper-parameters have been selected using a grid search (more details in the appendix). ... based on the validation cost obtained for each configuration. ... In order to be fair to the two algorithms compared in the plot, natural gradient descent and stochastic gradient descent, we use the error on a different validation set as a measure of how much we moved in the functional space. |
| Hardware Specification | Yes | The benchmark is run on a GTX 580 Nvidia card, using Theano (Bergstra et al., 2010a) for cuda kernels. |
| Software Dependencies | No | The paper mentions software such as Theano and scipy.optimize.fmin_cobyla but does not provide specific version numbers for these dependencies, citing only the underlying concepts or the libraries in general. |
| Experiment Setup | Yes | We used a two layer model, where the first layer is convolutional. It uses 512 filters of 14X14, and applies a sigmoid activation function. ... We ended up using a fixed learning rate of 0.2 (with no line search) and adapting the damping coefficient using the Levenberg-Marquardt heuristic. ... For the curves experiment we used a deep autoencoder with 400-200-100-50-25-6 hidden units respectively... For natural gradient ... uses small batches of 5000 examples and a fixed learning rate of 1.0. For the SGD case we use a smaller batch size of 100 examples. The optimum learning rate, obtained by a grid search, is 0.01. (A hedged configuration sketch of the Curves setup follows this table.) |
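As a reading aid for the Pseudocode row, the snippet below sketches one damped natural gradient update in which the metric system is solved with linear conjugate gradient, mirroring the structure the paper describes as "very similar to Hessian-Free Optimization". It is a minimal sketch, not the paper's Algorithm 2: the callables `grad_fn` and `fisher_vp`, the damping value, and the helper names are assumptions; only the fixed learning rate of 0.2 with no line search echoes the quoted setup.

```python
# Minimal sketch of a damped natural gradient step (illustrative, not the
# paper's Algorithm 2). The model is exposed only through two callables:
#   grad_fn(theta)      -> gradient of the loss at theta
#   fisher_vp(theta, v) -> Fisher (metric) matrix times vector v at theta
import numpy as np

def conjugate_gradient(Avp, b, max_iter=50, tol=1e-6):
    """Solve A x = b with linear CG, where A is available only through
    the matrix-vector product callable Avp(v)."""
    x = np.zeros_like(b)
    r = b - Avp(x)          # residual; equals b since x starts at zero
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = Avp(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def natural_gradient_step(theta, grad_fn, fisher_vp, lr=0.2, damping=1e-3):
    """One update: solve (F + damping * I) d = g, then take a fixed-size
    step along d (no line search, as in the quoted TFD experiment)."""
    g = grad_fn(theta)
    damped_fvp = lambda v: fisher_vp(theta, v) + damping * v
    direction = conjugate_gradient(damped_fvp, g)
    return theta - lr * direction
```

The paper additionally adapts the damping coefficient with the Levenberg-Marquardt heuristic and implements MinRes-QLP as an alternative to plain linear CG; neither refinement is shown in this sketch.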
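The Experiment Setup row quotes the Curves benchmark configuration: a deep autoencoder with 400-200-100-50-25-6 hidden units, natural gradient with batches of 5000 examples and a learning rate of 1.0, and SGD with batches of 100 examples and a learning rate of 0.01. The sketch below only collects those quoted numbers; the 784-dimensional input, the weight initialization, the linear 6-unit code layer, and every function name are assumptions for illustration.

```python
# Illustrative reconstruction of the Curves autoencoder setup; layer sizes
# and optimizer hyper-parameters are the ones quoted above, everything else
# (input size, initialization, activation per layer) is an assumption.
import numpy as np

rng = np.random.default_rng(0)

# Encoder: 784-dimensional Curves input down to a 6-unit code (assumed input size).
layer_sizes = [784, 400, 200, 100, 50, 25, 6]

def init_layers(sizes):
    return [(rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(x, layers):
    h = x
    for i, (W, b) in enumerate(layers):
        z = h @ W + b
        # Sigmoid on hidden layers; linear code layer (an assumption,
        # following the Martens (2010) Curves setup the paper reuses).
        h = z if i == len(layers) - 1 else sigmoid(z)
    return h

# Optimizer settings quoted in the Experiment Setup row.
natural_gradient_cfg = dict(batch_size=5000, learning_rate=1.0)
sgd_cfg = dict(batch_size=100, learning_rate=0.01)

encoder = init_layers(layer_sizes)
batch = rng.normal(size=(sgd_cfg["batch_size"], layer_sizes[0]))
code = encode(batch, encoder)
print(code.shape)  # (100, 6)
```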