Revisiting Natural Gradient for Deep Networks

Authors: Razvan Pascanu; Yoshua Bengio

ICLR 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally in section 10 we provide a benchmark of the algorithm discussed... We explore empirically these two hypotheses on the Toronto Face Dataset (TFD)... Fig. 1 shows the training and test error of a model trained on fold 4 of TFD... We carry out a benchmark on the Curves dataset, using the 6 layer deep auto-encoder from Martens (2010).
Researcher Affiliation | Academia | Razvan Pascanu, Université de Montréal, Montréal QC H3C 3J7, Canada, r.pascanu@gmail.com; Yoshua Bengio, Université de Montréal, Montréal QC H3C 3J7, Canada, yoshua.bengio@umontreal.ca
Pseudocode | Yes | The full pseudo-code of the algorithm (which is very similar to the one for Hessian-Free Optimization) is given below. Algorithm 2 Pseudocode for natural gradient descent algorithm (a minimal sketch of this style of update is given after the table)
Open Source Code | Yes | Both Minres-QLP as well as linear conjugate gradient can be found implemented in Theano at https://github.com/pascanur/theano_optimize.
Open Datasets | Yes | We explore empirically these two hypotheses on the Toronto Face Dataset (TFD) (Susskind et al., 2010)... We repeat the experiment from Erhan et al. (2010), using the NISTP dataset introduced in Bengio et al. (2011)... We carry out a benchmark on the Curves dataset, using the 6 layer deep auto-encoder from Martens (2010).
Dataset Splits | Yes | Hyper-parameters have been selected using a grid search (more details in the appendix). ... based on the validation cost obtained for each configuration. ... In order to be fair to the two algorithms compared in the plot, natural gradient descent and stochastic gradient descent, we use the error on a different validation set as a measure of how much we moved in the functional space.
Hardware Specification | Yes | The benchmark is run on a GTX 580 Nvidia card, using Theano (Bergstra et al., 2010a) for cuda kernels.
Software Dependencies | No | The paper mentions software such as Theano and scipy.optimize.fmin_cobyla but does not provide version numbers for these dependencies, only citations for the underlying concepts or the libraries in general.
Experiment Setup | Yes | We used a two layer model, where the first layer is convolutional. It uses 512 filters of 14X14, and applies a sigmoid activation function. ... We ended up using a fixed learning rate of 0.2 (with no line search) and adapting the damping coefficient using the Levenberg-Marquardt heuristic. ... For the curves experiment we used a deep autoencoder with 400-200-100-50-25-6 hidden units respectively... For natural gradient ... uses small batches of 5000 examples and a fixed learning rate of 1.0. For the SGD case we use a smaller batch size of 100 examples. The optimum learning rate, obtained by a grid search, is 0.01. (the damping heuristic is sketched after the table)
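The Pseudocode and Open Source Code rows point to a truncated-Newton style update: the natural gradient direction is obtained by solving a damped Fisher system with linear conjugate gradient (or Minres-QLP), much as in Hessian-Free optimization. Below is a minimal NumPy sketch of that pattern, under stated assumptions: fisher_vector_product, grad, and params are hypothetical placeholders for the model-specific pieces, and the fixed learning rate of 0.2 with Levenberg-Marquardt damping simply mirrors the settings quoted in the Experiment Setup row. This is not the authors' Theano implementation.

```python
# Minimal sketch of one natural-gradient step in the truncated-Newton style
# described in the paper (Algorithm 2 is "very similar to ... Hessian-Free").
# fisher_vector_product, grad, params, and the hyper-parameter values are
# illustrative assumptions, not the authors' code.
import numpy as np


def conjugate_gradient(matvec, b, max_iters=50, tol=1e-8):
    """Solve A x = b given only the matrix-vector product matvec(v) = A v."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - A x (x starts at zero)
    p = r.copy()                 # current search direction
    rs_old = r @ r
    for _ in range(max_iters):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


def natural_gradient_step(params, grad, fisher_vector_product,
                          learning_rate=0.2, damping=1.0):
    """Solve (F + damping * I) d = grad with CG, then take a fixed-size step."""
    damped_fvp = lambda v: fisher_vector_product(v) + damping * v
    direction = conjugate_gradient(damped_fvp, grad)
    return params - learning_rate * direction
```

The repository quoted above also provides Minres-QLP as an alternative linear solver; swapping it in would only change the conjugate_gradient call in this sketch.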
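The Experiment Setup row also mentions adapting the damping coefficient with the Levenberg-Marquardt heuristic instead of using a line search. The sketch below shows the usual form of that heuristic from the Hessian-Free literature (Martens, 2010); the reduction-ratio thresholds and scaling factors are conventional values and are an assumption here, since the quoted text does not state them.

```python
# Hedged sketch of the Levenberg-Marquardt damping heuristic referenced in
# the experiment setup. The constants (0.25, 0.75, 3/2, 2/3) follow common
# Hessian-Free practice and are assumptions, not values quoted from the paper.
def update_damping(damping, actual_reduction, predicted_reduction):
    """Grow or shrink the damping based on how well the local quadratic
    model predicted the actual decrease of the objective."""
    rho = actual_reduction / predicted_reduction   # reduction ratio
    if rho > 0.75:            # model trusted: damp less, take bolder steps
        damping *= 2.0 / 3.0
    elif rho < 0.25:          # model poor: damp more, stay conservative
        damping *= 3.0 / 2.0
    return damping
```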