Scaling up Natural Gradient by Sparsely Factorizing the Inverse Fisher Matrix

Authors: Roger Grosse, Ruslan Salakhutdinov

ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment (variable, result, and the LLM response with supporting excerpt from the paper):

Research Type: Experimental
"In this section, we first evaluate FANG by comparing the accuracy of the approximation G_fac with various generic approximations to PSD matrices. Next, we evaluate its ability to train binary restricted Boltzmann machines as generative models, compared with SGD, both with and without the centering trick. ... Our RBM training experiments were conducted on two datasets: the MNIST handwritten digit dataset... and the more complex Omniglot dataset..."

Researcher Affiliation: Academia
"Roger B. Grosse (rgrosse@cs.toronto.edu), Ruslan Salakhutdinov (rsalakhu@cs.toronto.edu), Department of Computer Science, University of Toronto"

Pseudocode: Yes
"Algorithm 1: Factorized Natural Gradient (FANG) for binary RBMs" (a hedged sketch of the update this algorithm approximates appears after this table)

Open Source Code: No
The paper does not provide a link or an explicit statement about the availability of open-source code for the described methodology.

Open Datasets: Yes
"Our RBM training experiments were conducted on two datasets: the MNIST handwritten digit dataset... and the more complex Omniglot dataset of handwritten characters in a variety of world languages (Lake et al., 2013)."

Dataset Splits: No
"Our RBM training experiments were conducted on two datasets: the MNIST handwritten digit dataset... and the more complex Omniglot dataset... (Lake et al., 2013). ... We used 2000 PCD particles, mini-batches of size 2000, and a learning rate schedule of α√(γ/(γ + t)), where t is the update count, γ = 1000, and α was tuned separately for each algorithm."

Hardware Specification: No
"Our implementation made use of the CUDAMat (Mnih, 2009) and Gnumpy (Tieleman, 2010) libraries for GPU linear algebra operations."

Software Dependencies: No
"Our implementation made use of the CUDAMat (Mnih, 2009) and Gnumpy (Tieleman, 2010) libraries for GPU linear algebra operations." (a brief, assumed usage sketch of these libraries follows the table)

Experiment Setup: Yes
"We used 2000 PCD particles, mini-batches of size 2000, and a learning rate schedule of α√(γ/(γ + t)), where t is the update count, γ = 1000, and α was tuned separately for each algorithm." (a short sketch of this schedule follows the table)
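
The paper's Algorithm 1 (FANG for binary RBMs) is not reproduced in this report. For reference, the sketch below shows the generic damped natural-gradient step that FANG approximates: the Fisher matrix is estimated from per-sample sufficient statistics (e.g., the PCD particles) and the gradient is preconditioned by its inverse. FANG's contribution is to replace the dense inverse with a sparse Gaussian graphical model fit to those statistics; the dense solve here, along with all function and parameter names, is an illustrative assumption, not the authors' implementation.

import numpy as np

def damped_natural_gradient_step(theta, grad, stats, lr=0.1, damping=1e-2):
    """One generic natural-gradient step for an exponential-family model
    such as a binary RBM.

    stats: array of shape (num_samples, num_params); each row holds the
    sufficient statistics of one model sample (e.g., one PCD particle).
    The empirical Fisher matrix is the covariance of these statistics.
    FANG avoids forming and inverting this matrix explicitly by fitting a
    sparse Gaussian graphical model to the statistics; this sketch uses a
    dense, damped solve purely to illustrate the update being approximated.
    """
    centered = stats - stats.mean(axis=0, keepdims=True)
    fisher = centered.T @ centered / stats.shape[0]   # empirical Fisher estimate
    fisher += damping * np.eye(fisher.shape[0])       # damping for numerical stability
    nat_grad = np.linalg.solve(fisher, grad)          # approximate G^{-1} * gradient
    return theta + lr * nat_grad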
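
The learning rate schedule quoted in the Dataset Splits and Experiment Setup rows, α√(γ/(γ + t)) with γ = 1000, t the update count, and α tuned separately for each algorithm, can be written directly. The function name below is an illustrative choice, not from the paper.

import math

def learning_rate(t, alpha, gamma=1000.0):
    """Learning rate at update count t: alpha * sqrt(gamma / (gamma + t)).

    gamma = 1000 matches the value reported in the paper; alpha was tuned
    separately for each algorithm, so it is left as a required argument.
    """
    return alpha * math.sqrt(gamma / (gamma + t))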
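
For the Software Dependencies row: the paper states only that CUDAMat and Gnumpy were used for GPU linear algebra, without further detail. The snippet below is a minimal, assumed usage sketch of Gnumpy's numpy-like interface (garray, dot, as_numpy_array) to show the kind of call such an implementation would rely on; it is not the authors' code, and the exact shapes are arbitrary.

import numpy as np
import gnumpy as gnp  # Gnumpy (Tieleman, 2010); uses CUDAMat for its GPU kernels

# Move two matrices to the GPU and multiply them there (assumed usage).
a = gnp.garray(np.random.randn(512, 256))
b = gnp.garray(np.random.randn(256, 128))
c = gnp.dot(a, b)              # GPU matrix multiply
result = c.as_numpy_array()    # copy the product back to host memory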