Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks

Authors: Guodong Zhang, James Martens, Roger B. Grosse

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of the network's output for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks, we prove that these two conditions do in fact hold throughout the training, under the assumptions of nondegenerate inputs and overparameterization. (A minimal sketch of the analyzed update appears after the table.)
Researcher Affiliation | Collaboration | Guodong Zhang (1,2), James Martens (3), Roger Grosse (1,2); 1: University of Toronto, 2: Vector Institute, 3: DeepMind; {gdzhang, rgrosse}@cs.toronto.edu, jamesmartens@google.com
Pseudocode | No | No pseudocode or algorithm block was found.
Open Source Code | No | The paper does not provide an unambiguous statement or link to open-source code for the methodology described.
Open Datasets | No | Figure 1 of the paper ("Visualization of natural gradient update and gradient descent update in the output space (for a randomly initialized network)") takes two classes (4 and 9) from MNIST [LeCun et al., 1998] and generates the targets (denoted as stars in the figure) by f(x) = x · 0.5 + 0.3 · N(0, I), where x ∈ R^2 is the one-hot target. This is an illustrative visualization, not a primary experimental setup, and no direct access information (link, DOI) is provided for the data used in this visualization. (A sketch of this target construction appears after the table.)
Dataset Splits | No | The paper focuses on theoretical analysis and proofs, not empirical experiments with specified training, validation, and test dataset splits.
Hardware Specification | No | No specific hardware (e.g., GPU models, CPU types, memory) used for any computations or visualizations is mentioned in the paper.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x) are mentioned in the paper.
Experiment Setup | No | The paper is theoretical and focuses on convergence analysis. While it discusses quantities such as an O(1) step size, these are theoretical bounds rather than concrete hyperparameters for an empirical experiment. No specific details about training configurations or system-level settings for experimental reproduction are provided.
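For the convergence result summarized in the Research Type row, the following is a minimal NumPy sketch (not the authors' code) of one exact natural gradient step, theta <- theta - eta * J^T (J J^T)^{-1} (u - y), on a two-layer ReLU network with squared-error loss. The width, data, 1/sqrt(m) output scaling, fixed +/-1 output weights, and the jacobian helper are all illustrative assumptions; the paper itself provides no implementation.

```python
# Minimal sketch (not the authors' code): one step of exact natural gradient
# descent on a two-layer ReLU network u_i = (1/sqrt(m)) * a^T relu(W x_i)
# with squared-error loss. Only W is trained; a is fixed (an assumed setup).
import numpy as np

n, d, m = 50, 10, 4096                  # samples, input dim, hidden width (assumed)
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm ("nondegenerate") inputs
y = rng.standard_normal(n)                        # regression targets

W = rng.standard_normal((m, d))                   # trainable first-layer weights
a = rng.choice([-1.0, 1.0], size=m)               # fixed output weights

def forward(W):
    """Network outputs u(W) in R^n."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def jacobian(W):
    """J in R^{n x (m*d)} with row i equal to du_i / d vec(W)."""
    pre = X @ W.T                                 # (n, m) pre-activations
    act = (pre > 0) * a / np.sqrt(m)              # (n, m): a_r * 1[pre > 0] / sqrt(m)
    return (act[:, :, None] * X[:, None, :]).reshape(n, m * d)

eta = 1.0                                         # O(1) step size, as in the theory
u, J = forward(W), jacobian(W)
# Min-norm natural gradient direction J^T (J J^T)^{-1} (u - y); the n x n
# system is solvable when the Jacobian has full row rank (condition (1) above).
step = J.T @ np.linalg.solve(J @ J.T, u - y)
W = W - eta * step.reshape(m, d)

print("squared-error loss before/after:",
      0.5 * np.sum((u - y) ** 2), 0.5 * np.sum((forward(W) - y) ** 2))
```

The small n x n linear solve is only feasible here because the Fisher matrix for squared-error loss reduces to a Gram matrix built from the network Jacobian; this sketch is meant to make the two conditions in the abstract concrete, not to reproduce the paper's analysis.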
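The target construction quoted in the Open Datasets row can likewise be sketched in a few lines. Real MNIST loading is replaced with placeholder labels, since the report notes that no access information for the data is given.

```python
# Sketch of the Figure 1 target construction: take two MNIST classes (4 and 9)
# and set targets f(x) = x * 0.5 + 0.3 * N(0, I), with x in R^2 the one-hot label.
# The labels below are placeholders; any MNIST loader would supply the real ones.
import numpy as np

rng = np.random.default_rng(0)
digits = rng.choice([4, 9], size=128)                            # placeholder labels
x = np.stack([digits == 4, digits == 9], axis=1).astype(float)   # one-hot in R^2
targets = x * 0.5 + 0.3 * rng.standard_normal(x.shape)           # "stars" in Figure 1
```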