Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks
Authors: Guodong Zhang, James Martens, Roger B. Grosse
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of the network's output for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks, we prove that these two conditions do in fact hold throughout the training, under the assumptions of nondegenerate inputs and overparameterization. (A hedged code sketch of this natural-gradient update appears after the table.) |
| Researcher Affiliation | Collaboration | Guodong Zhang (1,2), James Martens (3), Roger Grosse (1,2); (1) University of Toronto, (2) Vector Institute, (3) DeepMind; {gdzhang, rgrosse}@cs.toronto.edu, jamesmartens@google.com |
| Pseudocode | No | No pseudocode or algorithm block was found. |
| Open Source Code | No | The paper does not provide an unambiguous statement or link to open-source code for the methodology described. |
| Open Datasets | No | Figure 1: Visualization of natural gradient update and gradient descent update in the output space (for a randomly initialized network). We take two classes (4 and 9) from MNIST [LeCun et al., 1998] and generate the targets (denoted as stars in the figure) by f(x) = 0.5x + 0.3·N(0, I), where x ∈ ℝ² is the one-hot target. This is an illustrative visualization, not a primary experimental setup, and no direct access information (link, DOI) is provided for the data it uses. (A short sketch of this target-generation step appears after the table.) |
| Dataset Splits | No | The paper focuses on theoretical analysis and proofs, not empirical experiments with specified training, validation, and test dataset splits. |
| Hardware Specification | No | No specific hardware (e.g., GPU models, CPU types, memory) used for any computations or visualizations is mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x) are mentioned in the paper. |
| Experiment Setup | No | The paper is theoretical and focuses on convergence analysis. While it discusses quantities such as a step size of O(1), these are theoretical bounds rather than concrete hyperparameters for an empirical experiment. No specific details about training configurations or system-level settings needed for experimental reproduction are provided. |
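
To make the update described in the Research Type row concrete, here is a minimal sketch, assuming the exact natural-gradient step with the pseudo-inverse Fisher for squared-error loss, i.e. θ ← θ − η·Jᵀ(JJᵀ)⁻¹(u − y), applied to a two-layer ReLU network of the kind the paper analyzes. This is not the authors' code (none is released); the function names, shapes, toy data, and the small damping term are all illustrative assumptions.

```python
# Illustrative sketch (not the authors' implementation): one exact natural-gradient
# step for a two-layer ReLU network with squared-error loss, written with the
# pseudo-inverse Fisher so the step is theta <- theta - eta * J^T (J J^T)^{-1} (u - y).
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_params(key, d_in, m):
    """Two-layer ReLU net f(x) = (1/sqrt(m)) * a^T relu(W x) with scalar output."""
    k1, k2 = jax.random.split(key)
    return {"W": jax.random.normal(k1, (m, d_in)), "a": jax.random.normal(k2, (m,))}

def predict(params, X):
    """Network outputs u in R^n for all n training inputs."""
    h = jnp.maximum(X @ params["W"].T, 0.0)            # (n, m) hidden activations
    return h @ params["a"] / jnp.sqrt(params["W"].shape[0])

def natural_gradient_step(params, X, y, eta=1.0, damping=1e-6):
    # Flatten parameters so the Jacobian J of outputs w.r.t. parameters is (n, p).
    flat, unravel = ravel_pytree(params)
    f = lambda theta: predict(unravel(theta), X)
    u = f(flat)
    J = jax.jacfwd(f)(flat)                             # (n, p) Jacobian
    G = J @ J.T + damping * jnp.eye(X.shape[0])         # Gram matrix J J^T
    delta = J.T @ jnp.linalg.solve(G, u - y)            # J^T (J J^T)^{-1} (u - y)
    return unravel(flat - eta * delta)

# Toy usage with assumed shapes and random data.
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (8, 2))
y = 0.5 * X[:, 0] + 0.3 * jax.random.normal(key, (8,))
params = natural_gradient_step(init_params(key, 2, 64), X, y)
```

The Gram matrix `G = J @ J.T` is exactly the object that conditions (1) and (2) constrain (full row rank of J and its stability near initialization), and the default `eta=1.0` mirrors the O(1) step size the analysis permits under over-parameterization.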
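
Likewise, the target-generation formula quoted from the Figure 1 caption in the Open Datasets row, f(x) = 0.5x + 0.3·N(0, I) applied to one-hot labels x ∈ ℝ², can be written out as a small sketch. The random seed and the stand-in labels below are assumptions, and loading the actual MNIST digits 4 and 9 is omitted because the paper gives no access details.

```python
import jax
import jax.numpy as jnp

def make_targets(key, labels):
    """Regression targets per the Figure 1 caption: f(x) = 0.5*x + 0.3*N(0, I),
    with x in R^2 the one-hot encoding of the binary class label."""
    x = jnp.eye(2)[labels]                                   # one-hot targets, shape (n, 2)
    return 0.5 * x + 0.3 * jax.random.normal(key, x.shape)   # add isotropic Gaussian noise

# e.g. labels in {0, 1} standing in for the two MNIST classes (digits 4 and 9)
targets = make_targets(jax.random.PRNGKey(0), jnp.array([0, 1, 1, 0]))
```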