Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks
Authors: Guodong Zhang, James Martens, Roger B. Grosse
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of network's output for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks, we prove that these two conditions do in fact hold throughout the training, under the assumptions of nondegenerate inputs and overparameterization. |
| Researcher Affiliation | Collaboration | Guodong Zhang (1, 2), James Martens (3), Roger Grosse (1, 2). Affiliations: (1) University of Toronto, (2) Vector Institute, (3) DeepMind. |
| Pseudocode | No | No pseudocode or algorithm block was found. |
| Open Source Code | No | The paper does not provide an unambiguous statement or link to open-source code for the methodology described. |
| Open Datasets | No | Figure 1: Visualization of natural gradient update and gradient descent update in the output space (for a randomly initialized network). We take two classes (4 and 9) from MNIST [LeCun et al., 1998] and generate the targets (denoted as star in the figure) by $f(x) = x - 0.5 + 0.3\,\mathcal{N}(0, I)$ where $x \in \mathbb{R}^2$ is one-hot target. This is for an illustrative visualization, not a primary experimental setup, and no direct access information (link, DOI) is provided for the data used in this visualization. |
| Dataset Splits | No | The paper focuses on theoretical analysis and proofs, not empirical experiments with specified training, validation, and test dataset splits. |
| Hardware Specification | No | No specific hardware (e.g., GPU models, CPU types, memory) used for any computations or visualizations is mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x) are mentioned in the paper. |
| Experiment Setup | No | The paper is theoretical and focuses on convergence analysis. While it discusses properties like 'step size = O(1)', these are theoretical bounds rather than concrete hyperparameters for an empirical experiment. No specific details about training configurations or system-level settings for experimental reproduction are provided. |
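
The two conditions quoted under Research Type can be made concrete with a small sketch. The paper provides neither pseudocode nor code, so everything below is an illustrative assumption rather than the authors' implementation: a two-layer ReLU network of the form $f(x) = \frac{1}{\sqrt{m}} \sum_r a_r \,\mathrm{ReLU}(w_r^\top x)$ with fixed output weights `a`, only the first-layer weights `W` trained, and the natural gradient step written with the pseudo-inverse form $\Delta\theta = -\eta\, J^\top (J J^\top)^{-1} (u - y)$, which requires the Jacobian `J` to have full row rank. The helper names (`forward`, `jacobian_row`, `ngd_step`), the fixed weights, and the targets are hypothetical, and the step is restricted to two training points so that the Gram matrix can be inverted in closed form.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def forward(W, a, x):
    """Assumed two-layer ReLU network: (1/sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    m = len(W)
    return sum(a_r * max(0.0, dot(w_r, x)) for w_r, a_r in zip(W, a)) / math.sqrt(m)

def jacobian_row(W, a, x):
    """Gradient of the output at x w.r.t. the flattened first-layer weights."""
    m = len(W)
    row = []
    for w_r, a_r in zip(W, a):
        active = 1.0 if dot(w_r, x) > 0 else 0.0  # ReLU derivative
        row.extend(a_r * active * xi / math.sqrt(m) for xi in x)
    return row

def ngd_step(W, a, X, y, lr=1.0):
    """One natural-gradient step for exactly two training points.

    Returns (new_W, delta, J, residual). Raises ValueError when the
    Jacobian lacks full row rank, i.e. when the Gram matrix J J^T is singular.
    """
    if len(X) != 2:
        raise ValueError("this sketch handles exactly two training points")
    J = [jacobian_row(W, a, x) for x in X]
    residual = [forward(W, a, x) - yi for x, yi in zip(X, y)]

    # Gram matrix G = J J^T, a symmetric 2 x 2 matrix.
    g00, g01, g11 = dot(J[0], J[0]), dot(J[0], J[1]), dot(J[1], J[1])
    det = g00 * g11 - g01 * g01
    if det <= 1e-12:
        raise ValueError("Jacobian does not have full row rank")

    # Solve G c = residual with the closed-form 2 x 2 inverse.
    c0 = (g11 * residual[0] - g01 * residual[1]) / det
    c1 = (g00 * residual[1] - g01 * residual[0]) / det

    # delta = -lr * J^T c, then reshape back to the layout of W.
    delta = [-lr * (c0 * j0 + c1 * j1) for j0, j1 in zip(J[0], J[1])]
    d = len(X[0])
    new_W = [[w + delta[k * d + i] for i, w in enumerate(w_r)]
             for k, w_r in enumerate(W)]
    return new_W, delta, J, residual

# Hypothetical fixed weights (not a random initialization) and unit-norm,
# non-parallel inputs, chosen only so the example is reproducible.
W = [[0.5, -0.3], [-0.2, 0.8], [0.7, 0.4], [-0.6, -0.1]]
a = [1.0, -1.0, 1.0, -1.0]
X = [[1.0, 0.0], [0.6, 0.8]]
y = [1.0, -1.0]

new_W, delta, J, residual = ngd_step(W, a, X, y, lr=1.0)
print([dot(row, delta) for row in J])   # first-order change in the outputs
print([-r for r in residual])           # step needed to reach the targets
```

With `lr=1.0` the two printed lists agree up to floating-point rounding: to first order, the update moves the outputs straight to the targets in output space. Whether the actual network outputs follow this linearization depends on how many ReLU activation patterns change during the step, which is where the second condition (stability of the Jacobian near initialization) comes in.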