An Improved Analysis of Training Over-parameterized Deep Neural Networks

Authors: Difan Zou, Quanquan Gu

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for overparameterized (i.e., sufficiently wide) deep neural networks. However, the condition on the width of the neural network to ensure the global convergence is very stringent, which is often a high-degree polynomial in the training sample size n (e.g., O(n^24)). In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis include (a) a tighter gradient lower bound that leads to a faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm.
Researcher Affiliation | Academia | Difan Zou, Department of Computer Science, University of California, Los Angeles, CA 90095, knowzou@cs.ucla.edu; Quanquan Gu, Department of Computer Science, University of California, Los Angeles, CA 90095, qgu@cs.ucla.edu
Pseudocode | Yes | Algorithm 1: "(Stochastic) Gradient descent with Gaussian random initialization" (see the illustrative sketch after this table).
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper states properties of the "training data points" (e.g., Assumption 3.1 requires ||x_i||_2 = 1 and (x_i)_d = µ for every x_i) and refers to the "training sample size n." However, it does not use, or provide access information for, any specific named public dataset (e.g., MNIST, CIFAR-10) with citations or links.
Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits. The research is theoretical and does not involve empirical data splitting for experiments.
Hardware Specification | No | The paper is theoretical and does not discuss any specific hardware used for running experiments.
Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed for replication.
Experiment Setup | No | The paper defines the step size η within its theoretical algorithm description, but it does not provide concrete hyperparameter values, training configurations, or system-level settings of the kind found in an empirical experiment-setup section.
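
Below is a minimal sketch of the kind of procedure named in the Pseudocode row, "(Stochastic) Gradient descent with Gaussian random initialization" (Algorithm 1). It is not the authors' implementation: the depth L, width m, step size eta, the He-style N(0, 2/m) initialization variance, the squared loss, the frozen output layer, the full-batch updates, and the synthetic unit-norm inputs are all illustrative assumptions standing in for the paper's general setting.

```python
# Minimal sketch in the spirit of Algorithm 1, "(Stochastic) Gradient descent with
# Gaussian random initialization". All concrete choices (width m, depth L, step size
# eta, N(0, 2/m) hidden-layer variance, squared loss, fixed output layer, full-batch
# updates, synthetic unit-norm data) are illustrative assumptions, not the paper's
# exact setup.
import numpy as np


def init_gaussian(d, m, L, rng):
    """Gaussian random initialization of an L-hidden-layer ReLU network."""
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]                 # input layer
    Ws += [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L - 1)]
    v = rng.normal(0.0, np.sqrt(1.0 / m), size=m)                         # output layer (kept fixed below)
    return Ws, v


def forward(Ws, v, x):
    """Forward pass; returns the scalar output and cached (pre-activation, activation) pairs."""
    hs, h = [], x
    for W in Ws:
        z = W @ h
        h = np.maximum(z, 0.0)   # ReLU
        hs.append((z, h))
    return float(v @ h), hs


def gd_step(Ws, v, X, y, eta):
    """One full-batch gradient descent step on the squared loss, via manual backprop.

    Only the hidden-layer weights are updated; the output layer v stays at its random
    initialization (a simplification assumed here, not necessarily the paper's setup).
    Replacing the full batch with a random mini-batch gives the stochastic variant.
    """
    grads = [np.zeros_like(W) for W in Ws]
    for x, yi in zip(X, y):
        out, hs = forward(Ws, v, x)
        delta = (out - yi) * v                       # gradient w.r.t. the last hidden activation
        for l in range(len(Ws) - 1, -1, -1):
            z, _ = hs[l]
            delta = delta * (z > 0)                  # ReLU derivative
            h_prev = x if l == 0 else hs[l - 1][1]
            grads[l] += np.outer(delta, h_prev)
            delta = Ws[l].T @ delta                  # propagate to the previous layer
    return [W - eta * g / len(X) for W, g in zip(Ws, grads)]


# Toy run on synthetic data (illustrative only).
rng = np.random.default_rng(0)
n, d, m, L, eta = 8, 10, 256, 3, 0.1
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm inputs, echoing ||x_i||_2 = 1
y = rng.choice([-1.0, 1.0], size=n)
Ws, v = init_gaussian(d, m, L, rng)
for _ in range(50):
    Ws = gd_step(Ws, v, X, y, eta)
```

The paper's contribution is the analysis rather than the algorithm: it studies how wide m must be, as a polynomial in the sample size n and other problem-dependent parameters, for such a run to converge to a global minimum of the training loss. The sketch only illustrates the training procedure, not the over-parameterization regime required by the theory.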