Width Provably Matters in Optimization for Deep Linear Neural Networks

Authors: Simon Du, Wei Hu

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We prove that for an L-layer fully-connected linear neural network, if the width of every hidden layer is Ω̃(L · r · d_out · κ³), where r and κ are the rank and the condition number of the input data, and d_out is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an ε-suboptimal solution is O(κ log(1/ε)). Our polynomial upper bound on the total running time for wide deep linear networks and the exp(Ω(L)) lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models. (A minimal numerical sketch of this convergence claim follows the table.)
Researcher Affiliation | Academia | ¹Carnegie Mellon University, Pittsburgh, PA, USA; ²Princeton University, Princeton, NJ, USA.
Pseudocode | No | The paper contains mathematical derivations and theoretical analyses but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statements about releasing code or providing a link to source code.
Open Datasets | No | The paper is purely theoretical and does not use or reference any specific datasets for training experiments. It refers to 'input data X' as a mathematical construct.
Dataset Splits | No | The paper is theoretical and does not describe any dataset splits for validation or other purposes, as it does not conduct experiments.
Hardware Specification | No | The paper is theoretical and does not describe any hardware specifications used for experiments.
Software Dependencies | No | The paper is theoretical and does not list any software dependencies with specific version numbers, as it does not describe experimental implementations.
Experiment Setup | No | The paper is theoretical and does not describe any specific experimental setup details such as hyperparameters or training settings, as it does not conduct experiments.
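The convergence statement quoted in the Research Type row lends itself to a small numerical illustration. The sketch below is not taken from the paper: all dimensions, the N(0, 1/m) initialization scaling, and the step size are illustrative assumptions. It simply runs gradient descent from Gaussian random initialization on a deep linear network with wide hidden layers, fit to synthetic linear targets with the squared loss, and prints the loss, which should decay at a roughly linear (geometric) rate.

```python
import numpy as np

# Minimal sketch (not the paper's construction or constants): gradient descent
# on an L-layer *linear* network f(x) = W_L ... W_2 W_1 x with wide hidden
# layers and Gaussian random initialization, trained on synthetic data with
# the squared loss. Sizes, init scale, and step size are assumptions chosen
# only to make the geometric loss decay visible.

rng = np.random.default_rng(0)

n, d_in, d_out = 100, 10, 5      # samples, input dim, output dim
L, m = 4, 200                    # depth and hidden width ("wide": m >> d_out)

X = rng.standard_normal((d_in, n))
W_star = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
Y = W_star @ X                   # realizable linear targets

# i.i.d. N(0, 1/m) entries (assumed scaling): with large m the partial
# products of layers stay close to isometries on the relevant subspaces.
dims = [d_in] + [m] * (L - 1) + [d_out]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(m) for i in range(L)]

eta = 3e-4                       # small constant step size (assumed)
for t in range(501):
    # Forward pass, keeping intermediate activations for backprop.
    acts = [X]
    for W in Ws:
        acts.append(W @ acts[-1])
    R = acts[-1] - Y             # residual of the loss 0.5 * ||f(X) - Y||_F^2

    if t % 100 == 0:
        print(f"iter {t:4d}   loss {0.5 * np.linalg.norm(R) ** 2:.3e}")

    # Backward pass through the product of linear maps.
    G = R
    grads = [None] * L
    for i in reversed(range(L)):
        grads[i] = G @ acts[i].T  # gradient w.r.t. the i-th weight matrix
        G = Ws[i].T @ G

    for i in range(L):
        Ws[i] -= eta * grads[i]
```

Shrinking m toward d_out or making X poorly conditioned typically slows the decay markedly, which is consistent with the dependence on the width and the condition number κ in the quoted bound.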