Width Provably Matters in Optimization for Deep Linear Neural Networks
Authors: Simon Du, Wei Hu
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\tilde{\Omega}(L \cdot r \cdot d_{out} \cdot \kappa^3)$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{out}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp(\Omega(L))$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models. (A minimal illustrative sketch of this setting appears after the table.) |
| Researcher Affiliation | Academia | ¹Carnegie Mellon University, Pittsburgh, PA, USA; ²Princeton University, Princeton, NJ, USA. |
| Pseudocode | No | The paper contains mathematical derivations and theoretical analyses but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statements about releasing code or providing a link to source code. |
| Open Datasets | No | The paper is purely theoretical and does not use or reference any specific datasets for training experiments. It refers to 'input data X' as a mathematical construct. |
| Dataset Splits | No | The paper is theoretical and does not describe any dataset splits for validation or other purposes, as it does not conduct experiments. |
| Hardware Specification | No | The paper is theoretical and does not describe any hardware specifications used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not list any software dependencies with specific version numbers, as it does not describe experimental implementations. |
| Experiment Setup | No | The paper is theoretical and does not describe any specific experimental setup details such as hyperparameters or training settings, as it does not conduct experiments. |
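
The result summarized above concerns gradient descent on the squared loss of a deep linear network $f(x) = W_L \cdots W_1 x$ under Gaussian random initialization with wide hidden layers. Since the paper provides no code, the NumPy sketch below is our own illustration of that setting, not the authors' implementation; the width `m`, initialization scale, step size `eta`, and iteration count are illustrative choices and do not implement the paper's $\tilde{\Omega}(L \cdot r \cdot d_{out} \cdot \kappa^3)$ width requirement or its step-size constants.

```python
# Minimal illustrative sketch (not the paper's code): gradient descent on a
# deep *linear* network W_L ... W_1 with wide hidden layers and Gaussian
# random initialization, on a realizable linear regression problem.
# All sizes and the step size below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

n, d_in, d_out = 50, 20, 5            # samples, input dim, output dim
L, m = 4, 256                         # depth and (wide) hidden width

X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, d_in)) @ X   # realizable linear targets

# Gaussian initialization with entry variance 1/m (norm-preserving in
# expectation); the paper's exact initialization differs in constants.
dims = [d_in] + [m] * (L - 1) + [d_out]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(m) for i in range(L)]

def forward(Ws, X):
    H = X
    for W in Ws:
        H = W @ H
    return H

def layer_grads(Ws, X, R):
    """Gradients of 0.5 * ||W_L ... W_1 X - Y||_F^2 w.r.t. each layer,
    given the residual R = W_L ... W_1 X - Y."""
    grads = []
    for i in range(len(Ws)):
        before = X
        for W in Ws[:i]:
            before = W @ before        # product of layers below i, applied to X
        after = np.eye(Ws[i].shape[0])
        for W in Ws[i + 1:]:
            after = W @ after          # product of layers above i
        grads.append(after.T @ R @ before.T)
    return grads

eta = 5e-4                             # illustrative step size
for t in range(1001):
    R = forward(Ws, X) - Y
    loss = 0.5 * np.sum(R ** 2)
    if t % 200 == 0:
        print(f"iter {t:4d}  loss {loss:.3e}")
    grads = layer_grads(Ws, X, R)
    Ws = [W - eta * g for W, g in zip(Ws, grads)]
```

On this small realizable problem the printed loss should decay roughly geometrically, mirroring the linear convergence rate the theorem guarantees for sufficiently wide networks.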