Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks
Authors: Spencer Frei, Yuan Cao, Quanquan Gu
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this work, we analyze overparameterized deep residual networks trained by gradient descent following random initialization, and demonstrate that (i) the class of networks learned by gradient descent constitutes a small subset of the entire neural network function class, and (ii) this subclass of networks is sufficiently large to guarantee small training error. By showing (i) we are able to demonstrate that deep residual networks trained with gradient descent have a small generalization gap between training and test error, and together with (ii) this guarantees that the test error will be small. Our optimization and generalization guarantees require overparameterization that is only logarithmic in the depth of the network, while all known generalization bounds for deep non-residual networks have overparameterization requirements that are at least polynomial in the depth. This provides an explanation for why residual networks are preferable to non-residual ones. |
| Researcher Affiliation | Academia | Department of Statistics, University of California, Los Angeles, CA 90095, USA (e-mail: spencerfrei@ucla.edu); Department of Computer Science, University of California, Los Angeles, CA 90095, USA (e-mail: yuancao@cs.ucla.edu); Department of Computer Science, University of California, Los Angeles, CA 90095, USA (e-mail: qgu@cs.ucla.edu) |
| Pseudocode | No | The paper describes the gradient descent optimization process using mathematical formulas and text, e.g., $\min_W L_S(W) := \frac{1}{n}\sum_{i=1}^{n} \ell(y_i f_W(x_i))$ and $W_l^{(k+1)} = W_l^{(k)} - \eta \nabla_{W_l} L_S(W^{(k)})$ for $l \in [L+1]$, but it does not include a dedicated section or block explicitly labeled 'Pseudocode' or 'Algorithm'. An illustrative sketch of this update appears after the table. |
| Open Source Code | No | The paper does not provide any links to source code repositories, nor does it explicitly state that code for the methodology is released. |
| Open Datasets | No | The paper assumes i.i.d. samples from a theoretical distribution $\mathcal{D}$ ('We assume we have i.i.d. samples $(x_i, y_i)_{i=1}^{n} \sim \mathcal{D}$ from a distribution $\mathcal{D}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$'), and makes assumptions about its properties (Assumption 3.2), but it does not reference or provide access to a specific public dataset for training. |
| Dataset Splits | No | The paper focuses on theoretical analysis and does not describe empirical experiments or specific train/validation/test data splits. |
| Hardware Specification | No | The paper is theoretical and does not specify any hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not specify any software dependencies or version numbers for an experimental setup. |
| Experiment Setup | No | The paper defines parameters and assumptions for its theoretical model (e.g., 'constant step size gradient descent', 'Gaussian initialization', 'residual scaling parameter θ = 1/Ω(L)', 'width m'), but it does not describe an empirical experimental setup with hyperparameters to reproduce, as the paper is purely theoretical. These quantities are illustrated in the sketch after the table. |
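
The paper releases no code, so the following is a minimal, hypothetical sketch of the kind of setup the rows above describe: a width-`m` ReLU residual network with Gaussian initialization and a per-block residual scaling `tau`, trained by constant step-size gradient descent on the logistic loss according to the quoted update $W_l^{(k+1)} = W_l^{(k)} - \eta \nabla_{W_l} L_S(W^{(k)})$. The architecture details and every numerical value (`n`, `d`, `m`, `L`, `tau`, `eta`, `steps`) are illustrative assumptions, not the authors' specification.

```python
import numpy as np

# Hypothetical sketch (not the authors' code): a scaled ReLU residual network
# with Gaussian initialization, trained by constant step-size gradient descent
# on the logistic loss. All sizes and constants below are assumed for illustration.

rng = np.random.default_rng(0)
n, d, m, L = 100, 10, 256, 8            # samples, input dim, width, depth (assumed)
tau, eta, steps = 1.0 / L, 0.1, 200     # residual scaling, step size, iterations (assumed)

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs, common in this line of analysis
y = np.sign(rng.standard_normal(n))             # labels in {-1, +1}

# Gaussian initialization of the input layer, L residual blocks, and output layer.
W_in = rng.standard_normal((m, d)) / np.sqrt(m)
W_res = [rng.standard_normal((m, m)) / np.sqrt(m) for _ in range(L)]
w_out = rng.standard_normal(m) / np.sqrt(m)

def forward(x):
    """Residual forward pass: h_l = h_{l-1} + tau * relu(W_l h_{l-1})."""
    hs = [np.maximum(W_in @ x, 0.0)]
    for W in W_res:
        h = hs[-1]
        hs.append(h + tau * np.maximum(W @ h, 0.0))
    return hs, w_out @ hs[-1]

def logistic_loss(margin):
    return np.log1p(np.exp(-margin))

for k in range(steps):
    grad_in = np.zeros_like(W_in)
    grads_res = [np.zeros_like(W) for W in W_res]
    grad_out = np.zeros_like(w_out)
    total = 0.0
    for xi, yi in zip(X, y):
        hs, f = forward(xi)
        total += logistic_loss(yi * f)
        g = -yi / (1.0 + np.exp(yi * f))     # d/df of log(1 + exp(-y f))
        grad_out += g * hs[-1]
        dh = g * w_out                       # gradient w.r.t. the last hidden state
        for l in range(L - 1, -1, -1):       # backpropagate through the residual blocks
            mask = (W_res[l] @ hs[l] > 0).astype(float)
            grads_res[l] += tau * np.outer(dh * mask, hs[l])
            dh = dh + tau * (W_res[l].T @ (dh * mask))
        grad_in += np.outer(dh * (W_in @ xi > 0), xi)
    # Constant step-size gradient descent on every layer, as in the quoted update rule.
    W_in -= eta * grad_in / n
    for l in range(L):
        W_res[l] -= eta * grads_res[l] / n
    w_out -= eta * grad_out / n

print("average training loss on the final pass:", total / n)
```

The per-block factor `tau` stands in for the residual scaling parameter quoted in the Experiment Setup row; its value here, like every other constant, is chosen only to make the sketch runnable.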