Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks

Authors: Spencer Frei, Yuan Cao, Quanquan Gu

NeurIPS 2019

Reproducibility variables, results, and LLM responses:
Research Type: Theoretical
LLM Response: In this work, we analyze overparameterized deep residual networks trained by gradient descent following random initialization, and demonstrate that (i) the class of networks learned by gradient descent constitutes a small subset of the entire neural network function class, and (ii) this subclass of networks is sufficiently large to guarantee small training error. By showing (i) we are able to demonstrate that deep residual networks trained with gradient descent have a small generalization gap between training and test error, and together with (ii) this guarantees that the test error will be small. Our optimization and generalization guarantees require overparameterization that is only logarithmic in the depth of the network, while all known generalization bounds for deep non-residual networks have overparameterization requirements that are at least polynomial in the depth. This provides an explanation for why residual networks are preferable to non-residual ones.
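For reference, the training error, test error, and generalization gap referred to in this response can be written in standard notation. The display below uses generic definitions consistent with the loss $L_S(W)$ quoted later on this page; it is not the paper's specific bound.

```latex
% Generic definitions (standard notation, assumed rather than copied from the paper).
% Requires amsmath. Empirical (training) risk over n i.i.d. samples and population (test) risk:
\[
  L_S(W) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i f_W(x_i)\bigr),
  \qquad
  L_{\mathcal{D}}(W) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\bigl[\ell\bigl(y\, f_W(x)\bigr)\bigr].
\]
% The abstract's argument combines a small training error with a small gap between the two:
\[
  L_{\mathcal{D}}(W) = L_S(W) + \underbrace{\bigl(L_{\mathcal{D}}(W) - L_S(W)\bigr)}_{\text{generalization gap}}.
\]
```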
Researcher Affiliation: Academia
LLM Response: Department of Statistics, University of California, Los Angeles, CA 90095, USA; e-mail: spencerfrei@ucla.edu. Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: yuancao@cs.ucla.edu. Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: qgu@cs.ucla.edu.
Pseudocode: No
LLM Response: The paper describes the gradient descent optimization process using mathematical formulas and text, e.g., $\min_W L_S(W) := \frac{1}{n}\sum_{i=1}^{n} \ell(y_i f_W(x_i))$ with updates $W_l^{(k+1)} = W_l^{(k)} - \eta \nabla_{W_l} L_S(W^{(k)})$ for $l \in [L+1]$, but it does not include a dedicated section or block explicitly labeled 'Pseudocode' or 'Algorithm'.
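The update quoted above is ordinary full-batch gradient descent with a constant step size. Below is a minimal sketch of such a training loop on a toy residual network; it is not the authors' code, and the width, depth, scaling factor, step size, and synthetic data are all assumptions chosen for illustration.

```python
# Minimal sketch only, not the authors' code: full-batch gradient descent with a
# constant step size, mirroring the quoted update
# W_l^(k+1) = W_l^(k) - eta * grad_{W_l} L_S(W^(k)).
# Width, depth, scaling, step size, and the synthetic data are assumed values.
import torch
import torch.nn.functional as F

d, m, L, n = 10, 256, 8, 100            # input dim, width, depth, sample size (assumed)
theta, eta, steps = 1.0 / L, 0.1, 200   # residual scaling, step size, iterations (assumed)

torch.manual_seed(0)
X = torch.randn(n, d)
y = torch.where(X[:, 0] > 0, torch.tensor(1.0), torch.tensor(-1.0))  # labels in {-1, +1}

# Gaussian initialization: input layer, L residual blocks, output layer.
W_in = (torch.randn(d, m) * (2.0 / d) ** 0.5).requires_grad_()
W_res = [(torch.randn(m, m) * (2.0 / m) ** 0.5).requires_grad_() for _ in range(L)]
w_out = (torch.randn(m, 1) / m ** 0.5).requires_grad_()
params = [W_in, *W_res, w_out]

def f(X):
    """Forward pass: residual blocks of the form h <- h + theta * relu(h @ W_l)."""
    h = torch.relu(X @ W_in)
    for W in W_res:
        h = h + theta * torch.relu(h @ W)
    return (h @ w_out).squeeze(-1)

for k in range(steps):
    loss = F.softplus(-y * f(X)).mean()       # logistic loss log(1 + exp(-y f(x)))
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= eta * g                      # constant step size update for every layer
```

Only the update rule itself is taken from the paper; everything else in this loop is an assumed concrete instantiation.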
Open Source Code: No
LLM Response: The paper does not provide any links to source code repositories, nor does it explicitly state that code for the methodology is released.
Open Datasets: No
LLM Response: The paper assumes i.i.d. samples from a theoretical distribution $\mathcal{D}$ ('We assume we have i.i.d. samples $(x_i, y_i)_{i=1}^n \sim \mathcal{D}$ from a distribution $\mathcal{D}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$'), and makes assumptions about its properties (Assumption 3.2), but it does not reference or provide access to a specific public dataset for training.
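Because the paper only posits i.i.d. samples with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, any concrete dataset is necessarily a stand-in. A minimal sketch, assuming an arbitrary Gaussian class-conditional distribution that is not taken from the paper:

```python
# Illustrative stand-in only: the paper never fixes a concrete dataset or distribution,
# so the Gaussian-mixture choice below is an arbitrary assumption.
import numpy as np

def sample_iid(n: int, d: int, rng: np.random.Generator):
    """Draw n i.i.d. labeled examples (x_i, y_i) with x_i in R^d and y_i in {-1, +1}."""
    y = rng.choice([-1.0, 1.0], size=n)        # labels in {-1, +1}
    x = rng.normal(size=(n, d)) + y[:, None]   # class-dependent mean shift (assumed)
    return x, y

X, y = sample_iid(n=100, d=10, rng=np.random.default_rng(0))
```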
Dataset Splits: No
LLM Response: The paper focuses on theoretical analysis and does not describe empirical experiments or specific train/validation/test data splits.
Hardware Specification: No
LLM Response: The paper is theoretical and does not specify any hardware used for experiments.
Software Dependencies: No
LLM Response: The paper is theoretical and does not specify any software dependencies with version numbers for an experimental setup.
Experiment Setup: No
LLM Response: The paper defines parameters and assumptions for its theoretical model (e.g., 'constant step size gradient descent', 'Gaussian initialization', 'residual scaling parameter θ = 1/Ω(L)', 'width m'), but it does not describe an empirical experimental setup with hyperparameters for reproducing practical experiments, as the paper is theoretical.
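The theoretical ingredients listed above (Gaussian initialization, width m, depth L, residual scaling θ) can only be made concrete by assuming specific values. The sketch below fixes such values purely for illustration; it is not the paper's setup, and the particular variances and the choice θ = 1/L are assumptions.

```python
# Sketch under assumed values, not the paper's setup: Gaussian-initialized weights,
# width m, depth L, and a residual scaling factor theta applied to each block's output.
import numpy as np

d, m, L = 10, 256, 8                 # input dimension, width, depth (assumed values)
theta = 1.0 / L                      # assumed concrete choice for the 1/Omega(L) scaling
rng = np.random.default_rng(0)

W_in = rng.normal(scale=np.sqrt(2.0 / d), size=(d, m))                        # Gaussian init
W_res = [rng.normal(scale=np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]   # L blocks
w_out = rng.normal(scale=1.0 / np.sqrt(m), size=(m,))

def forward(x: np.ndarray) -> float:
    """Residual forward pass: h <- h + theta * relu(h @ W_l) for each of the L blocks."""
    h = np.maximum(x @ W_in, 0.0)
    for W in W_res:
        h = h + theta * np.maximum(h @ W, 0.0)
    return float(h @ w_out)

print(forward(rng.normal(size=d)))   # scalar output f_W(x); its sign gives the predicted label
```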