Benign Overfitting in Deep Neural Networks under Lazy Training

Authors: Zhenyu Zhu, Fanghui Liu, Grigorios Chrysos, Francesco Locatello, Volkan Cevher

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper focuses on over-parameterized deep neural networks (DNNs) with ReLU activation functions and proves that when the data distribution is well separated, DNNs can achieve Bayes-optimal test error for classification while obtaining (nearly) zero training error under the lazy training regime. For this purpose, we unify three interrelated concepts of over-parameterization, benign overfitting, and the Lipschitz constant of DNNs. Our results indicate that interpolating with smoother functions leads to better generalization. Furthermore, we investigate the special case where interpolating smooth ground-truth functions is performed by DNNs under the Neural Tangent Kernel (NTK) regime for generalization. Our result demonstrates that the generalization error converges to a constant order that only depends on the label noise and initialization noise, which theoretically verifies benign overfitting. Our analysis provides a tight lower bound on the normalized margin under non-smooth activation functions, as well as on the minimum eigenvalue of the NTK under high-dimensional settings, which are of independent interest in learning theory.
Researcher Affiliation | Collaboration | 1) Laboratory for Information and Inference Systems, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland; 2) Amazon Web Services (Work done outside of Amazon).
Pseudocode | Yes | Algorithm 1: SGD for training DNNs
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the methodology described.
Open Datasets | Yes | We also empirically verify our assumption on MNIST (LeCun et al., 1998) with ten digits from 0 to 9.
Dataset Splits | No | The paper discusses 'training data' and 'test error' but does not provide specific details on how the dataset was split into training, validation, and test sets (e.g., percentages, sample counts, or specific predefined splits with citations).
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running experiments or computations are mentioned in the paper.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9') are mentioned in the paper.
Experiment Setup | Yes | Given a DNN defined by Eq. (1) and trained by Algorithm 1 with a step size α ≲ L^{-2}(log m)^{-5/2}. Then under Assumption 1 and 2, for ω ≤ O(L^{-9/2}(log m)^{-3}) and λ > 0, with probability at least 1 - O(nL^2) exp(-Ω(mω^{2/3}L))... We use NTK initialization (Allen-Zhu et al., 2019b) in this section, but the main result can be easily extended to more initializations, such as He (He et al., 2015) and LeCun (LeCun et al., 2012).
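
The abstract quoted in the Research Type row highlights a lower bound on the minimum eigenvalue of the NTK in high-dimensional settings. Purely as a numerical illustration of that quantity (not the paper's multi-layer bound), the sketch below evaluates one standard closed form of the two-layer ReLU NTK on unit-norm inputs and reports the smallest eigenvalue of its Gram matrix; the sample size and dimension are arbitrary choices.

```python
# Smallest eigenvalue of a two-layer ReLU NTK Gram matrix (numerical sketch).
# Uses a standard closed form for unit-norm inputs; n and d are arbitrary.
import numpy as np

def relu_ntk(X):
    """NTK Gram matrix of an infinite-width two-layer ReLU net on unit-norm rows of X."""
    U = np.clip(X @ X.T, -1.0, 1.0)                 # cosine similarities
    theta = np.arccos(U)
    k0 = (np.pi - theta) / (2 * np.pi)              # E[relu'(w.x) relu'(w.x')]
    k1 = (np.sin(theta) + (np.pi - theta) * U) / (2 * np.pi)  # E[relu(w.x) relu(w.x')]
    return U * k0 + k1

rng = np.random.default_rng(0)
n, d = 200, 500                                     # high-dimensional setting (d > n, chosen arbitrarily)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # project inputs onto the unit sphere
lam_min = np.linalg.eigvalsh(relu_ntk(X)).min()
print(f"smallest NTK eigenvalue: {lam_min:.4f}")    # expected to be strictly positive for distinct points
```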
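
The Pseudocode row refers to Algorithm 1, SGD for training DNNs, which the paper states only as pseudocode. Below is a minimal PyTorch sketch of such a training loop for an over-parameterized ReLU network; the width, depth, step size, epoch count, and loss are illustrative placeholders rather than the paper's settings.

```python
# Minimal SGD training loop for an over-parameterized ReLU DNN (sketch).
# Width, depth, step size, epochs, and loss are illustrative placeholders.
import torch
import torch.nn as nn

def make_relu_dnn(d_in: int, width: int, depth: int) -> nn.Sequential:
    """Fully connected ReLU network with `depth` hidden layers and a scalar output."""
    layers = [nn.Linear(d_in, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

def train_sgd(model: nn.Module, loader, step_size: float = 1e-2, epochs: int = 10) -> nn.Module:
    """Plain mini-batch SGD on a binary-classification surrogate loss."""
    opt = torch.optim.SGD(model.parameters(), lr=step_size)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for x, y in loader:                          # `loader` yields (inputs, 0/1 labels)
            opt.zero_grad()
            loss = loss_fn(model(x).squeeze(-1), y.float())
            loss.backward()
            opt.step()
    return model
```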
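
The Open Datasets row notes that the separation assumption is verified empirically on MNIST. One sanity check in that spirit is sketched below: it normalizes a subsample of MNIST images to the unit sphere and reports the smallest pairwise distance between examples of different classes. The subsample size and this particular notion of separation are assumptions for illustration, not the paper's exact protocol.

```python
# Crude check of pairwise class separation on MNIST (illustrative only; the
# paper's exact assumption and verification protocol may differ).
import numpy as np
from torchvision import datasets

mnist = datasets.MNIST(root="./data", train=True, download=True)
X = mnist.data.numpy().reshape(len(mnist), -1).astype(np.float64)
y = mnist.targets.numpy()

idx = np.random.default_rng(0).choice(len(X), size=2000, replace=False)  # subsample for tractability
X, y = X[idx], y[idx]
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm inputs

# Squared distances via the Gram matrix: ||x - x'||^2 = 2 - 2 x.x' on the unit sphere.
D2 = np.clip(2.0 - 2.0 * (X @ X.T), 0.0, None)
diff_class = y[:, None] != y[None, :]
print("min distance between different-class points:", np.sqrt(D2[diff_class].min()))
```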
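
The Experiment Setup row mentions NTK initialization in the sense of Allen-Zhu et al. (2019b). The sketch below shows one common way to realize such a scaling in PyTorch, drawing input and hidden weights from N(0, 2/m) and the output layer from N(0, 1/m); the exact output-layer constant is an assumption and may differ from the paper's.

```python
# NTK-style initialization for a deep ReLU net (sketch). Input/hidden weights are
# drawn from N(0, 2/m); the output-layer scaling N(0, 1/m) is one common choice
# and may differ in constants from the scheme used in the paper.
import torch
import torch.nn as nn

@torch.no_grad()
def ntk_init_(model: nn.Module, width: int) -> nn.Module:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            if module.out_features == 1:                       # scalar output layer
                module.weight.normal_(0.0, (1.0 / width) ** 0.5)
            else:                                              # input/hidden layers
                module.weight.normal_(0.0, (2.0 / width) ** 0.5)
            if module.bias is not None:
                module.bias.zero_()
    return model

# Usage with the training sketch above:
# model = ntk_init_(make_relu_dnn(d_in=784, width=1024, depth=4), width=1024)
```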