Benign Overfitting in Deep Neural Networks under Lazy Training
Authors: Zhenyu Zhu, Fanghui Liu, Grigorios Chrysos, Francesco Locatello, Volkan Cevher
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper focuses on over-parameterized deep neural networks (DNNs) with ReLU activation functions and proves that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification while obtaining (nearly) zero training error under the lazy training regime. For this purpose, we unify three interrelated concepts: over-parameterization, benign overfitting, and the Lipschitz constant of DNNs. Our results indicate that interpolating with smoother functions leads to better generalization. Furthermore, we investigate the special case where interpolation of smooth ground-truth functions is performed by DNNs under the Neural Tangent Kernel (NTK) regime for generalization. Our result demonstrates that the generalization error converges to a constant order that only depends on label noise and initialization noise, which theoretically verifies benign overfitting. Our analysis provides a tight lower bound on the normalized margin under non-smooth activation functions, as well as on the minimum eigenvalue of the NTK under high-dimensional settings, which is of independent interest in learning theory. |
| Researcher Affiliation | Collaboration | (1) Laboratory for Information and Inference Systems, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland; (2) Amazon Web Services (work done outside of Amazon). |
| Pseudocode | Yes | Algorithm 1 SGD for training DNNs |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the methodology described. |
| Open Datasets | Yes | We also empirically verify our assumption on MNIST (LeCun et al., 1998) with ten digits from 0 to 9. |
| Dataset Splits | No | The paper discusses 'training data' and 'test error' but does not provide specific details on how the dataset was split into training, validation, and test sets (e.g., percentages, sample counts, or specific predefined splits with citations). |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running experiments or computations are mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9') are mentioned in the paper. |
| Experiment Setup | Yes | Given a DNN defined by Eq. (1) and trained by Algorithm 1 with a step size α ≲ L^{-2}(log m)^{-5/2}. Then under Assumptions 1 and 2, for ω ≤ O(L^{-9/2}(log m)^{-3}) and λ > 0, with probability at least 1 − O(nL²)·exp(−Ω(m·ω^{2/3}·L))... We use NTK initialization (Allen-Zhu et al., 2019b) in this section, but the main result can be easily extended to more initializations, such as He (He et al., 2015) and LeCun (LeCun et al., 2012). (See the illustrative training sketch after this table.) |
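
The paper reports no released code, so as a rough illustration of the "Pseudocode" and "Experiment Setup" rows above, here is a minimal PyTorch sketch of SGD training for a ReLU DNN with an NTK-style Gaussian initialization. The function names (`make_ntk_mlp`, `train_sgd`), the variance scaling, and all hyperparameter values are assumptions for illustration only, not the authors' Algorithm 1 or their exact setup.

```python
# Hypothetical sketch of SGD training for a ReLU DNN with NTK-style
# initialization; width m, depth L, and step size alpha are placeholders.
import torch
import torch.nn as nn

def make_ntk_mlp(d_in, m, L, d_out=1):
    """ReLU MLP; hidden weights drawn from N(0, 2/fan_in), no biases (assumed)."""
    layers, width_in = [], d_in
    for _ in range(L):
        lin = nn.Linear(width_in, m, bias=False)
        nn.init.normal_(lin.weight, std=(2.0 / width_in) ** 0.5)
        layers += [lin, nn.ReLU()]
        width_in = m
    head = nn.Linear(width_in, d_out, bias=False)
    nn.init.normal_(head.weight, std=(1.0 / width_in) ** 0.5)
    layers.append(head)
    return nn.Sequential(*layers)

def train_sgd(model, X, y, alpha, num_steps):
    """Plain full-batch SGD on the squared loss, one step per iteration."""
    opt = torch.optim.SGD(model.parameters(), lr=alpha)
    loss_fn = nn.MSELoss()
    for _ in range(num_steps):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return model

# Example usage with synthetic data (shapes only, not the paper's experiments).
if __name__ == "__main__":
    n, d, m, L = 128, 20, 512, 3
    X = torch.randn(n, d)
    y = torch.sign(X[:, 0]).float()      # toy binary labels in {-1, +1}
    # Step size of the order 1 / (L^2 (log m)^{5/2}), mirroring the theorem statement.
    alpha = 1.0 / (L**2 * torch.log(torch.tensor(float(m))) ** 2.5)
    model = make_ntk_mlp(d, m, L)
    train_sgd(model, X, y, alpha.item(), num_steps=200)
```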
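Likewise, the minimum eigenvalue of the NTK Gram matrix highlighted in the abstract can be checked numerically. The sketch below is an assumption-laden illustration, not the paper's analysis: it builds the empirical (finite-width) NTK from per-sample parameter gradients of the toy model above and reports its smallest eigenvalue; `empirical_ntk_min_eig` and all sizes are hypothetical.

```python
# Hypothetical numerical check of the empirical NTK Gram matrix and its
# minimum eigenvalue; reuses make_ntk_mlp from the sketch above.
import torch

def empirical_ntk_min_eig(model, X):
    """K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>; returns lambda_min(K)."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for i in range(X.shape[0]):
        out = model(X[i:i + 1]).sum()
        g = torch.autograd.grad(out, params)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    G = torch.stack(grads)        # n x (number of parameters)
    K = G @ G.T                   # empirical NTK Gram matrix
    return torch.linalg.eigvalsh(K).min()

# Example: for well-separated, high-dimensional inputs (the regime the paper
# analyzes), this eigenvalue should stay bounded away from zero.
X = torch.randn(64, 20)
model = make_ntk_mlp(20, 256, 2)
print(empirical_ntk_min_eig(model, X).item())
```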