Optimal Regularization Can Mitigate Double Descent
Authors: Preetum Nakkiran, Prayaag Venkat, Sham M. Kakade, Tengyu Ma
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical & Experimental (see the ridge sketch below the table) | Our Contributions: We study this question from both a theoretical and empirical perspective. Theoretically, we prove that for certain linear regression models with isotropic data distribution, optimally-tuned ℓ2 regularization achieves monotonic test performance as we grow either the sample size or the model size. We also demonstrate empirically that optimally-tuned ℓ2 regularization can mitigate double descent for more general models, including neural networks. |
| Researcher Affiliation | Collaboration | Preetum Nakkiran (Harvard University, preetum@cs.harvard.edu); Prayaag Venkat (Harvard University, pvenkat@g.harvard.edu); Sham Kakade (Microsoft Research & University of Washington, sham@cs.washington.edu); Tengyu Ma (Stanford University, tengyuma@stanford.edu) |
| Pseudocode | No | The paper contains mathematical derivations, proofs, and lemmas, but no explicitly labeled pseudocode or algorithm blocks are present. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | In Appendix A.4, we apply random features to Fashion-MNIST (Xiao et al., 2017). ... We train and test on CIFAR100 (Krizhevsky et al., 2009). |
| Dataset Splits | No | Here, a natural strategy would be to use a regularizer and tune its strength on a validation set. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware specifications (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions training using Stochastic Gradient Descent (SGD) but does not provide specific software dependencies with version numbers (e.g., programming language versions, library versions like PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | All models are trained using Stochastic Gradient Descent (SGD) on the cross-entropy loss, with step size 0.1/√(T/512 + 1) at gradient step T. We train for 1e6 gradient steps, and use weight decay λ for varying λ. Due to optimization instabilities for large λ, we use the model with the minimum train loss among the last 5K gradient steps. |
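To make the quoted experiment setup concrete, here is a minimal training-loop sketch. It assumes PyTorch; `train`, `model`, and `train_loader` are placeholder names (the paper releases no code, per "Open Source Code" above), and the checkpoint-selection rule is only noted in a comment.

```python
import math

import torch
import torch.nn as nn

def train(model, train_loader, weight_decay, total_steps=1_000_000):
    """Sketch of the quoted recipe: SGD on cross-entropy with weight decay λ
    and step size 0.1/sqrt(T/512 + 1) at gradient step T."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                weight_decay=weight_decay)
    step = 0
    while step < total_steps:
        for x, y in train_loader:
            # Inverse-square-root learning-rate schedule from the quoted setup.
            lr = 0.1 / math.sqrt(step / 512 + 1)
            for group in optimizer.param_groups:
                group["lr"] = lr
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            step += 1
            if step >= total_steps:
                break
    # The paper additionally keeps the model with the minimum train loss over
    # the last 5K gradient steps when large λ destabilizes optimization
    # (checkpointing is omitted from this sketch).
    return model
```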
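The contribution statement above ("optimally-tuned ℓ2 regularization achieves monotonic test performance" for isotropic linear regression) and the validation-set tuning strategy quoted under "Dataset Splits" can be illustrated with a toy ridge-regression sweep. This is a hypothetical NumPy sketch, not the paper's experiment: the dimension, noise level, sample sizes, and λ grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 50, 0.5                      # input dimension and label-noise level (illustrative)
beta = rng.normal(size=d) / np.sqrt(d)  # ground-truth parameter

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def test_mse(w, n_test=2000):
    Xt = rng.normal(size=(n_test, d))
    yt = Xt @ beta + sigma * rng.normal(size=n_test)
    return float(np.mean((Xt @ w - yt) ** 2))

lams = np.logspace(-4, 2, 25)           # λ grid to tune over (illustrative)
for n in [20, 40, 50, 60, 100, 200]:    # sweep the sample size through n ≈ d
    X = rng.normal(size=(n, d))
    y = X @ beta + sigma * rng.normal(size=n)
    Xv = rng.normal(size=(500, d))      # held-out validation split for tuning λ
    yv = Xv @ beta + sigma * rng.normal(size=500)
    w_unreg = ridge(X, y, 1e-8)         # λ → 0: (nearly) unregularized fit
    w_tuned = min((ridge(X, y, lam) for lam in lams),
                  key=lambda w: np.mean((Xv @ w - yv) ** 2))
    print(n, round(test_mse(w_unreg), 3), round(test_mse(w_tuned), 3))
```

In typical runs the nearly unregularized fit shows a pronounced test-error spike near n ≈ d (the double-descent peak), while the validation-tuned fit stays roughly monotone, matching the qualitative claim quoted above.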