Optimal Regularization can Mitigate Double Descent

Authors: Preetum Nakkiran, Prayaag Venkat, Sham M. Kakade, Tengyu Ma

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our Contributions: We study this question from both a theoretical and empirical perspective. Theoretically, we prove that for certain linear regression models with isotropic data distribution, optimally-tuned ℓ2 regularization achieves monotonic test performance as we grow either the sample size or the model size. We also demonstrate empirically that optimally-tuned ℓ2 regularization can mitigate double descent for more general models, including neural networks. (A minimal ridge-regression sketch of this setting follows the table.)
Researcher Affiliation | Collaboration | Preetum Nakkiran, Harvard University, preetum@cs.harvard.edu; Prayaag Venkat, Harvard University, pvenkat@g.harvard.edu; Sham Kakade, Microsoft Research & University of Washington, sham@cs.washington.edu; Tengyu Ma, Stanford University, tengyuma@stanford.edu
Pseudocode | No | The paper contains mathematical derivations, proofs, and lemmas, but no explicitly labeled pseudocode or algorithm blocks are present.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | In Appendix A.4, we apply random features to Fashion-MNIST Xiao et al. (2017). ... We train and test on CIFAR100 (Krizhevsky et al., 2009). (A random-features loading sketch follows the table.)
Dataset Splits | No | Here, a natural strategy would be to use a regularizer and tune its strength on a validation set.
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions training with Stochastic Gradient Descent (SGD) but does not list software dependencies with version numbers (e.g., programming-language or library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | All models are trained using Stochastic Gradient Descent (SGD) on the cross-entropy loss, with step size 0.1/√(⌊T/512⌋ + 1) at step T. We train for 1e6 gradient steps, and use weight decay λ for varying λ. Due to optimization instabilities for large λ, we use the model with the minimum train loss among the last 5K gradient steps. (A training-loop sketch follows the table.)
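The sketches below are illustrations, not the authors' released code. First, a minimal NumPy sketch of the setting referenced in the Research Type and Dataset Splits rows: ridge regression on isotropic Gaussian data, with the ℓ2 strength λ tuned on a held-out validation split. The dimension, noise level, sample sizes, and λ grid are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200          # ambient dimension (illustrative choice)
sigma = 0.5      # label-noise standard deviation (illustrative choice)
beta = rng.normal(size=d) / np.sqrt(d)   # ground-truth parameter

def sample(n):
    """Isotropic Gaussian covariates with noisy linear labels."""
    X = rng.normal(size=(n, d))
    return X, X @ beta + sigma * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def test_mse(w, n_test=10_000):
    X, y = sample(n_test)
    return float(np.mean((X @ w - y) ** 2))

lambdas = [1e-6, 1e-3, 1e-2, 1e-1, 1.0, 10.0]    # candidate grid (assumed)

for n in [50, 100, 150, 200, 250, 400]:          # sweep sample size around n = d
    X_tr, y_tr = sample(n)
    X_val, y_val = sample(max(n // 5, 20))       # held-out validation split
    # Near-ridgeless baseline: test error typically peaks near n = d (double descent).
    mse_ridgeless = test_mse(ridge(X_tr, y_tr, 1e-8))
    # Tune lambda on the validation split, then evaluate on fresh test data.
    best_lam = min(lambdas,
                   key=lambda l: np.mean((X_val @ ridge(X_tr, y_tr, l) - y_val) ** 2))
    mse_tuned = test_mse(ridge(X_tr, y_tr, best_lam))
    print(f"n={n:4d}  ridgeless={mse_ridgeless:8.3f}  tuned(lam={best_lam:g})={mse_tuned:.3f}")
```

Near the interpolation threshold n ≈ d the near-ridgeless estimator's test error spikes, while the validation-tuned λ keeps it roughly monotone in n, which is the behavior the paper's theory describes for optimally-tuned ℓ2 in this isotropic setting.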
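Second, for the Open Datasets row, a hedged sketch of loading Fashion-MNIST via torchvision and fitting ridge regression on random ReLU features. The feature map, feature count, subsample sizes, and λ value are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np
import torch
from torchvision import datasets, transforms

# Load Fashion-MNIST (Xiao et al., 2017) via torchvision; flatten images to 784-dim vectors.
to_vec = transforms.Compose([transforms.ToTensor(), transforms.Lambda(lambda t: t.view(-1))])
train = datasets.FashionMNIST("data", train=True, download=True, transform=to_vec)
test = datasets.FashionMNIST("data", train=False, download=True, transform=to_vec)

def to_arrays(ds, n):
    X = torch.stack([ds[i][0] for i in range(n)]).numpy()
    y = np.array([ds[i][1] for i in range(n)])
    return X, y

X_tr, y_tr = to_arrays(train, 5_000)     # subsample sizes are illustrative
X_te, y_te = to_arrays(test, 2_000)

# Random ReLU features phi(x) = max(0, W x) with a fixed random W (assumed feature map).
n_features = 1_000
rng = np.random.default_rng(0)
W = rng.normal(size=(784, n_features)) / np.sqrt(784)
Phi_tr, Phi_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)

# Ridge regression on one-hot labels for a single regularization strength lam.
lam = 1e-2                               # would be tuned on a validation split in practice
Y = np.eye(10)[y_tr]                     # one-hot targets
coef = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(n_features), Phi_tr.T @ Y)
acc = np.mean((Phi_te @ coef).argmax(axis=1) == y_te)
print(f"test accuracy with {n_features} random features, lam={lam}: {acc:.3f}")
```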
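Finally, for the Experiment Setup row, a rough PyTorch-style reconstruction of the quoted protocol: SGD on cross-entropy with step size 0.1/√(⌊T/512⌋ + 1) at step T, weight decay λ, 1e6 gradient steps, and selection of the checkpoint with the lowest train loss among the last 5K steps. The model and data loader are left abstract, and the per-batch loss stands in for the full train loss.

```python
import copy
import math
import torch
import torch.nn as nn

def train(model, loader, lam, total_steps=1_000_000, last_k=5_000, device="cpu"):
    """SGD on cross-entropy with lr = 0.1 / sqrt(floor(T/512) + 1) and weight decay lam.

    Returns the checkpoint with the lowest (per-batch) train loss among the last
    `last_k` steps, as a guard against instabilities at large lam.
    """
    model = model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=lam)
    loss_fn = nn.CrossEntropyLoss()
    best_loss, best_state = float("inf"), None

    step = 0
    while step < total_steps:
        for x, y in loader:
            if step >= total_steps:
                break
            # Inverse-square-root step-size schedule from the quoted setup.
            lr = 0.1 / math.sqrt(step // 512 + 1)
            for group in opt.param_groups:
                group["lr"] = lr

            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

            # Track the best train loss over the final `last_k` gradient steps only.
            if step >= total_steps - last_k and loss.item() < best_loss:
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
            step += 1

    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```

One call per λ value, e.g. train(model, train_loader, lam=5e-4) with an arbitrary λ here, corresponds to the "varying λ" sweep in the quoted setup.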