Optimal Regularization can Mitigate Double Descent

Authors: Preetum Nakkiran, Prayaag Venkat, Sham M. Kakade, Tengyu Ma

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our Contributions: We study this question from both a theoretical and empirical perspective. Theoretically, we prove that for certain linear regression models with isotropic data distribution, optimally-tuned ℓ2 regularization achieves monotonic test performance as we grow either the sample size or the model size. We also demonstrate empirically that optimally-tuned ℓ2 regularization can mitigate double descent for more general models, including neural networks. (A minimal ridge-regression sketch of this setting follows the table.)
Researcher Affiliation | Collaboration | Preetum Nakkiran, Harvard University, preetum@cs.harvard.edu; Prayaag Venkat, Harvard University, pvenkat@g.harvard.edu; Sham Kakade, Microsoft Research & University of Washington, sham@cs.washington.edu; Tengyu Ma, Stanford University, tengyuma@stanford.edu
Pseudocode | No | The paper contains mathematical derivations, proofs, and lemmas, but no explicitly labeled pseudocode or algorithm blocks are present.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | In Appendix A.4, we apply random features to Fashion-MNIST Xiao et al. (2017). ... We train and test on CIFAR100 (Krizhevsky et al., 2009). (A random-features loading sketch follows the table.)
Dataset Splits | No | Here, a natural strategy would be to use a regularizer and tune its strength on a validation set.
Hardware Specification | No | The paper does not specify the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions training with Stochastic Gradient Descent (SGD) but does not list software dependencies with version numbers (e.g., programming-language or library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | All models are trained using Stochastic Gradient Descent (SGD) on the cross-entropy loss, with step size 0.1/√(⌊T/512⌋ + 1) at step T. We train for 1e6 gradient steps, and use weight decay λ for varying λ. Due to optimization instabilities for large λ, we use the model with the minimum train loss among the last 5K gradient steps. (A training-loop sketch follows the table.)
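The sketches below are illustrations, not the authors' released code. First, a minimal NumPy sketch of the setting referenced in the Research Type and Dataset Splits rows: ridge regression on isotropic Gaussian data, with the ℓ2 strength λ tuned on a held-out validation split. The dimension, noise level, sample sizes, and λ grid are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200          # ambient dimension (illustrative choice)
sigma = 0.5      # label-noise standard deviation (illustrative choice)
beta = rng.normal(size=d) / np.sqrt(d)   # ground-truth parameter

def sample(n):
    """Isotropic Gaussian covariates with noisy linear labels."""
    X = rng.normal(size=(n, d))
    return X, X @ beta + sigma * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def test_mse(w, n_test=10_000):
    X, y = sample(n_test)
    return float(np.mean((X @ w - y) ** 2))

lambdas = [1e-6, 1e-3, 1e-2, 1e-1, 1.0, 10.0]    # candidate grid (assumed)

for n in [50, 100, 150, 200, 250, 400]:          # sweep sample size around n = d
    X_tr, y_tr = sample(n)
    X_val, y_val = sample(max(n // 5, 20))       # held-out validation split
    # Near-ridgeless baseline: test error typically peaks near n = d (double descent).
    mse_ridgeless = test_mse(ridge(X_tr, y_tr, 1e-8))
    # Tune lambda on the validation split, then evaluate on fresh test data.
    best_lam = min(lambdas,
                   key=lambda l: np.mean((X_val @ ridge(X_tr, y_tr, l) - y_val) ** 2))
    mse_tuned = test_mse(ridge(X_tr, y_tr, best_lam))
    print(f"n={n:4d}  ridgeless={mse_ridgeless:8.3f}  tuned(lam={best_lam:g})={mse_tuned:.3f}")
```

Near the interpolation threshold n ≈ d the near-ridgeless estimator's test error spikes, while the validation-tuned λ keeps it roughly monotone in n, which is the behavior the paper's theory describes for optimally-tuned ℓ2 in this isotropic setting.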
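Second, for the Open Datasets row, a hedged sketch of loading Fashion-MNIST via torchvision and fitting ridge regression on random ReLU features. The feature map, feature count, subsample sizes, and λ value are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np
import torch
from torchvision import datasets, transforms

# Load Fashion-MNIST (Xiao et al., 2017) via torchvision; flatten images to 784-dim vectors.
to_vec = transforms.Compose([transforms.ToTensor(), transforms.Lambda(lambda t: t.view(-1))])
train = datasets.FashionMNIST("data", train=True, download=True, transform=to_vec)
test = datasets.FashionMNIST("data", train=False, download=True, transform=to_vec)

def to_arrays(ds, n):
    X = torch.stack([ds[i][0] for i in range(n)]).numpy()
    y = np.array([ds[i][1] for i in range(n)])
    return X, y

X_tr, y_tr = to_arrays(train, 5_000)     # subsample sizes are illustrative
X_te, y_te = to_arrays(test, 2_000)

# Random ReLU features phi(x) = max(0, W x) with a fixed random W (assumed feature map).
n_features = 1_000
rng = np.random.default_rng(0)
W = rng.normal(size=(784, n_features)) / np.sqrt(784)
Phi_tr, Phi_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)

# Ridge regression on one-hot labels for a single regularization strength lam.
lam = 1e-2                               # would be tuned on a validation split in practice
Y = np.eye(10)[y_tr]                     # one-hot targets
coef = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(n_features), Phi_tr.T @ Y)
acc = np.mean((Phi_te @ coef).argmax(axis=1) == y_te)
print(f"test accuracy with {n_features} random features, lam={lam}: {acc:.3f}")
```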
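Finally, for the Experiment Setup row, a rough PyTorch-style reconstruction of the quoted protocol: SGD on cross-entropy with step size 0.1/√(⌊T/512⌋ + 1) at step T, weight decay λ, 1e6 gradient steps, and selection of the checkpoint with the lowest train loss among the last 5K steps. The model and data loader are left abstract, and the per-batch loss stands in for the full train loss.

```python
import copy
import math
import torch
import torch.nn as nn

def train(model, loader, lam, total_steps=1_000_000, last_k=5_000, device="cpu"):
    """SGD on cross-entropy with lr = 0.1 / sqrt(floor(T/512) + 1) and weight decay lam.

    Returns the checkpoint with the lowest (per-batch) train loss among the last
    `last_k` steps, as a guard against instabilities at large lam.
    """
    model = model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=lam)
    loss_fn = nn.CrossEntropyLoss()
    best_loss, best_state = float("inf"), None

    step = 0
    while step < total_steps:
        for x, y in loader:
            if step >= total_steps:
                break
            # Inverse-square-root step-size schedule from the quoted setup.
            lr = 0.1 / math.sqrt(step // 512 + 1)
            for group in opt.param_groups:
                group["lr"] = lr

            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

            # Track the best train loss over the final `last_k` gradient steps only.
            if step >= total_steps - last_k and loss.item() < best_loss:
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
            step += 1

    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```

One call per λ value, e.g. train(model, train_loader, lam=5e-4) with an arbitrary λ here, corresponds to the "varying λ" sweep in the quoted setup.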