Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Optimal Regularization can Mitigate Double Descent
Authors: Preetum Nakkiran, Prayaag Venkat, Sham M. Kakade, Tengyu Ma
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our Contributions: We study this question from both a theoretical and empirical perspective. Theoretically, we prove that for certain linear regression models with isotropic data distribution, optimally-tuned ℓ2 regularization achieves monotonic test performance as we grow either the sample size or the model size. We also demonstrate empirically that optimally-tuned ℓ2 regularization can mitigate double descent for more general models, including neural networks. |
| Researcher Affiliation | Collaboration | Preetum Nakkiran Harvard University EMAIL Prayaag Venkat Harvard University EMAIL Sham Kakade Microsoft Research & University of Washington EMAIL Tengyu Ma Stanford University EMAIL |
| Pseudocode | No | The paper contains mathematical derivations, proofs, and lemmas, but no explicitly labeled pseudocode or algorithm blocks are present. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | In Appendix A.4, we apply random features to Fashion-MNIST Xiao et al. (2017). ... We train and test on CIFAR100 (Krizhevsky et al., 2009). |
| Dataset Splits | No | Here, a natural strategy would be to use a regularizer and tune its strength on a validation set. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware specifications (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions training using Stochastic Gradient Descent (SGD) but does not provide specific software dependencies with version numbers (e.g., programming language versions, library versions like PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | All models are trained using Stochastic Gradient Descent (SGD) on the cross-entropy loss, with step size 0.1/√(T/512 + 1) at step T. We train for 1e6 gradient steps, and use weight decay λ for varying λ. Due to optimization instabilities for large λ, we use the model with the minimum train loss among the last 5K gradient steps. |
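
The Experiment Setup row describes two mechanical pieces: an inverse-square-root step-size schedule and a rule for picking the final model. The following is a minimal sketch of both, not the authors' code. It assumes the step size reads as 0.1/√(T/512 + 1), which is how the formula appears in the row above; whether T/512 is rounded is not recoverable here. The function names `step_size` and `select_min_train_loss` are hypothetical, as is the assumption that one train-loss value is recorded per gradient step.

```python
import math
from typing import Sequence


def step_size(t: int, base_lr: float = 0.1, period: int = 512) -> float:
    """Step size at gradient step t, read as base_lr / sqrt(t / period + 1).

    Assumption: the formula is taken literally from the table row.
    """
    return base_lr / math.sqrt(t / period + 1)


def select_min_train_loss(train_losses: Sequence[float], window: int = 5000) -> int:
    """Index of the step with the lowest train loss among the last `window` steps.

    Assumption: train_losses[i] is the train loss recorded at gradient step i.
    """
    if not train_losses:
        raise ValueError("train_losses must not be empty")
    start = max(0, len(train_losses) - window)
    tail = train_losses[start:]
    # Offset of the smallest loss inside the trailing window.
    best_offset = min(range(len(tail)), key=tail.__getitem__)
    return start + best_offset
```

In a training loop, `step_size(t)` would be passed to the optimizer at each step alongside the weight decay λ, and `select_min_train_loss` would be applied to the logged losses to decide which saved checkpoint to evaluate.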