Deep Double Descent: Where Bigger Models and More Data Hurt
Authors: Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that a variety of modern deep learning tasks exhibit a double-descent phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of train samples actually hurts test performance. |
| Researcher Affiliation | Collaboration | Preetum Nakkiran (Harvard University), Gal Kaplun (Harvard University), Yamini Bansal (Harvard University), Tristan Yang (Harvard University), Boaz Barak (Harvard University), Ilya Sutskever (OpenAI) |
| Pseudocode | No | The paper describes methods in text and provides figures of results, but it does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The PyTorch (Paszke et al., 2017) specification of our ResNets and CNNs is available at https://gitlab.com/harvard-machine-learning/double-descent/tree/master. |
| Open Datasets | Yes | For ResNets and CNNs, we train with cross-entropy loss and the following optimizers: (1) Adam with learning rate 0.0001 for 4K epochs; (2) SGD with an inverse-square-root learning rate for 500K gradient steps. We train Transformers for 80K gradient steps, with 10% label smoothing and no dropout. The paper extensively uses well-known public datasets such as CIFAR-10, CIFAR-100, IWSLT'14 German-to-English, and WMT'14 English-to-French. (A hedged data-loading sketch follows the table.) |
| Dataset Splits | No | The paper discusses train and test errors but does not specify explicit dataset splits for training, validation, and testing (e.g., percentages or counts) or cross-validation details. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions "PyTorch" as a framework but does not specify any version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | Loss function: Unless stated otherwise, we use the cross-entropy loss for all the experiments. Data-augmentation: In experiments where data-augmentation was used, we apply RandomCrop(32, padding=4) and RandomHorizontalFlip. ... Adam: Unless specified otherwise, the learning rate was set constant to 1e-4 and all other parameters were set to their default PyTorch values. SGD: Unless specified otherwise, an inverse-square-root learning rate schedule (defined in the paper's appendix) was used with initial learning rate γ0 = 0.1 and updates every L = 512 gradient steps. No momentum was used. Batch size: All experiments use a batch size of 128. (Hedged PyTorch sketches of this configuration follow the table.) |
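
The augmentation and batch size quoted in the Open Datasets and Experiment Setup rows map onto a standard torchvision input pipeline. The sketch below is a minimal illustration under those quoted settings, not the authors' released code; the data root, `num_workers`, and the absence of normalization are assumptions.

```python
# Minimal sketch of the quoted CIFAR-10 input pipeline (not the authors' code).
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # quoted: RandomCrop(32, padding=4)
    transforms.RandomHorizontalFlip(),      # quoted: RandomHorizontalFlip
    transforms.ToTensor(),
])

# Data root and worker count are assumptions for illustration only.
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=train_transform)
train_loader = DataLoader(train_set, batch_size=128,  # quoted batch size
                          shuffle=True, num_workers=2)
```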
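
The optimizer settings can likewise be sketched in PyTorch. The inverse-square-root schedule is only named, not reproduced, in the quoted text, so the exact form γ_t = γ0 / √(1 + ⌊t/L⌋) used below is an assumption, as is the placeholder model; Adam uses the quoted constant learning rate of 1e-4 with default parameters, and SGD uses γ0 = 0.1, L = 512, and no momentum.

```python
# Hedged sketch of the quoted optimizer configuration (assumed schedule form).
import math
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # placeholder; the paper trains ResNets/CNNs

# Adam as quoted: constant learning rate 1e-4, all other parameters at PyTorch defaults.
adam = torch.optim.Adam(model.parameters(), lr=1e-4)

# SGD as quoted: no momentum, initial learning rate gamma_0 = 0.1, updates every
# L = 512 gradient steps. The schedule gamma_t = gamma_0 / sqrt(1 + floor(t / L))
# is an assumed reading of "inverse-square root".
gamma_0, L = 0.1, 512
sgd = torch.optim.SGD(model.parameters(), lr=gamma_0, momentum=0.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    sgd, lr_lambda=lambda step: 1.0 / math.sqrt(1 + step // L)
)
# Calling scheduler.step() once per gradient step makes `step` count gradient updates,
# so the learning rate decays only every L steps.
```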