Deep Double Descent: Where Bigger Models and More Data Hurt
Authors: Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that a variety of modern deep learning tasks exhibit a double-descent phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of train samples actually hurts test performance. |
| Researcher Affiliation | Collaboration | Preetum Nakkiran (Harvard University), Gal Kaplun (Harvard University), Yamini Bansal (Harvard University), Tristan Yang (Harvard University), Boaz Barak (Harvard University), Ilya Sutskever (OpenAI) |
| Pseudocode | No | The paper describes methods in text and provides figures of results, but it does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The PyTorch (Paszke et al., 2017) specification of our ResNets and CNNs is available at https://gitlab.com/harvard-machine-learning/double-descent/tree/master. |
| Open Datasets | Yes | For ResNets and CNNs, we train with cross-entropy loss and the following optimizers: (1) Adam with learning rate 0.0001 for 4K epochs; (2) SGD with an inverse-square-root learning rate for 500K gradient steps. We train Transformers for 80K gradient steps, with 10% label smoothing and no dropout. The paper extensively uses well-known public datasets such as CIFAR-10, CIFAR-100, IWSLT'14 German-to-English, and WMT'14 English-to-French. (A hedged data-loading sketch follows the table.) |
| Dataset Splits | No | The paper discusses train and test errors but does not specify explicit dataset splits for training, validation, and testing (e.g., percentages or counts) or cross-validation details. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions "PyTorch" as a framework but does not specify any version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | Loss function: Unless stated otherwise, we use the cross-entropy loss for all the experiments. Data-augmentation: In experiments where data-augmentation was used, we apply RandomCrop(32, padding=4) and RandomHorizontalFlip. ... Adam: Unless specified otherwise, the learning rate was set constant to 1e-4 and all other parameters were set to their default PyTorch values. SGD: Unless specified otherwise, an inverse-square-root learning rate schedule (defined in the paper's appendix) was used with initial learning rate γ0 = 0.1 and updates every L = 512 gradient steps. No momentum was used. Batch size: All experiments use a batch size of 128. (Hedged PyTorch sketches of this configuration follow the table.) |
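
The augmentation and batch size quoted in the Open Datasets and Experiment Setup rows map onto a standard torchvision input pipeline. The sketch below is a minimal illustration under those quoted settings, not the authors' released code; the data root, `num_workers`, and the absence of normalization are assumptions.

```python
# Minimal sketch of the quoted CIFAR-10 input pipeline (not the authors' code).
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # quoted: RandomCrop(32, padding=4)
    transforms.RandomHorizontalFlip(),      # quoted: RandomHorizontalFlip
    transforms.ToTensor(),
])

# Data root and worker count are assumptions for illustration only.
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=train_transform)
train_loader = DataLoader(train_set, batch_size=128,  # quoted batch size
                          shuffle=True, num_workers=2)
```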
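
The optimizer settings can likewise be sketched in PyTorch. The inverse-square-root schedule is only named, not reproduced, in the quoted text, so the exact form γ_t = γ0 / √(1 + ⌊t/L⌋) used below is an assumption, as is the placeholder model; Adam uses the quoted constant learning rate of 1e-4 with default parameters, and SGD uses γ0 = 0.1, L = 512, and no momentum.

```python
# Hedged sketch of the quoted optimizer configuration (assumed schedule form).
import math
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # placeholder; the paper trains ResNets/CNNs

# Adam as quoted: constant learning rate 1e-4, all other parameters at PyTorch defaults.
adam = torch.optim.Adam(model.parameters(), lr=1e-4)

# SGD as quoted: no momentum, initial learning rate gamma_0 = 0.1, updates every
# L = 512 gradient steps. The schedule gamma_t = gamma_0 / sqrt(1 + floor(t / L))
# is an assumed reading of "inverse-square root".
gamma_0, L = 0.1, 512
sgd = torch.optim.SGD(model.parameters(), lr=gamma_0, momentum=0.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    sgd, lr_lambda=lambda step: 1.0 / math.sqrt(1 + step // L)
)
# Calling scheduler.step() once per gradient step makes `step` count gradient updates,
# so the learning rate decays only every L steps.
```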