On the Adequacy of Untuned Warmup for Adaptive Optimization
Authors: Jerry Ma, Denis Yarats (pp. 8828-8836)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate untuned exponential warmup (Equation 13), untuned linear warmup (Equation 14), and RAdam across a variety of supervised machine learning tasks. For brevity, all experimental settings are summarized in the main text and comprehensively detailed in Appendix A. 5.1 Image Classification: Using each of the three warmup methods, we train a ResNet-50 model (He et al. 2016) on the ILSVRC (ImageNet) image classification dataset with various configurations of Adam. [...] Table 1 presents the top-1 error rates at the end of training for the three warmup methods. |
| Researcher Affiliation | Collaboration | Jerry Ma (1, 2) and Denis Yarats (3, 4): 1 Booth School of Business, University of Chicago; 2 U.S. Patent and Trademark Office, Department of Commerce; 3 Courant Institute of Mathematical Sciences, New York University; 4 Facebook AI Research |
| Pseudocode | No | The paper provides mathematical equations for optimization algorithms (Eqs. 1-11) but no explicitly labeled "Pseudocode" or "Algorithm" block. |
| Open Source Code | No | No explicit statement or link for the open-source code for their methodology is provided. |
| Open Datasets | Yes | We train a ResNet-50 model (He et al. 2016) on the ILSVRC (ImageNet) image classification dataset, the EMNIST digit recognition task (Cohen et al. 2017), a state-of-the-art Transformer-based language model from Baevski and Auli (2018) on WIKITEXT-103, and a Transformer model (Vaswani et al. 2017) on the WMT16 English-German (EN-DE) dataset. These are all standard, publicly available datasets. |
| Dataset Splits | No | Appendix C.1 provides both training and validation metrics (Figures 7 and 8 respectively) for all tested configurations, reinforcing this trend. While validation is used, explicit percentages or sample counts for train/validation/test splits are not stated in the main text. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for running experiments. |
| Software Dependencies | No | The paper mentions "PyTorch Examples" and "Automatic differentiation in PyTorch" in the references but does not specify version numbers for any software dependencies used in the experiments. |
| Experiment Setup | Yes | We train a ResNet-50 model (He et al. 2016) on the ILSVRC (ImageNet) image classification dataset with various configurations of Adam. Specifically, we sweep over α (learning rate) ∈ {10⁻⁴, 10⁻³, 10⁻²} and β₂ ∈ {0.99, 0.997, 0.999}. [...] We sweep over the following grid of Adam hyperparameters: α (learning rate) ∈ {1×10⁻⁴, 3×10⁻⁴, 5×10⁻⁴} and β₂ ∈ {0.99, 0.998, 0.999}, with β₁ = 0.9 and ϵ = 10⁻⁷ fixed. |
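The rows above quote the paper's warmup rules (Equations 13 and 14) and its Adam sweep without reproducing the formulas. The sketch below is a minimal illustration of how such untuned schedules and the quoted sweep grid could be wired up; the exact functional forms are assumptions based on the commonly cited versions from the paper (linear ramp over 2/(1-β₂) steps; exponential approach at rate 1-β₂), not a verbatim reproduction of Equations 13-14.

```python
import itertools
import math

def untuned_exponential_warmup(t, beta2):
    # Assumed form of Eq. 13: factor approaches 1 at rate (1 - beta2).
    return 1.0 - math.exp(-(1.0 - beta2) * t)

def untuned_linear_warmup(t, beta2):
    # Assumed form of Eq. 14: linear ramp over 2 / (1 - beta2) steps.
    return min(1.0, 0.5 * (1.0 - beta2) * t)

def warmed_lr(alpha, t, beta2, schedule):
    # Effective learning rate at step t: base rate times the warmup factor.
    return alpha * schedule(t, beta2)

# Sweep grids quoted in the "Experiment Setup" row, with
# beta1 = 0.9 and eps = 1e-7 held fixed throughout.
imagenet_grid = list(itertools.product([1e-4, 1e-3, 1e-2],
                                       [0.99, 0.997, 0.999]))
second_grid = list(itertools.product([1e-4, 3e-4, 5e-4],
                                     [0.99, 0.998, 0.999]))

if __name__ == "__main__":
    # Show the warmed learning rate early in training for each ImageNet config.
    for alpha, beta2 in imagenet_grid:
        lr_early = warmed_lr(alpha, t=100, beta2=beta2,
                             schedule=untuned_linear_warmup)
        print(f"alpha={alpha:g} beta2={beta2} lr@t=100: {lr_early:.2e}")
```

Note that both schedules are parameter-free given β₂, which is the paper's central point: the warmup horizon falls out of the optimizer's own second-moment decay rate rather than being tuned separately.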