On the Adequacy of Untuned Warmup for Adaptive Optimization

Authors: Jerry Ma, Denis Yarats

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate untuned exponential warmup (Equation 13), untuned linear warmup (Equation 14), and RAdam across a variety of supervised machine learning tasks. For brevity, all experimental settings are summarized in the main text and comprehensively detailed in Appendix A. [Section 5.1, Image Classification:] Using each of the three warmup methods, we train a ResNet-50 model (He et al. 2016) on the ILSVRC (ImageNet) image classification dataset with various configurations of Adam. [...] Table 1 presents the top-1 error rates at the end of training for the three warmup methods.
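
For orientation, below is a minimal sketch of how the two untuned warmup schedules scale Adam's learning rate. The report does not quote Equations 13 and 14 themselves, so the exact functional forms used here (exponential: 1 - exp(-(1 - β2)·t); linear: a ramp reaching 1 after 2·(1 - β2)^-1 steps) are an assumption about the paper's parameterization and should be checked against the source.

```python
import math

def untuned_exponential_warmup(step: int, beta2: float) -> float:
    # Assumed form of the paper's Eq. 13: omega_t = 1 - exp(-(1 - beta2) * t).
    return 1.0 - math.exp(-(1.0 - beta2) * step)

def untuned_linear_warmup(step: int, beta2: float) -> float:
    # Assumed form of the paper's Eq. 14: linear ramp reaching 1
    # after 2 / (1 - beta2) steps.
    return min(1.0, 0.5 * (1.0 - beta2) * step)

# Usage sketch: rescale Adam's base learning rate every iteration.
base_lr, beta2 = 1e-3, 0.999
for t in range(1, 5001):
    lr_t = base_lr * untuned_linear_warmup(t, beta2)
    # optimizer.param_groups[0]["lr"] = lr_t   # e.g., with torch.optim.Adam
```

Note that both schedules are parameterized entirely by β2, which is what makes them "untuned": no separate warmup-length hyperparameter is introduced.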
Researcher Affiliation | Collaboration | Jerry Ma (1, 2), Denis Yarats (3, 4): (1) Booth School of Business, University of Chicago; (2) U.S. Patent and Trademark Office, Department of Commerce; (3) Courant Institute of Mathematical Sciences, New York University; (4) Facebook AI Research
Pseudocode | No | The paper provides mathematical equations for optimization algorithms (Eqs. 1-11) but no explicitly labeled "Pseudocode" or "Algorithm" block.
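
Although the paper contains no pseudocode block, the update it formalizes is standard Adam (Kingma and Ba 2015); a minimal sketch follows for reference. The correspondence to the paper's Eqs. 1-11 is assumed, not verified against the source.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    # One standard Adam update (Kingma & Ba 2015); eps = 1e-7 matches
    # the fixed setting reported elsewhere in this table.
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment EMA
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment EMA
    m_hat = m / (1.0 - beta1 ** t)              # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Tiny usage example on dummy parameters and gradients.
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_step(theta, np.ones(3), m, v, t=1)
```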
Open Source Code | No | No explicit statement about, or link to, open-source code for the paper's methodology is provided.
Open Datasets | Yes | The paper trains a ResNet-50 model (He et al. 2016) on the ILSVRC (ImageNet) image classification dataset, evaluates on the EMNIST digit recognition task (Cohen et al. 2017), trains a state-of-the-art Transformer-based language model from Baevski and Auli (2018) on WIKITEXT-103, and trains a Transformer model (Vaswani et al. 2017) on the WMT16 English-German (EN-DE) dataset. These are all standard, publicly available datasets.
Dataset Splits | No | Appendix C.1 provides both training and validation metrics (Figures 7 and 8, respectively) for all tested configurations. While a validation set is used, explicit percentages or sample counts for train/validation/test splits are not stated in the main text.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for running the experiments.
Software Dependencies | No | The paper cites "PyTorch Examples" and "Automatic differentiation in PyTorch" in its references but does not specify version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | We train a ResNet-50 model (He et al. 2016) on the ILSVRC (ImageNet) image classification dataset with various configurations of Adam. Specifically, we sweep over α (learning rate) ∈ {10^-4, 10^-3, 10^-2} and β2 ∈ {0.99, 0.997, 0.999}. We also sweep over the following grid of Adam hyperparameters: α (learning rate) ∈ {1×10^-4, 3×10^-4, 5×10^-4} and β2 ∈ {0.99, 0.998, 0.999}, with β1 = 0.9 and ϵ = 10^-7 fixed.
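
The two quoted sweeps are straightforward to enumerate programmatically; a minimal sketch is below. The launch_run entry point is hypothetical (the paper's training code is not available), and only the second grid is stated to hold β1 = 0.9 and ϵ = 10^-7 fixed.

```python
from itertools import product

# ImageNet / ResNet-50 sweep quoted above.
imagenet_grid = [dict(alpha=a, beta2=b2)
                 for a, b2 in product([1e-4, 1e-3, 1e-2], [0.99, 0.997, 0.999])]

# Second quoted sweep, with beta1 = 0.9 and eps = 1e-7 held fixed.
second_grid = [dict(alpha=a, beta1=0.9, beta2=b2, eps=1e-7)
               for a, b2 in product([1e-4, 3e-4, 5e-4], [0.99, 0.998, 0.999])]

for cfg in second_grid:
    # launch_run(cfg)  # hypothetical training launcher; not from the paper
    print(cfg)
```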