The Marginal Value of Momentum for Small Learning Rate SGD

Authors: Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly over both short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.
Researcher Affiliation | Collaboration | Princeton University; Yale University; Stanford University; Toyota Technological Institute at Chicago
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states, 'We fine-tune RoBERTa-large (Liu et al., 2019) on several tasks using the code provided by Malladi et al. (2023),' indicating use of third-party code rather than providing their own.
Open Datasets | Yes | ImageNet Experiments. First, we train ResNet-50 on ImageNet across batch sizes... Language Model Experiments. In fine-tuning a pre-trained model, a small learning rate is also preferable to retain the model's knowledge learned during pre-training. Indeed, we observe that SGD and SGDM behave similarly in this case. We fine-tune RoBERTa-large (Liu et al., 2019) on 5 diverse tasks (SST-2 (Socher et al., 2013), SST-5 (Socher et al., 2013), SNLI (Bowman et al., 2015), TREC (Voorhees and Tice, 2000), and MNLI (Williams et al., 2018)).
Dataset Splits | Yes | We follow the few-shot setting described in Gao et al. (2021) and Malladi et al. (2023), using a grid for SGD based on Malladi et al. (2023) and sampling 512 examples per class (Table 1). Additional settings and trajectories are in Appendix E. Results are averaged over 5 random subsets of the full dataset.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments; it only describes the training process in general terms.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers). It mentions 'pytorch (Paszke et al., 2019)' but does not specify the version used for their experiments.
Experiment Setup | Yes | For SGDM (1), we use the default value of β = 0.9, and grid search for the best learning rate γ over 0.1 · 2^k (k ∈ ℤ)... We fine-tune for 4 epochs with batch sizes 2, 4, and 8 and learning rates 1e-4, 1e-3, and 1e-2... We use batch size B = 512 with two learning rate decays by a factor of 0.1 at epochs 80 and 120. We grid search to find the best learning rate for SGDM (η = 0.2) and then use it to run SGD and SGDM with SVAG. We use β = 0.9 for SGDM. (Illustrative sketches of the SGDM update and of this ImageNet configuration follow the table.)
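
To make the "SGDM (1)" update concrete, here is a minimal sketch, assuming the standard PyTorch-style heavy-ball form m_{t+1} = β·m_t + g_t, θ_{t+1} = θ_t − γ·m_{t+1}; the exact variant in the paper's equation (1) may normalize the momentum buffer differently. The toy quadratic, the noise scale, and all variable names are illustrative choices, not taken from the paper.

    import numpy as np

    def sgd_step(theta, grad, gamma):
        # Plain SGD: theta <- theta - gamma * grad
        return theta - gamma * grad

    def sgdm_step(theta, buf, grad, gamma, beta=0.9):
        # Heavy-ball SGDM (PyTorch-style): buf <- beta * buf + grad,
        # then theta <- theta - gamma * buf
        buf = beta * buf + grad
        return theta - gamma * buf, buf

    # Toy run on f(theta) = 0.5 * ||theta||^2 with Gaussian gradient noise,
    # in the small-learning-rate regime the paper studies.
    rng = np.random.default_rng(0)
    theta_a, theta_b, buf = np.ones(10), np.ones(10), np.zeros(10)
    gamma = 0.01
    for _ in range(2000):
        theta_a = sgd_step(theta_a, theta_a + rng.normal(0, 0.1, 10), gamma)
        theta_b, buf = sgdm_step(theta_b, buf, theta_b + rng.normal(0, 0.1, 10), gamma)
    print(np.linalg.norm(theta_a), np.linalg.norm(theta_b))

Note that a like-for-like comparison typically rescales the SGD learning rate (e.g., by 1/(1 − β)) to match SGDM's effective step size; the sketch above does not do this and is only meant to show the two update rules side by side.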
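
And a sketch of how the ImageNet recipe in the Experiment Setup row could be wired up in PyTorch; the model constructor, the range of k in the learning-rate grid, and the helper name are placeholders rather than details from the paper.

    import torch
    import torchvision

    model = torchvision.models.resnet50()

    # Learning-rate grid of the form 0.1 * 2^k with integer k; the range of k
    # used here is an assumption, not taken from the paper.
    lr_grid = [0.1 * 2 ** k for k in range(-3, 4)]

    def make_optimizer_and_scheduler(lr, momentum=0.9):
        # SGDM with beta = 0.9; plain SGD is the same call with momentum=0.0.
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
        # Two learning-rate decays by a factor of 0.1, at epochs 80 and 120,
        # assuming the scheduler is stepped once per epoch.
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[80, 120], gamma=0.1)
        return opt, sched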