Quasi-hyperbolic momentum and Adam for deep learning

Authors: Jerry Ma, Denis Yarats

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that our algorithms lead to significantly improved training in a variety of settings, including a new state-of-the-art result on WMT16 EN-DE.
Researcher Affiliation | Collaboration | Jerry Ma, Facebook AI Research, Menlo Park, CA, USA (maj@fb.com); Denis Yarats, Facebook AI Research & New York University, New York, NY, USA (denisy@fb.com)
Pseudocode | No | The paper describes the update rules using mathematical equations but does not provide structured pseudocode or clearly labeled algorithm blocks. (An illustrative sketch of the QHM update rule appears after the table.)
Open Source Code | Yes | Code is immediately available at https://github.com/facebookresearch/qhoptim/ (a hedged usage sketch follows the table).
Open Datasets | Yes | EMNIST digits (Cohen et al., 2017), CIFAR-10 (Krizhevsky, 2009), ILSVRC2012 (Russakovsky et al., 2015), WikiText-103 (Merity et al., 2016), MuJoCo (Todorov et al., 2012), WMT16 EN-DE (Vaswani et al., 2017; Ott et al., 2018)
Dataset Splits | Yes | We train for 90 epochs with size-64 minibatches. Each parameterization is run 3 times with different seeds, and we report training loss, training top-1 error, and validation top-1 error. We use a step decay schedule for the learning rate: α ∈ {1, 0.1, 0.01}. That is, the first 30 epochs use α = 1.0, the next 30 epochs use α = 0.1, and the final 30 epochs use α = 0.01. (A minimal scheduler sketch appears after the table.)
Hardware Specification | Yes | Experiments are run on a mix of NVIDIA P100 and V100 GPUs.
Software Dependencies | Yes | All experiments use Python 3.7 and PyTorch 0.4.1 (Paszke et al., 2017). Experiments are run on a mix of NVIDIA P100 and V100 GPUs, along with a mix of CUDA 9.0 and 9.2.
Experiment Setup | Yes | We train for 90 epochs with size-64 minibatches. For QHM, we initialize α = 1 and decay it 10-fold every 30 epochs. The sweep grid for QHM... For QHAdam, we fix α = 10⁻³, ϵ = 10⁻⁸, ν₂ = 1, and β₂ = 0.999, and sweep over ν₁ and β₁.
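
The paper gives the QHM and QHAdam update rules as equations rather than pseudocode. As a rough aid (not taken from the paper's code), the sketch below implements the QHM rule: a momentum buffer that is an exponential moving average of gradients, combined with the plain gradient via a ν-weighted average. Function and variable names are ours; the defaults reflect the paper's recommended ν = 0.7, β = 0.999.

```python
import numpy as np

def qhm_step(theta, g_buf, grad, alpha=1.0, beta=0.999, nu=0.7):
    """One quasi-hyperbolic momentum (QHM) update (illustrative sketch)."""
    # Momentum buffer: exponentially weighted moving average of gradients.
    g_buf = beta * g_buf + (1.0 - beta) * grad
    # QHM step: nu-weighted average of the plain gradient and the buffer.
    theta = theta - alpha * ((1.0 - nu) * grad + nu * g_buf)
    return theta, g_buf

# Toy usage on L(theta) = 0.5 * theta**2, whose gradient is theta itself.
theta, g_buf = np.array([1.0]), np.zeros(1)
for _ in range(100):
    theta, g_buf = qhm_step(theta, g_buf, grad=theta, alpha=0.1)
```

QHAdam applies the same ν-weighted averaging to Adam's bias-corrected first and second moment estimates, using ν₁ and ν₂ respectively.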
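The released qhoptim package provides PyTorch implementations of both optimizers. The snippet below is an illustrative usage sketch only: the import path and keyword names are assumed from memory of the repository's README and should be verified against the repository; ν₂ = 1 and β₂ = 0.999 mirror the values fixed in the experiment setup row, while the ν₁ and β₁ values here are arbitrary points from the swept range.

```python
from torch import nn

# Assumed API of the released qhoptim package (verify against the repo README).
from qhoptim.pyt import QHM, QHAdam

model = nn.Linear(10, 2)  # placeholder model

# QHM with the paper's recommended defaults (nu = 0.7, beta = 0.999).
optimizer = QHM(model.parameters(), lr=1.0, nu=0.7, momentum=0.999)

# QHAdam: nus = (nu1, nu2), betas = (beta1, beta2). nu2 = 1 and beta2 = 0.999
# match the fixed values in the setup row; nu1 and beta1 were swept in the paper.
optimizer = QHAdam(model.parameters(), lr=1e-3,
                   nus=(0.7, 1.0), betas=(0.9, 0.999))
```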
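The step decay schedule quoted above (α = 1.0 for the first 30 epochs, 0.1 for the next 30, 0.01 for the final 30) maps onto a standard PyTorch scheduler. The sketch below uses plain SGD and a placeholder model and loop purely to show the schedule wiring; it is not the paper's training script.

```python
from torch import nn, optim

model = nn.Linear(784, 10)  # placeholder; the paper trains much larger networks

# alpha starts at 1.0 and is decayed 10-fold every 30 epochs: 1.0 -> 0.1 -> 0.01.
optimizer = optim.SGD(model.parameters(), lr=1.0)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):  # 90 epochs of size-64 minibatches in the paper
    # ... loop over the training set, calling optimizer.step() per minibatch ...
    scheduler.step()  # advance the learning-rate schedule once per epoch
```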