Quasi-hyperbolic momentum and Adam for deep learning
Authors: Jerry Ma, Denis Yarats
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our algorithms lead to significantly improved training in a variety of settings, including a new state-of-the-art result on WMT16 EN-DE. |
| Researcher Affiliation | Collaboration | Jerry Ma, Facebook AI Research, Menlo Park, CA, USA (maj@fb.com); Denis Yarats, Facebook AI Research & New York University, New York, NY, USA (denisy@fb.com) |
| Pseudocode | No | The paper describes update rules using mathematical equations but does not provide structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code is immediately available: https://github.com/facebookresearch/qhoptim/ |
| Open Datasets | Yes | EMNIST digits (Cohen et al., 2017), CIFAR10 (Krizhevsky, 2009), ILSVRC2012 (Russakovsky et al., 2015), WikiText-103 (Merity et al., 2016), MuJoCo (Todorov et al., 2012), WMT16 EN-DE (Vaswani et al., 2017; Ott et al., 2018) |
| Dataset Splits | Yes | We train for 90 epochs with size-64 minibatches. Each parameterization is run 3 times with different seeds, and we report training loss, training top-1 error, and validation top-1 error. We use a step decay schedule for the learning rate: α ∈ {1, 0.1, 0.01}. That is, the first 30 epochs use α = 1.0, the next 30 epochs use α = 0.1, and the final 30 epochs use α = 0.01. |
| Hardware Specification | Yes | Experiments are run on a mix of NVIDIA P100 and V100 GPUs |
| Software Dependencies | Yes | All experiments use Python 3.7 and PyTorch 0.4.1 (Paszke et al., 2017). Experiments are run on a mix of NVIDIA P100 and V100 GPUs, along with a mix of CUDA 9.0 and 9.2. |
| Experiment Setup | Yes | We train for 90 epochs with size-64 minibatches. For QHM, we initialize α = 1 and decay it 10-fold every 30 epochs. The sweep grid for QHM... For QHAdam, we fix α = 10⁻³, ϵ = 10⁻⁸, ν₂ = 1, and β₂ = 0.999, and sweep over ν₁ and β₁. |
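
The step decay schedule quoted in the table (α = 1.0, then 0.1, then 0.01 over three 30-epoch blocks) can be sketched in a few lines of plain Python. The second function is a scalar illustration of one QHM step, written from the commonly stated form of the update (a weighted average of the plain gradient and a dampened momentum buffer). It is an assumption for illustration, not the authors' code. The names `step_decay_lr` and `qhm_step` are hypothetical and do not come from the paper or the qhoptim repository, which remains the authoritative implementation.

```python
def step_decay_lr(epoch, base_lr=1.0, factor=0.1, step=30):
    """Learning rate for a 0-indexed epoch, decayed by `factor` every `step` epochs.

    Defaults mirror the quoted setup: alpha = 1.0, 0.1, 0.01 over 90 epochs.
    """
    return base_lr * factor ** (epoch // step)


def qhm_step(theta, grad, buf, lr, nu, beta):
    """One QHM-style update on plain floats (illustrative sketch, assumed form).

    buf   <- beta * buf + (1 - beta) * grad              (dampened momentum buffer)
    theta <- theta - lr * ((1 - nu) * grad + nu * buf)   (nu-weighted average)

    With nu = 0 this reduces to plain SGD; with nu = 1 it reduces to
    momentum with dampening.
    """
    buf = beta * buf + (1.0 - beta) * grad
    theta = theta - lr * ((1.0 - nu) * grad + nu * buf)
    return theta, buf


if __name__ == "__main__":
    # Print the learning rate at the first epoch of each 30-epoch block.
    for epoch in (0, 30, 60):
        print(epoch, step_decay_lr(epoch))
```

In a real training loop the schedule would typically be handed to the framework's own scheduler rather than computed by hand; the sketch only makes the arithmetic of the quoted setup explicit.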