Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Quasi-hyperbolic momentum and Adam for deep learning

Authors: Jerry Ma, Denis Yarats

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that our algorithms lead to significantly improved training in a variety of settings, including a new state-of-the-art result on WMT16 EN-DE.
Researcher Affiliation | Collaboration | Jerry Ma, Facebook AI Research, Menlo Park, CA, USA (EMAIL); Denis Yarats, Facebook AI Research & New York University, New York, NY, USA (EMAIL)
Pseudocode | No | The paper describes its update rules using mathematical equations but does not provide structured pseudocode or clearly labeled algorithm blocks. (A sketch of the QHM update appears after the table.)
Open Source Code | Yes | Code is immediately available: https://github.com/facebookresearch/qhoptim/ (see the usage sketch after the table.)
Open Datasets | Yes | EMNIST digits (Cohen et al., 2017), CIFAR10 (Krizhevsky, 2009), ILSVRC2012 (Russakovsky et al., 2015), WikiText-103 (Merity et al., 2016), MuJoCo (Todorov et al., 2012), WMT16 EN-DE (Vaswani et al., 2017; Ott et al., 2018)
Dataset Splits | Yes | We train for 90 epochs with size-64 minibatches. Each parameterization is run 3 times with different seeds, and we report training loss, training top-1 error, and validation top-1 error. We use a step decay schedule for the learning rate: α ∈ {1, 0.1, 0.01}. That is, the first 30 epochs use α = 1.0, the next 30 epochs use α = 0.1, and the final 30 epochs use α = 0.01.
Hardware Specification | Yes | Experiments are run on a mix of NVIDIA P100 and V100 GPUs.
Software Dependencies | Yes | All experiments use Python 3.7 and PyTorch 0.4.1 (Paszke et al., 2017). Experiments are run on a mix of NVIDIA P100 and V100 GPUs, along with a mix of CUDA 9.0 and 9.2.
Experiment Setup | Yes | We train for 90 epochs with size-64 minibatches. For QHM, we initialize α = 1 and decay it 10-fold every 30 epochs. The sweep grid for QHM... For QHAdam, we fix α = 10⁻³, ε = 10⁻⁸, ν₂ = 1, and β₂ = 0.999, and sweep over ν₁ and β₁.
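
Since the paper presents its update rules only as equations, the following is a minimal PyTorch-style sketch of one QHM step, written from the paper's published two-line formulation. The function and variable names are ours, and the defaults mirror the paper's recommended ν = 0.7, β = 0.999; this is an illustration, not the authors' implementation.

```python
import torch

def qhm_step(params, grads, momentum_bufs, lr=1.0, beta=0.999, nu=0.7):
    """One quasi-hyperbolic momentum (QHM) step, per the paper's equations:

        g_{t+1}     = beta * g_t + (1 - beta) * grad_t
        theta_{t+1} = theta_t - lr * ((1 - nu) * grad_t + nu * g_{t+1})

    nu = 0 recovers plain SGD; nu = 1 recovers (dampened) momentum.
    """
    with torch.no_grad():
        for p, grad, g in zip(params, grads, momentum_bufs):
            g.mul_(beta).add_(grad, alpha=1 - beta)     # momentum buffer g_{t+1}
            p.sub_(lr * ((1 - nu) * grad + nu * g))     # nu-weighted parameter update
```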
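
The linked qhoptim repository ships drop-in PyTorch optimizers; a usage sketch is below. The import path and constructor arguments follow the repository's documented interface as we understand it, and the QHAdam values echo the fixed settings quoted in the Experiment Setup row (α = 10⁻³, ν₂ = 1, β₂ = 0.999, with ε = 10⁻⁸ assumed to be the package default); the ν₁ and β₁ values here are illustrative stand-ins for the swept grid.

```python
import torch
from qhoptim.pyt import QHM, QHAdam  # PyTorch bindings from the linked repo

model = torch.nn.Linear(10, 2)  # stand-in model for illustration

# QHAdam with the paper's fixed alpha, nu2, beta2; nu1 and beta1 were swept.
optimizer = QHAdam(model.parameters(), lr=1e-3,
                   nus=(0.7, 1.0), betas=(0.995, 0.999))

# Or plain QHM, with alpha initialized to 1 (decayed 10-fold every 30 epochs
# via an external learning-rate schedule in the paper's setup).
optimizer = QHM(model.parameters(), lr=1.0, nu=0.7, momentum=0.999)
```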