On the Variance of the Adaptive Learning Rate and Beyond

Authors: Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the efficacy and robustness of RAdam.
Researcher Affiliation | Collaboration | University of Illinois at Urbana-Champaign; Georgia Tech; Microsoft Dynamics 365 AI; Microsoft Research
Pseudocode | Yes | Algorithm 1: Generic adaptive optimization method setup; Algorithm 2: Rectified Adam. (A minimal sketch of the rectified update follows the table.)
Open Source Code | Yes | All implementations are available at: https://github.com/LiyuanLucasLiu/RAdam
Open Datasets | Yes | IWSLT'14 German-to-English translation dataset (Cettolo et al., 2014); One Billion Word (Chelba et al., 2013); CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009)
Dataset Splits | No | The paper uses standard datasets (CIFAR-10, ImageNet, IWSLT'14, WMT'16) that have well-defined splits, but it does not explicitly state training, validation, and test split percentages or sample counts.
Hardware Specification | Yes | All models are trained on one NVIDIA Tesla V100 GPU; we conduct training on one NVIDIA Tesla V100 GPU; we conduct training on four NVIDIA Quadro R8000 GPUs
Software Dependencies | No | The paper mentions using a 'public pytorch re-implementation' and the 'fairseq package' but does not specify version numbers or any other software dependencies.
Experiment Setup | Yes | For Adam and RAdam, we set β1 = 0.9, β2 = 0.999. For SGD, we set the momentum factor to 0.9. The weight decay rate is 10^-4. Random cropping and random horizontal flipping are applied to training data. (A hedged configuration sketch follows the table.)
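
The Pseudocode row points to Algorithm 2 (Rectified Adam). Below is a minimal, single-parameter Python sketch of that update written from the paper's description; the function name, scalar-state formulation, and the epsilon term are illustrative choices, not the repository's implementation.

```python
import math

def radam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One rectified-Adam step for a single scalar parameter (illustrative sketch).

    `t` is the 1-based step count; variable names (rho_inf, rho_t, r_t)
    mirror the paper's notation in Algorithm 2.
    """
    # Exponential moving averages of the gradient and squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment

    # Length of the approximated simple moving average (SMA).
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

    if rho_t > 4.0:
        # Variance of the adaptive learning rate is tractable: rectify it.
        v_hat = math.sqrt(v / (1 - beta2 ** t))
        r_t = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                        / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        param = param - lr * r_t * m_hat / (v_hat + eps)
    else:
        # Early steps: fall back to un-adapted SGD with momentum.
        param = param - lr * m_hat

    return param, m, v
```

For the first few steps rho_t ≤ 4, so the update falls back to momentum SGD; this is the warmup-like behavior the rectification term is designed to provide.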
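
The Experiment Setup row quotes the image-classification hyperparameters. The snippet below is a hedged sketch of how that setup might look in PyTorch; the learning rates, crop padding, and stand-in model are placeholders not taken from the paper, and the commented-out RAdam import assumes the linked repository exposes an optimizer with the standard torch.optim.Adam constructor signature.

```python
import torch
import torchvision.transforms as T

# Training-data augmentation stated in the paper: random cropping and random
# horizontal flipping. The padding value is a common CIFAR-10 choice, not
# quoted in the paper.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

model = torch.nn.Linear(3 * 32 * 32, 10)  # stand-in for the actual network

# Hyperparameters quoted in the Experiment Setup row; learning rates are
# illustrative placeholders.
sgd = torch.optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=1e-4)
adam = torch.optim.Adam(model.parameters(), lr=1e-3,
                        betas=(0.9, 0.999), weight_decay=1e-4)

# RAdam from the authors' repository (https://github.com/LiyuanLucasLiu/RAdam);
# this assumes its constructor mirrors torch.optim.Adam, which is an assumption
# about the repo's API rather than something stated in the paper.
# from radam import RAdam
# radam = RAdam(model.parameters(), lr=1e-3,
#               betas=(0.9, 0.999), weight_decay=1e-4)
```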