On the Variance of the Adaptive Learning Rate and Beyond
Authors: Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the efficacy and robustness of RAdam. |
| Researcher Affiliation | Collaboration | University of Illinois, Urbana-Champaign; Georgia Tech; Microsoft Dynamics 365 AI; Microsoft Research |
| Pseudocode | Yes | Algorithm 1: Generic adaptive optimization method setup; Algorithm 2: Rectified Adam (a sketch of the rectification rule appears below the table). |
| Open Source Code | Yes | All implementations are available at: https://github.com/LiyuanLucasLiu/RAdam. |
| Open Datasets | Yes | IWSLT 14 German to English translation dataset (Cettolo et al., 2014); One Billion Word (Chelba et al., 2013); CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009) |
| Dataset Splits | No | The paper mentions using standard datasets like CIFAR-10, ImageNet, IWSLT 14, and WMT 16, which have well-defined splits, but it does not explicitly state the training, validation, and test split percentages or sample counts within the paper's text. |
| Hardware Specification | Yes | All models are trained on one NVIDIA Tesla V100 GPU; we conduct training on one NVIDIA Tesla V100 GPU; we conduct training on four NVIDIA Quadro R8000 GPUs |
| Software Dependencies | No | The paper mentions using a 'public pytorch re-implementation' and the 'fairseq package' but does not specify their version numbers or any other software dependencies with version information. |
| Experiment Setup | Yes | For Adam and RAdam, we set β1 = 0.9, β2 = 0.999. For SGD, we set the momentum factor as 0.9. The weight decay rate is 10^-4. Random cropping and random horizontal flipping are applied to training data. (A minimal setup sketch follows the table.) |
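The rectification rule named in the Pseudocode row (Algorithm 2: Rectified Adam) can be sketched as a single parameter update. The snippet below is a minimal NumPy illustration, not the authors' implementation: it assumes the quoted defaults β1 = 0.9 and β2 = 0.999, omits weight decay, and adds an assumed epsilon of 1e-8 for numerical stability.

```python
import numpy as np

def radam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One RAdam-style update (sketch of Algorithm 2). t is the 1-based step count."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0          # maximum length of the approximated SMA

    # Exponential moving averages of the gradient and its element-wise square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    # Bias-corrected first moment.
    m_hat = m / (1.0 - beta1 ** t)

    # Length of the approximated SMA at step t.
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

    if rho_t > 4.0:
        # Variance is tractable: adaptive step scaled by the rectification term r_t.
        v_hat = np.sqrt(v / (1.0 - beta2 ** t))
        r_t = np.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                      / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
        theta = theta - lr * r_t * m_hat / (v_hat + eps)
    else:
        # Variance is intractable: fall back to an un-adapted momentum update.
        theta = theta - lr * m_hat

    return theta, m, v
```

When the estimated moving-average length ρ_t is at most 4 (the earliest steps), the adaptive denominator is skipped entirely, which is the mechanism the paper offers in place of a learning-rate warmup.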
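For the Experiment Setup row, a minimal CIFAR-10 configuration under the quoted hyperparameters (β1 = 0.9, β2 = 0.999, weight decay 10^-4, random cropping and horizontal flipping) might look like the sketch below. It uses torch.optim.RAdam (available in recent PyTorch releases) rather than the authors' released implementation, and the backbone (resnet18), batch size (128), learning rate (1e-3), and crop padding (4) are assumptions, not values taken from the paper.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Training-set augmentation quoted in the setup: random cropping and random horizontal flipping.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # padding=4 is an assumed, common CIFAR-10 choice
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                         transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)  # stand-in backbone, not the paper's model

# Hyperparameters quoted in the setup: betas=(0.9, 0.999), weight decay 1e-4; lr is assumed.
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), weight_decay=1e-4)
```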