Win: Weight-Decay-Integrated Nesterov Acceleration for Adaptive Gradient Algorithms
Authors: Pan Zhou, Xingyu Xie, Shuicheng Yan
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results testify to the faster convergence speed and superior performance of our Win-accelerated AdamW, Adam, LAMB and SGD over their non-accelerated counterparts on vision classification tasks and language modeling tasks with both CNN and Transformer backbones. |
| Researcher Affiliation | Collaboration | Pan Zhou¹, Xingyu Xie²,¹, Shuicheng Yan¹ — ¹Sea AI Lab, ²National Key Lab of General AI, School of Intelligence Science and Technology, Peking University |
| Pseudocode | Yes | Algorithm 1: Win-Accelerated AdamW, Adam and LAMB |
| Open Source Code | Yes | Code will be released at https://github.com/sail-sg/win. |
| Open Datasets | Yes | For vision tasks, we test accelerated algorithms on both CNNs, e.g. ResNet (He et al., 2016), and vision transformers (ViTs), e.g. ViT (Dosovitskiy et al., 2020) and PoolFormer (Yu et al., 2021; 2022). For language modeling tasks, we use LSTM (Schmidhuber et al., 1997) and Transformer-XL (Dai et al., 2019) for evaluation. ... evaluate our accelerated algorithms on ImageNet (Fei-Fei, 2009). ... on the Penn Treebank dataset (Marcinkiewicz, 1994) ... on the WikiText-103 dataset. |
| Dataset Splits | No | The paper implies the use of standard splits for datasets like ImageNet, but it does not explicitly provide percentages, sample counts, or specific citations for the train/validation/test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | In all experiments, we do not change model architectures and data augmentations, and only replace the default optimizer with ours. ... their reckless step η̄k always satisfies η̄k = 2ηk. ... For warm-up epochs, for all four accelerated algorithms, we set it as 5.0. For base learning rate, we respectively set it as 3 × 10⁻³, 5 × 10⁻³, 3 × 10⁻³, and 1.2 for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win. ... For weight decay, we respectively set it as 5 × 10⁻², 5 × 10⁻², 10⁻⁶, and 10⁻³ for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win. On ResNet18, all algorithms are trained for 90 epochs with minibatch size 512. (A configuration sketch based on these settings follows the table.) |
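
As a reading aid, the sketch below wires the quoted ResNet18/ImageNet settings for AdamW-Win (base learning rate 3 × 10⁻³, weight decay 5 × 10⁻², 5 warm-up epochs, 90 training epochs, minibatch size 512) into a standard PyTorch training configuration. The paper's Win-accelerated optimizer API is not described in this report (its code is to be released at https://github.com/sail-sg/win), so `torch.optim.AdamW` is used purely as a stand-in, and the warm-up-then-cosine schedule shape is an assumption; only the warm-up length, learning rate, weight decay, epoch count, and batch size come from the quoted text.

```python
# Minimal sketch (not the authors' released code): the reported AdamW-Win
# hyperparameters placed into a standard PyTorch setup. Swap the optimizer
# for the Win-accelerated variant once the sail-sg/win release is available.
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=1000)

# Settings quoted in the Experiment Setup row (AdamW-Win on ResNet18 / ImageNet).
base_lr = 3e-3        # 5e-3 for LAMB-Win, 3e-3 for Adam-Win, 1.2 for SGD-Win
weight_decay = 5e-2   # 5e-2 for LAMB-Win, 1e-6 for Adam-Win, 1e-3 for SGD-Win
warmup_epochs = 5
total_epochs = 90
batch_size = 512

# Stand-in optimizer: plain AdamW, not the Win-accelerated version.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=base_lr, weight_decay=weight_decay)

# Linear warm-up for the first 5 epochs, then cosine decay. The decay shape is
# an assumption beyond the quoted warm-up length.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])
```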