Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training

Authors: Pan Zhou, Xingyu Xie, Zhouchen Lin, Kim-Chuan Toh, Shuicheng Yan

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the faster convergence speed and superior performance of our Win- and Win2-accelerated AdamW, Adam, LAMB and SGD over their vanilla counterparts on vision classification and language modeling tasks. Keywords: Accelerated Adaptive Gradient Algorithms, Deep Learning Optimizer, Network Optimization, Nesterov Acceleration in Deep Learning
Researcher Affiliation | Collaboration | Pan Zhou (EMAIL), School of Computing and Information Systems, Singapore Management University, Singapore; Xingyu Xie (EMAIL), National Key Lab of General AI, School of Intelligence Science and Technology, Peking University, China; Zhouchen Lin (EMAIL), National Key Lab of General AI, School of Intelligence Science and Technology, Peking University, China / Institute for Artificial Intelligence, Peking University, China / Peng Cheng Laboratory, China; Kim-Chuan Toh (EMAIL), Department of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, Singapore; Shuicheng Yan (EMAIL), Skywork AI
Pseudocode | Yes | Algorithm 1: Win-Accelerated AdamW, Adam, LAMB and SGD. Algorithm 2: Win2-Accelerated AdamW, Adam, LAMB and SGD.
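The paper's Algorithms 1 and 2 are not reproduced here. Purely for orientation, the sketch below shows a generic two-sequence Nesterov-style step with proximally integrated (decoupled) weight decay, using the paper's reported step-size coupling η_k^y = 2η_k^x. The function name and the exact update form are my own illustration, not the paper's Win/Win2 update rules, which differ in detail and should be taken from the paper itself.

```python
def nesterov_wd_step(x, y, grad, eta_x, weight_decay=5e-2):
    """Illustrative two-sequence step (NOT the paper's exact Win update).

    The aggressive sequence y takes a step with eta_y = 2 * eta_x (the ratio
    reported in the paper's setup), the conservative sequence x with eta_x.
    Weight decay is integrated proximally: instead of subtracting a decay
    term, each iterate is shrunk by 1 / (1 + eta * weight_decay).
    """
    eta_y = 2.0 * eta_x  # aggressive step tied to the conservative one
    y_new = (y - eta_y * grad) / (1.0 + eta_y * weight_decay)
    x_new = (x - eta_x * grad) / (1.0 + eta_x * weight_decay)
    return x_new, y_new
```

With `weight_decay=0` this reduces to two plain gradient steps of different sizes; the proximal shrinkage is what "integrates" weight decay into the update rather than adding it to the gradient.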
Open Source Code | Yes | Code is released at https://github.com/sail-sg/win.
Open Datasets | Yes | We defer the hyper-parameter settings of the four accelerated algorithms in Table 1 to Appendix A. Results on ResNet18. Here we follow the conventional supervised training setting commonly used in ResNets (He et al., 2016) and evaluate our accelerated algorithms on the ImageNet dataset (Fei-Fei, 2009). Results on Instance Segmentation. Here we evaluate our Win- and Win2-accelerated algorithms on the instance segmentation task... For evaluation, we employ the widely used large-scale COCO dataset (Lin et al., 2014) and adopt the Mask R-CNN (He et al., 2017) framework with the Swin transformer (Liu et al., 2021) as the backbone. Results on LSTM. We follow AdaBelief to test our accelerated algorithms by training a three-layered LSTM (Hochreiter and Schmidhuber, 1997) on the Penn TreeBank dataset (Marcinkiewicz, 1994) for 200 epochs. Results on Transformer-XL. We adopt a widely used language sequence model, i.e., Transformer-XL (Dai et al., 2019)... on the WikiText-103 dataset.
Dataset Splits | Yes | Results on ResNet18. Here we follow the conventional supervised training setting commonly used in ResNets (He et al., 2016) and evaluate our accelerated algorithms on the ImageNet dataset (Fei-Fei, 2009). Results on ResNet50 & 101. Here we adopt the training setting in (Wightman et al., 2021) to train ResNet50 and ResNet101, because this setting uses stronger data augmentation and largely improves CNN performance. Results on ViTs. We follow the widely used official training setting of ViTs (Touvron et al., 2021; Yu et al., 2022a). For evaluation, we employ the widely used large-scale COCO dataset (Lin et al., 2014) and adopt the Mask R-CNN (He et al., 2017) framework... For fairness, we adopt the setting in MMDetection to test all the optimizers and train the models for 12 epochs.
Hardware Specification | Yes | We use both AdamW-Win and LAMB-Win to train ResNet18 for 90 epochs with minibatch size 512 on two A100 GPUs.
Software Dependencies | No | The paper implicitly relies on Python and common deep-learning tooling (PyTorch and MMDetection are referenced through citations or method names), but it specifies no version for Python or for any library.
Experiment Setup | Yes | In all experiments, we do not change model architectures and data augmentations, and only replace the default optimizer with ours. Moreover, for all experiments, our accelerated algorithms, e.g., AdamW-Win and AdamW-Win2, always use the default optimizer-inherent hyper-parameters of the vanilla optimizers, e.g., first- and second-order moment parameters β1 and β2 in AdamW; and their aggressive steps η_k^y and η_k^z always satisfy η_k^y = 2η_k^x and η_k^z = 8η_k^x. These settings well reduce the parameter-tuning cost of our algorithms. In the experiments, same as other optimizers, we only slightly tune other widely tuned hyper-parameters around the default ones used in the vanilla optimizers, e.g., stepsize and warm-up epochs. Appendix A. More Experimental Details: For Win-accelerated algorithms... the aggressive step η_k^y is always 2× the conservative step η_k^x for all iterations, i.e., η_k^y = 2η_k^x. For Win2-accelerated AdamW, Adam, SGD and LAMB, we also always set η_k^y = 2η_k^x and η_k^z = 8η_k^x. For all Win- and Win2-accelerated optimizers, their first- and second-order moment parameters β1 and β2 are set to the default values β1 = 0.9 and β2 = 0.999... The warm-up epochs are set to 5 for all four accelerated algorithms. The base learning rate is set to 3×10^-3, 5×10^-3, 3×10^-3, and 1.2 for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win, respectively. Moreover, we follow the default setting and use cosine learning rate decay. The weight decay is set to 5×10^-2, 5×10^-2, 10^-6, and 10^-3 for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win, respectively. On ResNet18, all algorithms are trained for 90 epochs with minibatch size 512, following the conventional setting.
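The reported setup can be collected into a small configuration sketch. The snippet below only restates the hyper-parameters quoted above; the key names (`base_lr`, `weight_decay`, `step_sizes`, etc.) are illustrative and do not come from the paper's released code.

```python
# Per-optimizer settings restated from the paper's Appendix A excerpt above.
# Key names are illustrative, not taken from the paper's codebase.
WIN_CONFIGS = {
    "AdamW-Win": {"base_lr": 3e-3, "weight_decay": 5e-2},
    "LAMB-Win":  {"base_lr": 5e-3, "weight_decay": 5e-2},
    "Adam-Win":  {"base_lr": 3e-3, "weight_decay": 1e-6},
    "SGD-Win":   {"base_lr": 1.2,  "weight_decay": 1e-3},
}

# Settings shared by all four accelerated algorithms.
SHARED = {
    "beta1": 0.9,            # default first-order moment parameter
    "beta2": 0.999,          # default second-order moment parameter
    "warmup_epochs": 5,
    "lr_schedule": "cosine", # default cosine learning-rate decay
}

def step_sizes(eta_x):
    """Derive the aggressive steps from the conservative step eta_x,
    following the fixed ratios eta_y = 2*eta_x and eta_z = 8*eta_x."""
    return {"eta_x": eta_x, "eta_y": 2 * eta_x, "eta_z": 8 * eta_x}
```

Keeping the aggressive steps as fixed multiples of the conservative step is what lets the authors claim a reduced tuning cost: only the base learning rate (and the usual warm-up) is tuned per optimizer.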