AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix
Authors: Yun Yue, Zhiling Ye, Jiadi Jiang, Yongchao Liu, Ke Zhang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate AGD on public datasets of Natural Language Processing (NLP), Computer Vision (CV), and Recommendation Systems (RecSys). Our experimental results demonstrate that AGD outperforms the state-of-the-art (SOTA) optimizers, achieving highly competitive or significantly better predictive performance. |
| Researcher Affiliation | Industry | Yun Yue, Ant Group, Hangzhou, Zhejiang, China (yueyun.yy@antgroup.com); Zhiling Ye, Ant Group, Hangzhou, Zhejiang, China (yezhiling.yzl@antgroup.com); Jiadi Jiang, Ant Group, Hangzhou, Zhejiang, China (jiadi.jjd@antgroup.com); Yongchao Liu, Ant Group, Hangzhou, Zhejiang, China (yongchao.ly@antgroup.com); Ke Zhang, Ant Group, Beijing, China (yingzi.zk@antgroup.com) |
| Pseudocode | Yes (a hedged sketch of the update rule appears after the table) | Algorithm 1 AGD... Algorithm 2 AGD with AMSGrad condition |
| Open Source Code | Yes | The code is available at this link: https://github.com/intelligent-machine-learning/dlrover/tree/master/atorch/atorch/optimizers |
| Open Datasets | Yes | NLP: We conduct experiments using Language Modeling (LM) on Penn Tree Bank [23] and Neural Machine Translation (NMT) on IWSLT14 German-to-English (De-En) [6] datasets... CV: We conduct experiments using ResNet20 and ResNet32 on the Cifar10 [19] dataset, and ResNet18 on the ImageNet [30] dataset... RecSys: We conduct experiments on two widely used datasets, Avazu [3] and Criteo [11] |
| Dataset Splits | Yes | Table 2: Experiments setup. Task Dataset Model Train Val/Test Params... NLP-LM PTB ... 730K/82K ... NLP-NMT IWSLT14 De-En ... 153K 7K/7K ... CV Cifar10 ... 50K 10K ... ImageNet ... 1.28M 50K |
| Hardware Specification | Yes | We train a Transformer small model for IWSLT14 on a single NVIDIA P100 GPU. |
| Software Dependencies | No | For our NLP and CV experiments, we utilize GPUs with the PyTorch framework [26], while our RecSys experiments are conducted with three parameter servers and five workers in the TensorFlow framework [1]. However, specific version numbers for these software frameworks are not provided. |
| Experiment Setup | Yes (a grid-search sketch appears after the table) | Section 4.1 'Experiment setup' provides details on training processes (e.g., '160 epochs with a learning rate decay at epochs 80 and 120 by a factor of 10 for Cifar10'), batch sizes ('batch size for both datasets is set to 256'), and models. Appendix A.1 'Configuration of optimizers' gives extensive detail on hyperparameter search ranges and chosen values for various optimizers across different tasks (e.g., 'learning rate among {5e-5, 1e-4, 5e-4, 1e-3} and epsilon in {1e-16, 1e-14, 1e-12, 1e-10, 1e-8}'). |
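
The Pseudocode row refers to Algorithm 1 (AGD) and Algorithm 2 (AGD with the AMSGrad condition) in the paper. Below is a minimal NumPy sketch of how we read the Algorithm 1 update: the preconditioner is built from the difference of successive bias-corrected first moments, and a threshold delta in the denominator auto-switches between SGD-with-momentum (small preconditioner) and Adam-like adaptive behavior. The handling of the first step, the bias-correction placement, and the default hyperparameters are our assumptions; the released code at the repository linked above is the authoritative implementation.

```python
import numpy as np

def agd_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, delta=1e-5):
    """One AGD update (hedged sketch of Algorithm 1, not the released code).

    Hyperparameter defaults here are illustrative assumptions, not the
    paper's recommended values.
    """
    t = state["t"] + 1
    m_prev, v_prev = state["m"], state["v"]

    # First moment: exponential moving average of gradients, as in Adam.
    m = beta1 * m_prev + (1 - beta1) * grad

    # Stepwise gradient difference: difference of bias-corrected first moments.
    # At t == 1 there is no previous moment, so the difference reduces to m_hat
    # (our assumption for the first step).
    m_hat = m / (1 - beta1 ** t)
    m_hat_prev = m_prev / (1 - beta1 ** (t - 1)) if t > 1 else np.zeros_like(m)
    s = m_hat - m_hat_prev

    # Second moment built from the gradient difference instead of the raw gradient.
    v = beta2 * v_prev + (1 - beta2) * s * s
    v_hat = v / (1 - beta2 ** t)

    # Auto-switch: where sqrt(v_hat) < delta the denominator is the constant
    # delta (SGD-with-momentum behavior); elsewhere it is sqrt(v_hat) (adaptive).
    denom = np.maximum(np.sqrt(v_hat), delta)
    theta = theta - lr * m_hat / denom

    state.update(t=t, m=m, v=v)
    return theta, state


# Toy usage: a few steps on f(theta) = ||theta||^2.
state = {"t": 0, "m": np.zeros(3), "v": np.zeros(3)}
theta = np.array([1.0, -2.0, 0.5])
for _ in range(5):
    grad = 2.0 * theta
    theta, state = agd_step(theta, grad, state)
```

The `max(sqrt(v_hat), delta)` denominator is what makes the switch automatic: no schedule or manual flag selects between SGD-like and adaptive updates, the per-coordinate magnitude of the gradient-difference preconditioner does.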
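
The Experiment Setup row quotes the hyperparameter grids from Appendix A.1 (learning rate in {5e-5, 1e-4, 5e-4, 1e-3}, epsilon in {1e-16, 1e-14, 1e-12, 1e-10, 1e-8}). A sketch of how such a grid search could be reproduced is below; `train_and_evaluate` is a hypothetical stand-in for the task-specific training loop, not a function from the paper's code.

```python
from itertools import product

def train_and_evaluate(lr: float, epsilon: float) -> float:
    """Hypothetical stand-in: train with (lr, epsilon) on the target task
    and return a validation metric where higher is better."""
    raise NotImplementedError("replace with the task-specific training loop")

# Search grids quoted from Appendix A.1 of the paper.
learning_rates = [5e-5, 1e-4, 5e-4, 1e-3]
epsilons = [1e-16, 1e-14, 1e-12, 1e-10, 1e-8]

best = None
for lr, eps in product(learning_rates, epsilons):
    score = train_and_evaluate(lr=lr, epsilon=eps)
    if best is None or score > best[0]:
        best = (score, lr, eps)

print("best validation score %.4f with lr=%g, eps=%g" % best)
```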