AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix
Authors: Yun Yue, Zhiling Ye, Jiadi Jiang, Yongchao Liu, Ke Zhang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate AGD on public datasets of Natural Language Processing (NLP), Computer Vision (CV), and Recommendation Systems (RecSys). Our experimental results demonstrate that AGD outperforms the state-of-the-art (SOTA) optimizers, achieving highly competitive or significantly better predictive performance. |
| Researcher Affiliation | Industry | Yun Yue, Ant Group, Hangzhou, Zhejiang, China (yueyun.yy@antgroup.com); Zhiling Ye, Ant Group, Hangzhou, Zhejiang, China (yezhiling.yzl@antgroup.com); Jiadi Jiang, Ant Group, Hangzhou, Zhejiang, China (jiadi.jjd@antgroup.com); Yongchao Liu, Ant Group, Hangzhou, Zhejiang, China (yongchao.ly@antgroup.com); Ke Zhang, Ant Group, Beijing, China (yingzi.zk@antgroup.com) |
| Pseudocode | Yes (a hedged sketch of the update rule appears after the table) | Algorithm 1 AGD... Algorithm 2 AGD with AMSGrad condition |
| Open Source Code | Yes | The code is available at this link: https://github.com/intelligent-machine-learning/dlrover/tree/master/atorch/atorch/optimizers |
| Open Datasets | Yes | NLP: We conduct experiments using Language Modeling (LM) on Penn Tree Bank [23] and Neural Machine Translation (NMT) on IWSLT14 German-to-English (De-En) [6] datasets... CV: We conduct experiments using ResNet20 and ResNet32 on the Cifar10 [19] dataset, and ResNet18 on the ImageNet [30] dataset... RecSys: We conduct experiments on two widely used datasets, Avazu [3] and Criteo [11] |
| Dataset Splits | Yes | Table 2: Experiments setup. Task Dataset Model Train Val/Test Params... NLP-LM PTB ... 730K/82K ... NLP-NMT IWSLT14 De-En ... 153K 7K/7K ... CV Cifar10 ... 50K 10K ... ImageNet ... 1.28M 50K |
| Hardware Specification | Yes | We train a Transformer small model for IWSLT14 on a single NVIDIA P100 GPU. |
| Software Dependencies | No | For our NLP and CV experiments, we utilize GPUs with the PyTorch framework [26], while our RecSys experiments are conducted with three parameter servers and five workers in the TensorFlow framework [1]. However, specific version numbers for these software frameworks are not provided. |
| Experiment Setup | Yes (a grid-search sketch appears after the table) | Section 4.1 'Experiment setup' provides details on training processes (e.g., '160 epochs with a learning rate decay at epochs 80 and 120 by a factor of 10 for Cifar10'), batch sizes ('batch size for both datasets is set to 256'), and models. Appendix A.1 'Configuration of optimizers' gives extensive detail on hyperparameter search ranges and chosen values for various optimizers across different tasks (e.g., 'learning rate among {5e-5, 1e-4, 5e-4, 1e-3} and epsilon in {1e-16, 1e-14, 1e-12, 1e-10, 1e-8}'). |
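
The Pseudocode row refers to Algorithm 1 (AGD) and Algorithm 2 (AGD with the AMSGrad condition) in the paper. Below is a minimal NumPy sketch of how we read the Algorithm 1 update: the preconditioner is built from the difference of successive bias-corrected first moments, and a threshold delta in the denominator auto-switches between SGD-with-momentum (small preconditioner) and Adam-like adaptive behavior. The handling of the first step, the bias-correction placement, and the default hyperparameters are our assumptions; the released code at the repository linked above is the authoritative implementation.

```python
import numpy as np

def agd_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, delta=1e-5):
    """One AGD update (hedged sketch of Algorithm 1, not the released code).

    Hyperparameter defaults here are illustrative assumptions, not the
    paper's recommended values.
    """
    t = state["t"] + 1
    m_prev, v_prev = state["m"], state["v"]

    # First moment: exponential moving average of gradients, as in Adam.
    m = beta1 * m_prev + (1 - beta1) * grad

    # Stepwise gradient difference: difference of bias-corrected first moments.
    # At t == 1 there is no previous moment, so the difference reduces to m_hat
    # (our assumption for the first step).
    m_hat = m / (1 - beta1 ** t)
    m_hat_prev = m_prev / (1 - beta1 ** (t - 1)) if t > 1 else np.zeros_like(m)
    s = m_hat - m_hat_prev

    # Second moment built from the gradient difference instead of the raw gradient.
    v = beta2 * v_prev + (1 - beta2) * s * s
    v_hat = v / (1 - beta2 ** t)

    # Auto-switch: where sqrt(v_hat) < delta the denominator is the constant
    # delta (SGD-with-momentum behavior); elsewhere it is sqrt(v_hat) (adaptive).
    denom = np.maximum(np.sqrt(v_hat), delta)
    theta = theta - lr * m_hat / denom

    state.update(t=t, m=m, v=v)
    return theta, state


# Toy usage: a few steps on f(theta) = ||theta||^2.
state = {"t": 0, "m": np.zeros(3), "v": np.zeros(3)}
theta = np.array([1.0, -2.0, 0.5])
for _ in range(5):
    grad = 2.0 * theta
    theta, state = agd_step(theta, grad, state)
```

The `max(sqrt(v_hat), delta)` denominator is what makes the switch automatic: no schedule or manual flag selects between SGD-like and adaptive updates, the per-coordinate magnitude of the gradient-difference preconditioner does.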
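
The Experiment Setup row quotes the hyperparameter grids from Appendix A.1 (learning rate in {5e-5, 1e-4, 5e-4, 1e-3}, epsilon in {1e-16, 1e-14, 1e-12, 1e-10, 1e-8}). A sketch of how such a grid search could be reproduced is below; `train_and_evaluate` is a hypothetical stand-in for the task-specific training loop, not a function from the paper's code.

```python
from itertools import product

def train_and_evaluate(lr: float, epsilon: float) -> float:
    """Hypothetical stand-in: train with (lr, epsilon) on the target task
    and return a validation metric where higher is better."""
    raise NotImplementedError("replace with the task-specific training loop")

# Search grids quoted from Appendix A.1 of the paper.
learning_rates = [5e-5, 1e-4, 5e-4, 1e-3]
epsilons = [1e-16, 1e-14, 1e-12, 1e-10, 1e-8]

best = None
for lr, eps in product(learning_rates, epsilons):
    score = train_and_evaluate(lr=lr, epsilon=eps)
    if best is None or score > best[0]:
        best = (score, lr, eps)

print("best validation score %.4f with lr=%g, eps=%g" % best)
```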