A Layer-Wise Natural Gradient Optimizer for Training Deep Neural Networks
Authors: Xiaolei Liu, Shaoshuai Li, Kaixin Gao, Binfeng Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experiments on image classification and machine translation tasks show that our method is quite competitive compared to the state-of-the-art methods." and "We perform experiments on image classification and machine translation tasks. Numerical results show that LNGD converges faster than SGD, ADAM and KFAC, and LNGD provides a significant improvement in computational time savings when convergence is achieved." |
| Researcher Affiliation | Collaboration | Xiaolei Liu Ant Group Hangzhou, China liuxiaolei.lxl@mybank.cn; Shaoshuai Li Ant Group Hangzhou, China lishaoshuai.lss@mybank.cn; Kaixin Gao Ocean University of China Qingdao, China gaokaixin06@163.com; Binfeng Wang Ant Group Hangzhou, China wangbinfeng.wbf@mybank.cn |
| Pseudocode | Yes | Algorithm 1 LNGD |
| Open Source Code | No | Due to the involvement of proprietary code resources, the disclosure of such materials must adhere to the company's relevant disclosure processes. If necessary, data and code can be made available upon request. |
| Open Datasets | Yes | "We first report the optimizing performance on CIFAR-10 [40], which is a standard task used to benchmark optimization methods [6, 41, 42, 43, 44]." and "We extend our examination of optimizer efficacy to a larger image classification dataset, ImageNet-1K [45]." |
| Dataset Splits | No | The paper trains and tests on CIFAR-10 and ImageNet, which have standard splits, but it does not explicitly state percentages, sample counts, or a methodology for a distinct validation split in its experimental setup. |
| Hardware Specification | Yes | All experiments run on a single A100 GPU using TensorFlow. |
| Software Dependencies | No | All experiments run on a single A100 GPU using TensorFlow. |
| Experiment Setup | Yes | Unless otherwise stated, the batch size for all experiments in the following is set to 256. The initial learning rate hyperparameters for all optimizers are tuned using a grid search with values α ∈ {1e-4, 3e-4, ..., 1, 3}. The damping parameter λ in KFAC [14] is tuned using a grid search with values λ ∈ {1e-6, 1e-4, 3e-4, 1e-3, ..., 1e-1, 3e-1}. The minimum and maximum of the damping parameters ν1 and ν2 in LNGD are set to 1e-5 and 1e-2. The moving average parameter and the momentum for KFAC and LNGD are set to 0.95 and 0.9, respectively. Furthermore, a weight decay of 0.004 is applied in all optimizers. All experimental runs are conducted over a duration of 200 epochs. |
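
For readers reconstructing this setup, the quoted hyperparameter grids can be spelled out as a small configuration sketch. This is not the authors' code: the optimizer labels, field names, and the `build_configs` helper are hypothetical, and the ellipses in the paper's grids are read as the usual 1-3 logarithmic ladder (an assumption).

```python
import itertools

# Hyperparameter grids as quoted in the Experiment Setup row above.
# The "..." in the paper is interpreted as the 1-3 logarithmic ladder (assumption).
LEARNING_RATES = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, 3.0]  # α grid, all optimizers
KFAC_DAMPING = [1e-6, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1]        # λ grid, KFAC only

# Settings reported as fixed across all optimizers.
COMMON = {"batch_size": 256, "weight_decay": 0.004, "epochs": 200}

# LNGD-specific damping bounds plus the shared moving-average / momentum values.
LNGD_EXTRAS = {"nu_min": 1e-5, "nu_max": 1e-2, "moving_average": 0.95, "momentum": 0.9}


def build_configs():
    """Enumerate candidate run configurations for the grid search (hypothetical helper)."""
    configs = []
    # SGD, ADAM, and LNGD are tuned over the learning-rate grid only.
    for optimizer in ("SGD", "ADAM", "LNGD"):
        for lr in LEARNING_RATES:
            cfg = dict(COMMON, optimizer=optimizer, learning_rate=lr)
            if optimizer == "LNGD":
                cfg.update(LNGD_EXTRAS)
            configs.append(cfg)
    # KFAC is tuned over the cross product of learning rate and damping.
    for lr, lam in itertools.product(LEARNING_RATES, KFAC_DAMPING):
        configs.append(dict(COMMON, optimizer="KFAC", learning_rate=lr, damping=lam,
                            moving_average=0.95, momentum=0.9))
    return configs
```

Calling `build_configs()` only enumerates the candidate runs; how each configuration is trained and evaluated over the reported 200 epochs on the single A100 GPU is left to whatever TensorFlow training harness is in use.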