A Layer-Wise Natural Gradient Optimizer for Training Deep Neural Networks

Authors: Xiaolei Liu, Shaoshuai Li, Kaixin Gao, Binfeng Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on image classification and machine translation tasks show that our method is quite competitive compared to the state-of-the-art methods." and "We perform experiments on image classification and machine translation tasks. Numerical results show that LNGD converges faster than SGD, ADAM and KFAC, and LNGD provides a significant improvement in computational time savings when it achieves convergence."
Researcher Affiliation | Collaboration | Xiaolei Liu, Ant Group, Hangzhou, China, liuxiaolei.lxl@mybank.cn; Shaoshuai Li, Ant Group, Hangzhou, China, lishaoshuai.lss@mybank.cn; Kaixin Gao, Ocean University of China, Qingdao, China, gaokaixin06@163.com; Binfeng Wang, Ant Group, Hangzhou, China, wangbinfeng.wbf@mybank.cn
Pseudocode | Yes | Algorithm 1: LNGD (an illustrative layer-wise natural gradient sketch is given after the table)
Open Source Code | No | "Due to the involvement of proprietary code resources, the disclosure of such materials must adhere to the company's relevant disclosure processes. If necessary, data and code can be made available upon request."
Open Datasets | Yes | "We first report the optimizing performance on CIFAR-10 [40], which is a standard task used to benchmark optimization methods [6, 41, 42, 43, 44]." and "We extend our examination of optimizer efficacy to a larger image classification dataset, ImageNet-1K [45]."
Dataset Splits | No | The paper trains and tests on CIFAR-10 and ImageNet, which have standard splits, but it does not explicitly give split percentages, sample counts, or a methodology for a distinct validation set in its experimental setup.
Hardware Specification | Yes | "All experiments run on a single A100 GPU using TensorFlow."
Software Dependencies | No | TensorFlow is the only software named, with no version numbers or other dependencies: "All experiments run on a single A100 GPU using TensorFlow."
Experiment Setup | Yes | "Unless otherwise stated, the batch size for all experiments in the following is set to 256. The initial learning rate hyperparameters for all optimizers are tuned using a grid search with values α ∈ {1e-4, 3e-4, ..., 1, 3}. The damping parameter λ in KFAC [14] is tuned using a grid search with values λ ∈ {1e-6, 1e-4, 3e-4, 1e-3, ..., 1e-1, 3e-1}. The minimum and maximum of the damping parameters ν1 and ν2 in LNGD are set to 1e-5 and 1e-2. The moving average parameter and the momentum for KFAC and LNGD are set to 0.95 and 0.9, respectively. Furthermore, a weight decay of 0.004 is applied in all optimizers. All experimental runs are conducted over a duration of 200 epochs." (A configuration sketch is given after the table.)
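The paper's Algorithm 1 is not reproduced in this summary. For orientation only, the following is a minimal NumPy sketch of a generic KFAC-style layer-wise natural gradient step for a single dense layer; the factor definitions, one-shot matrix inversion, and damping value are assumptions made for illustration and do not represent the paper's actual LNGD algorithm.

```python
import numpy as np

def layerwise_natural_gradient_step(W, a, g, lr=0.1, damping=1e-3):
    """One illustrative KFAC-style natural gradient update for a single dense layer.

    W: weights, shape (d_out, d_in)
    a: inputs to the layer, shape (batch, d_in)
    g: gradients of the loss w.r.t. the layer's pre-activations, shape (batch, d_out)
    """
    batch = a.shape[0]
    grad_W = g.T @ a / batch              # ordinary mini-batch gradient w.r.t. W
    A = a.T @ a / batch                   # input-side Kronecker factor, (d_in, d_in)
    G = g.T @ g / batch                   # output-side Kronecker factor, (d_out, d_out)
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    nat_grad = G_inv @ grad_W @ A_inv     # precondition with the block-wise Fisher approximation
    return W - lr * nat_grad
```

In practice, KFAC-family optimizers (and, per the setup above, LNGD) maintain exponential moving averages of the curvature statistics (0.95 in the paper) and apply momentum (0.9), rather than recomputing and inverting factors from a single batch as this sketch does.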
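The experiment setup quoted above amounts to a small hyperparameter configuration. The sketch below records the reported settings in a plain Python dictionary and wraps the reported learning-rate grid search; `train_and_evaluate` is a hypothetical stand-in for the actual training harness, and the intermediate grid values elided by "..." in the paper are assumed to follow the usual 1x/3x logarithmic ladder.

```python
# Settings reported in the paper's experiment setup. Grid entries marked "assumed"
# fill in the "..." from the paper under a 1x/3x logarithmic-ladder assumption.
config = {
    "batch_size": 256,
    "epochs": 200,
    "weight_decay": 0.004,
    "lr_grid": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3],            # α grid (middle values assumed)
    "kfac_damping_grid": [1e-6, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1],  # λ grid (middle values assumed)
    "lngd_damping_min": 1e-5,   # ν1
    "lngd_damping_max": 1e-2,   # ν2
    "moving_average": 0.95,
    "momentum": 0.9,
}

def tune_learning_rate(train_and_evaluate, config):
    """Grid search over the initial learning rate, as described in the setup.

    `train_and_evaluate(lr, config)` is a hypothetical callable that trains for
    config["epochs"] epochs with the given settings and returns a validation score.
    """
    best_lr, best_score = None, float("-inf")
    for lr in config["lr_grid"]:
        score = train_and_evaluate(lr, config)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score
```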