A Layer-Wise Natural Gradient Optimizer for Training Deep Neural Networks

Authors: Xiaolei Liu, Shaoshuai Li, Kaixin Gao, Binfeng Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on image classification and machine translation tasks show that our method is quite competitive compared to the state-of-the-art methods." and "We perform experiments on image classification and machine translation tasks. Numerical results show that LNGD converges faster than SGD, ADAM and KFAC, and LNGD provides a significant improvement in computational time savings when it achieves convergence."
Researcher Affiliation | Collaboration | Xiaolei Liu, Ant Group, Hangzhou, China, liuxiaolei.lxl@mybank.cn; Shaoshuai Li, Ant Group, Hangzhou, China, lishaoshuai.lss@mybank.cn; Kaixin Gao, Ocean University of China, Qingdao, China, gaokaixin06@163.com; Binfeng Wang, Ant Group, Hangzhou, China, wangbinfeng.wbf@mybank.cn
Pseudocode | Yes | Algorithm 1: LNGD (an illustrative layer-wise natural gradient sketch is given after the table)
Open Source Code | No | "Due to the involvement of proprietary code resources, the disclosure of such materials must adhere to the company's relevant disclosure processes. If necessary, data and code can be made available upon request."
Open Datasets | Yes | "We first report the optimizing performance on CIFAR-10 [40], which is a standard task used to benchmark optimization methods [6, 41, 42, 43, 44]." and "We extend our examination of optimizer efficacy to a larger image classification dataset, ImageNet-1K [45]."
Dataset Splits | No | The paper trains and tests on CIFAR-10 and ImageNet, which have standard splits, but it does not explicitly give split percentages, sample counts, or a methodology for a distinct validation set in its experimental setup.
Hardware Specification | Yes | "All experiments run on a single A100 GPU using TensorFlow."
Software Dependencies | No | TensorFlow is the only software named, with no version numbers or other dependencies: "All experiments run on a single A100 GPU using TensorFlow."
Experiment Setup | Yes | "Unless otherwise stated, the batch size for all experiments in the following is set to 256. The initial learning rate hyperparameters for all optimizers are tuned using a grid search with values α ∈ {1e-4, 3e-4, ..., 1, 3}. The damping parameter λ in KFAC [14] is tuned using a grid search with values λ ∈ {1e-6, 1e-4, 3e-4, 1e-3, ..., 1e-1, 3e-1}. The minimum and maximum of the damping parameters ν1 and ν2 in LNGD are set to 1e-5 and 1e-2. The moving average parameter and the momentum for KFAC and LNGD are set to 0.95 and 0.9, respectively. Furthermore, a weight decay of 0.004 is applied in all optimizers. All experimental runs are conducted over a duration of 200 epochs." (A configuration sketch is given after the table.)
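The paper's Algorithm 1 is not reproduced in this summary. For orientation only, the following is a minimal NumPy sketch of a generic KFAC-style layer-wise natural gradient step for a single dense layer; the factor definitions, one-shot matrix inversion, and damping value are assumptions made for illustration and do not represent the paper's actual LNGD algorithm.

```python
import numpy as np

def layerwise_natural_gradient_step(W, a, g, lr=0.1, damping=1e-3):
    """One illustrative KFAC-style natural gradient update for a single dense layer.

    W: weights, shape (d_out, d_in)
    a: inputs to the layer, shape (batch, d_in)
    g: gradients of the loss w.r.t. the layer's pre-activations, shape (batch, d_out)
    """
    batch = a.shape[0]
    grad_W = g.T @ a / batch              # ordinary mini-batch gradient w.r.t. W
    A = a.T @ a / batch                   # input-side Kronecker factor, (d_in, d_in)
    G = g.T @ g / batch                   # output-side Kronecker factor, (d_out, d_out)
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    nat_grad = G_inv @ grad_W @ A_inv     # precondition with the block-wise Fisher approximation
    return W - lr * nat_grad
```

In practice, KFAC-family optimizers (and, per the setup above, LNGD) maintain exponential moving averages of the curvature statistics (0.95 in the paper) and apply momentum (0.9), rather than recomputing and inverting factors from a single batch as this sketch does.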
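The experiment setup quoted above amounts to a small hyperparameter configuration. The sketch below records the reported settings in a plain Python dictionary and wraps the reported learning-rate grid search; `train_and_evaluate` is a hypothetical stand-in for the actual training harness, and the intermediate grid values elided by "..." in the paper are assumed to follow the usual 1x/3x logarithmic ladder.

```python
# Settings reported in the paper's experiment setup. Grid entries marked "assumed"
# fill in the "..." from the paper under a 1x/3x logarithmic-ladder assumption.
config = {
    "batch_size": 256,
    "epochs": 200,
    "weight_decay": 0.004,
    "lr_grid": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3],            # α grid (middle values assumed)
    "kfac_damping_grid": [1e-6, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1],  # λ grid (middle values assumed)
    "lngd_damping_min": 1e-5,   # ν1
    "lngd_damping_max": 1e-2,   # ν2
    "moving_average": 0.95,
    "momentum": 0.9,
}

def tune_learning_rate(train_and_evaluate, config):
    """Grid search over the initial learning rate, as described in the setup.

    `train_and_evaluate(lr, config)` is a hypothetical callable that trains for
    config["epochs"] epochs with the given settings and returns a validation score.
    """
    best_lr, best_score = None, float("-inf")
    for lr in config["lr_grid"]:
        score = train_and_evaluate(lr, config)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score
```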