An Improved Empirical Fisher Approximation for Natural Gradient Descent

Authors: Xiaodong Wu, Wenyi Yu, Chao Zhang, Phil Woodland

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Optimisation experiments show that applying exact iEF directly as an optimiser provides strong convergence and generalisation. It achieves the best test performance and the lowest training loss for the majority of the tasks.
Researcher Affiliation | Academia | Xiaodong Wu¹, Wenyi Yu², Chao Zhang², Philip Woodland¹; ¹Dept. of Engineering, University of Cambridge; ²Dept. of Electronic Engineering, Tsinghua University.
Pseudocode | Yes | Algorithm 1: Stochastic Optimisation with Exact iEF; Algorithm 2: Stochastic Optimisation with Exact EF; Algorithm 3: Stochastic Optimisation with Exact SF; Algorithm 4: Empirical Evaluation Framework for Approximate NGD Methods; Algorithm 5: The Linear CG Algorithm in the Hessian-free Method (a minimal CG sketch is given after the table).
Open Source Code | No | Codebase will be publicly released.
Open Datasets | Yes | The exact iEF and EF methods are experimentally evaluated using practical deep learning setups, including widely-used setups for parameter-efficient fine-tuning of pre-trained models (T5-base with LoRA and Prompt-Tuning on GLUE tasks, and ViT with LoRA for CIFAR100). The well-known CIFAR100 dataset [19] is used to fine-tune the pretrained ViT model [8].
Dataset Splits | Yes | The validation accuracy (on the dev set of each task) of the best checkpoint is reported and sent for test evaluation (for GLUE, the checkpoints are submitted to the GLUE website [50]).
Hardware Specification | Yes | All the optimisation and evaluation experiments are run on a cloud Linux machine with 8 A100 GPUs, each with 80 GB of GPU memory.
Software Dependencies | No | In PyTorch [32], the per-sample gradients are readily computed during back-propagation, but they are usually accumulated along the batch dimension to compute the total gradient and are not directly available for collection (see the per-sample-gradient sketch after the table).
Experiment Setup | Yes | For the Adafactor baseline optimiser, the hyper-parameters provided in [7] were used (which come from [23]): weight decay 1×10⁻⁵, β₂ = 0.8, learning rate η = 0.3 and no parameter scaling. For the SGD method, the learning rate was η = 100, searched from {0.1, 1, 10, 20, 50, 100}. For the iEF method, the learning rate was η = 50, searched from {1, 10, 50, 100}. For the EF method, a different learning-rate schedule was used to guarantee convergence, due to the inverse scaling of EF updates: the chosen strategy was a linearly decaying normalised update, with the norm of the first update normalised to a value searched from {1×10⁻³, 5×10⁻³, 1×10⁻², 1×10⁻¹, 1, 10} and then decayed linearly to 0. The SF method was trained in the same way as EF with the same set of hyper-parameters. All optimisers used a batch size of 32 (see the configuration sketch after the table).
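
Algorithm 5 quoted above is the linear conjugate-gradient (CG) routine used by the Hessian-free baseline. As a point of reference only, the snippet below is a minimal, generic CG sketch for approximately solving a curvature system Fx = g given a matrix-vector product; the function name `cg_solve`, the iteration cap and the tolerance are illustrative assumptions, not the paper's exact Algorithm 5.

```python
import torch

def cg_solve(matvec, g, max_iters=50, tol=1e-6):
    """Approximately solve F x = g, where `matvec(v)` returns F v
    (e.g. a Gauss-Newton or Fisher vector product)."""
    x = torch.zeros_like(g)
    r = g.clone()                      # residual r = g - F x (x = 0 initially)
    p = r.clone()                      # search direction
    rs_old = r.dot(r)
    for _ in range(max_iters):
        Fp = matvec(p)
        alpha = rs_old / p.dot(Fp)     # step length along p
        x = x + alpha * p
        r = r - alpha * Fp
        rs_new = r.dot(r)
        if rs_new.sqrt() < tol:        # stop once the residual norm is small
            break
        p = r + (rs_new / rs_old) * p  # next conjugate direction
        rs_old = rs_new
    return x
```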
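
The Software Dependencies row notes that PyTorch accumulates gradients over the batch, so per-sample gradients are not exposed directly. One common workaround, shown below as a sketch rather than the authors' actual implementation, is to combine `torch.func.grad` with `torch.func.vmap` (available in PyTorch ≥ 2.0); the toy model and batch size are placeholders.

```python
import torch
from torch.func import functional_call, grad, vmap

model = torch.nn.Linear(10, 2)            # toy stand-in for the real model
params = dict(model.named_parameters())
loss_fn = torch.nn.CrossEntropyLoss()

def sample_loss(p, x, y):
    # Loss of a single example; add a batch dimension of 1 for the forward pass.
    logits = functional_call(model, p, (x.unsqueeze(0),))
    return loss_fn(logits, y.unsqueeze(0))

# Map the single-example gradient function over the batch dimension of (x, y);
# the parameters are shared across samples (in_dims=None for `p`).
per_sample_grads = vmap(grad(sample_loss), in_dims=(None, 0, 0))

x = torch.randn(32, 10)
y = torch.randint(0, 2, (32,))
grads = per_sample_grads(params, x, y)    # each entry has a leading dim of 32
```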
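
For the Experiment Setup row, the sketch below wires up the quoted SGD learning rate and one possible reading of the linearly decaying normalised-update schedule described for EF/SF. Only the grid values and batch size come from the excerpt; the toy model, training horizon, function names and the global-norm normalisation are assumptions for illustration, not the authors' released code.

```python
import torch

model = torch.nn.Linear(10, 2)        # stand-in for the fine-tuned model
total_steps = 1000                    # placeholder training horizon
batch_size = 32                       # as quoted in the excerpt

# SGD baseline with the best learning rate from the quoted grid
# {0.1, 1, 10, 20, 50, 100} -> 100.
sgd = torch.optim.SGD(model.parameters(), lr=100.0)

def apply_normalised_decaying_update(params, update, step, init_scale=1.0):
    """Hypothetical EF/SF update rule: normalise the raw update, scale it by
    `init_scale` (searched from {1e-3, ..., 10}) decayed linearly to 0 over
    training, and apply it to the parameters."""
    scale = init_scale * max(0.0, 1.0 - step / total_steps)
    norm = torch.sqrt(sum((u ** 2).sum() for u in update))
    with torch.no_grad():
        for p, u in zip(params, update):
            p -= scale * u / (norm + 1e-12)
```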