An Improved Empirical Fisher Approximation for Natural Gradient Descent
Authors: Xiaodong Wu, Wenyi Yu, Chao Zhang, Phil Woodland
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Optimisation experiments show that applying exact iEF directly as an optimiser provides strong convergence and generalisation. It achieves the best test performance and the lowest training loss for the majority of the tasks. |
| Researcher Affiliation | Academia | Xiaodong Wu (1), Wenyi Yu (2), Chao Zhang (2), Philip Woodland (1); (1) Dept. of Engineering, University of Cambridge; (2) Dept. of Electronic Engineering, Tsinghua University |
| Pseudocode | Yes | Algorithm 1: Stochastic Optimisation with Exact iEF; Algorithm 2: Stochastic Optimisation with Exact EF; Algorithm 3: Stochastic Optimisation with Exact SF; Algorithm 4: Empirical Evaluation Framework for Approximate NGD Methods; Algorithm 5: The Linear CG Algorithm in the Hessian-free Method |
| Open Source Code | No | Codebase will be publicly released. |
| Open Datasets | Yes | The exact iEF and EF methods are experimentally evaluated using practical deep learning setups, including widely-used setups for parameter-efficient fine-tuning of pre-trained models (T5-base with LoRA and Prompt-Tuning on GLUE tasks, and ViT with LoRA for CIFAR100). The well-known CIFAR100 dataset [19] is used to fine-tune the pretrained ViT model [8]. |
| Dataset Splits | Yes | The validation accuracy (on the dev set of each task) of the best checkpoint is reported and sent for test evaluation (for GLUE, the checkpoints are submitted to GLUE website [50]). |
| Hardware Specification | Yes | All the optimisation experiments and evaluation experiments are run on a cloud Linux machine with 8 A100 GPUs, each with 80 GB of GPU memory. |
| Software Dependencies | No | In PyTorch [32], the per-sample gradients are readily computed during back-propagation, but they are usually accumulated along the batch dimension to compute the total gradient and are therefore not directly available for collection (see the per-sample gradient sketch after this table). |
| Experiment Setup | Yes | For the Adafactor baseline optimiser, the hyper-parameters provided in [7] were used (which come from [23]): weight decay 1×10⁻⁵, β2 = 0.8, learning rate η = 0.3, and no parameter scaling. For the SGD method, the learning rate was η = 100, searched from {0.1, 1, 10, 20, 50, 100}. For the iEF method, the learning rate was η = 50, searched from {1, 10, 50, 100}. For the EF method, a different learning-rate schedule was used to guarantee convergence, due to the inverse scaling of EF updates: a linearly decaying normalised update, with the norm of the first update set to a scale searched from {1×10⁻³, 5×10⁻³, 1×10⁻², 1×10⁻¹, 1, 10} and decaying linearly to 0 (a sketch of this schedule is given after the table). The SF method was trained in the same way as EF with the same set of hyper-parameters. All optimisers used a batch size of 32. |
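
The Software Dependencies row notes that PyTorch accumulates gradients along the batch dimension, so per-sample gradients are not directly available. Below is a minimal sketch of one common workaround using `torch.func`; the model, loss, and tensor shapes are hypothetical illustrations, not the paper's setup.

```python
import torch
import torch.nn as nn
from torch.func import functional_call, grad, vmap

model = nn.Linear(10, 2)              # hypothetical small model for illustration
loss_fn = nn.CrossEntropyLoss()

params = {k: v.detach() for k, v in model.named_parameters()}
buffers = {k: v.detach() for k, v in model.named_buffers()}

def sample_loss(params, buffers, x, y):
    # Treat a single sample as a batch of one so the module sees batched input.
    logits = functional_call(model, (params, buffers), (x.unsqueeze(0),))
    return loss_fn(logits, y.unsqueeze(0))

# grad(...) differentiates w.r.t. the first argument (the parameter dict);
# vmap(...) maps the computation over the batch dimension of x and y.
per_sample_grads = vmap(grad(sample_loss), in_dims=(None, None, 0, 0))

x = torch.randn(32, 10)               # batch of 32 inputs
y = torch.randint(0, 2, (32,))        # batch of 32 labels
grads = per_sample_grads(params, buffers, x, y)
# grads["weight"] has shape (32, 2, 10): one gradient per sample.
```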
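
The Experiment Setup row describes the EF/SF baselines as using a linearly decaying normalised update. The sketch below assumes this means rescaling each update so its norm equals scale × (1 − step/total_steps); the function and argument names are illustrative and not taken from the authors' code.

```python
import torch

def normalised_update(raw_update: torch.Tensor, step: int,
                      total_steps: int, scale: float = 1e-2) -> torch.Tensor:
    """Rescale the raw EF/SF update so its norm decays linearly from `scale` to 0.

    `scale` corresponds to the searched values {1e-3, 5e-3, 1e-2, 1e-1, 1, 10}.
    """
    target_norm = scale * max(0.0, 1.0 - step / total_steps)
    return raw_update * (target_norm / (raw_update.norm() + 1e-12))

# Usage (illustrative): theta = theta - normalised_update(ef_direction, t, T, scale)
```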