Fisher-Legendre (FishLeg) optimization of deep neural networks

Authors: Jezabel R Garcia, Federica Freddi, Stathi Fotiadis, Maolin Li, Sattar Vakili, Alberto Bernacchia, Guillaume Hennequin

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove that the resulting Fisher-Legendre (FishLeg) optimizer converges to a (global) minimum of non-convex functions satisfying the PL condition, which applies in particular to deep linear networks. On standard auto-encoder benchmarks, we show empirically that FishLeg outperforms standard first-order optimization methods, and performs on par with or better than other second-order methods, especially when using small batches.
Researcher Affiliation | Collaboration | Jezabel R Garcia1, Federica Freddi1, Stathi Fotiadis1, Maolin Li1, Sattar Vakili1, Alberto Bernacchia1 & Guillaume Hennequin1,2. 1. MediaTek Research, Cambourne Business Park, CB23 6DW, UK (first.last@mtkresearch.com). 2. Computational and Biological Learning Lab, Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, UK (g.hennequin@eng.cam.ac.uk).
Pseudocode | Yes | Algorithm 1 (FishLeg algorithm, online setting) is provided in Appendix A.1.
Open Source Code | Yes | Our code is available on GitHub.
Open Datasets | Yes | We applied FishLeg to the auto-encoder benchmarks previously used to compare second-order optimization methods; the details of these experiments (model architectures, datasets, etc.) can be found in (Goldfarb et al., 2020). [...] FishLeg performed similarly to KFAC and KBFGS on the FACES and MNIST datasets...
Dataset Splits | No | The paper reports training loss and test error but does not explicitly specify a validation split or its proportion and purpose. Although hyperparameters are optimized, it is unclear whether a dedicated validation set was used for this purpose or whether cross-validation was employed.
Hardware Specification | Yes | We ran a clean wallclock-time comparison between SGDm, KFAC and FishLeg using a unified CPU-only implementation applied to the FACES and MNIST benchmarks. This ensured e.g. that the loss and its gradients were computed in exactly the same way across methods. Overall, one iteration of vanilla FishLeg was 5 times slower than one iteration of SGDm. However, we were able to bring this down to only twice slower by updating λ every 10 iterations, which did not significantly affect performance. Combined with FishLeg's faster progress per iteration, this meant that FishLeg retained a significant advantage in wall-clock time over SGD (Fig. 3), similar to KFAC. In practice we think that it might make sense to update λ more frequently at the beginning of training, and let these updates become sparser as optimization progresses. CPU: Intel Xeon Platinum 8380H @ 2.90GHz, with OpenBLAS compiled for that architecture and multi-threaded with OpenMP (8 threads).
Software Dependencies | No | The paper mentions 'OpenBLAS compiled for that architecture and multi-threaded with OpenMP (8 threads)' but does not specify version numbers for these software components. Therefore, a fully reproducible description of ancillary software is not provided.
Experiment Setup | Yes | Table 1: Optimal hyperparameter values for FishLeg, identified as the result of a grid search over the space shown in Table 2. These hyperparameters were chosen to minimise the training loss. [...] Parameters: minibatch size = 40, η = 0.04, α = 0.001, β = 0.9, ηSGDm = 0.002, ηAdam = 0.0002.
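The FishLeg idea referenced in the Pseudocode row (Algorithm 1, Appendix A.1) can be illustrated with a toy sketch: learn auxiliary parameters λ so that a preconditioner Q(λ) approximates the inverse Fisher, then take preconditioned gradient steps. This is not the paper's Algorithm 1 — the diagonal Q(λ) = softplus(λ), the known diagonal toy "Fisher" F, the quadratic auxiliary loss, and all step sizes below are simplifying assumptions for illustration; the real method uses a structured Q(λ) driven by the network's own loss and data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "Fisher": diagonal and known, only for illustration.
F = np.array([4.0, 1.0, 0.25])

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Auxiliary parameters lambda; Q(lambda) = diag(softplus(lambda)).
lam = np.zeros(3)
alpha = 0.05  # auxiliary learning rate (hypothetical value)

# Learn Q(lambda) ~= F^{-1} by stochastic descent on the quadratic
# auxiliary loss A(lambda) = 1/2 u^T Q F Q u - u^T Q u with u ~ N(0, I),
# whose minimiser over positive-definite Q is F^{-1}.
for _ in range(3000):
    u = rng.standard_normal(3)
    q = softplus(lam)
    grad_q = (q * F - 1.0) * u**2         # dA/dq for diagonal Q
    lam -= alpha * grad_q * sigmoid(lam)  # chain rule through softplus

q = softplus(lam)
print(np.round(q * F, 2))  # each entry should be close to 1.0

# Preconditioned parameter update: theta <- theta - eta * Q(lambda) * grad.
theta = np.ones(3)
eta = 0.5
for _ in range(20):
    grad = F * theta          # gradient of 1/2 theta^T F theta
    theta -= eta * q * grad
print(np.linalg.norm(theta))  # rapid decay once Q ~= F^{-1}
```

Because Q approximately equals the inverse Fisher, the preconditioned step contracts all coordinates at a similar rate regardless of their curvature, which is the per-iteration advantage the review's Hardware row refers to.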
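The hyperparameter selection described in the Experiment Setup row (grid search over Table 2's space, choosing the setting that minimises training loss) can be sketched as follows. The grid values and the `train_loss` stand-in below are placeholders, not the paper's actual Table 2 search space; only the reported optimum (η = 0.04, α = 0.001, β = 0.9) comes from the review above.

```python
import itertools

# Hypothetical grid, loosely modelled on a Table 2-style search space.
grid = {
    "eta":   [0.01, 0.02, 0.04],      # FishLeg learning rate
    "alpha": [0.0005, 0.001, 0.002],  # auxiliary learning rate
    "beta":  [0.9],                   # momentum
}

def train_loss(eta, alpha, beta):
    # Stand-in for a full training run; returns the final training loss.
    # This made-up function is minimised at the optimum reported in Table 1.
    return (eta - 0.04) ** 2 + (alpha - 0.001) ** 2 + (beta - 0.9) ** 2

best = min(
    itertools.product(*grid.values()),
    key=lambda cfg: train_loss(*cfg),
)
print(dict(zip(grid, best)))  # {'eta': 0.04, 'alpha': 0.001, 'beta': 0.9}
```

In practice each grid point would be a full training run on the benchmark, which is why grids like this are kept small and coarse.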