Gradient Descent on Neurons and its Link to Approximate Second-order Optimization

Authors: Frederik Benzing

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we combine tools from prior work to evaluate exact second-order updates with careful ablations to establish a surprising result: Due to its approximations, KFAC is not closely related to second-order updates, and in particular, it significantly outperforms true second-order updates. This challenges widely held beliefs and immediately raises the question why KFAC performs so well. Towards answering this question we present evidence strongly suggesting that KFAC approximates a first-order algorithm, which performs gradient descent on neurons rather than weights. Finally, we show that this optimizer often improves over KFAC in terms of computational cost and data-efficiency.
Researcher Affiliation | Academia | Department of Computer Science, ETH Zurich, Zurich, Switzerland. Correspondence to: Frederik Benzing <benzingf@inf.ethz.ch>.
Pseudocode | Yes | Pseudocode for FOOF is given in Algorithm 1. (An illustrative sketch of the update follows after this table.)
Open Source Code | Yes | We re-emphasise that our results are consistent with many experiments from prior work. To obtain easily interpretable results without unnecessary confounders, we choose a constant step size for all methods, and a constant damping term. This matches the setup of prior work (Desjardins et al., 2015; Zhang et al., 2018b; George et al., 2018; Goldfarb et al., 2020). We re-emphasise that these hyperparameters are optimized carefully and independently for each method and experiment individually. Following the default choice in the KFAC literature (Martens and Grosse, 2015), we usually use a Monte Carlo estimate of the Fisher, based on sampling one label per input. We will also carry out controls with the Full Fisher. Code to validate and run the software is provided.
Open Datasets | Yes | The first set of experiments is carried out on a fully connected network on Fashion MNIST (Xiao et al., 2017) and followed by results on a Wide ResNet (He et al., 2016) on CIFAR10 (Krizhevsky, 2009).
Dataset Splits | No | The paper mentions using well-known datasets like Fashion MNIST and CIFAR10, and restricting to a subset of 1000 images for some experiments, but it does not explicitly provide specific percentages, counts, or citations for standard training/validation/test splits.
Hardware Specification | No | The paper states that experiments were "run on a GPU" and mentions using "PyTorch DataLoaders", but it does not specify any particular GPU model, CPU, or other hardware components used for running the experiments.
Software Dependencies | No | All experiments were implemented in PyTorch (Paszke et al., 2019). While PyTorch is mentioned and cited, a specific version number is not provided, which is required for reproducibility.
Experiment Setup | Yes | Learning rates for all methods were tuned by a grid search, considering values of the form 1 * 10^i, 3 * 10^i for suitable (usually negative) integers i. The damping terms for Natural Gradients, KFAC, FOOF were determined by a grid search over 10^-6, 10^-4, 10^-2, 10^0, 10^2, 10^4, 10^6 on Fashion MNIST and MNIST. ... Unless noted otherwise, we trained networks for 10 epochs with batch size 100 on MNIST or Fashion MNIST. ... HP values for experiments from the main paper are listed in Table D.4.
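Algorithm 1 itself is not reproduced on this page. As a rough illustration of the idea quoted in the table (gradient descent on neurons rather than weights), the sketch below preconditions an ordinary layer gradient by the damped, uncentred covariance of that layer's inputs. The function and variable names are ours, and details such as the averaging and damping conventions may differ from the paper's Algorithm 1.

    import torch

    def foof_layer_update(W, A, dL_dZ, lr=0.1, damping=1.0):
        # W:     (d_out, d_in) weights of a fully connected layer with pre-activations Z = W @ A
        # A:     (d_in, N) inputs to the layer for a mini-batch of N examples
        # dL_dZ: (d_out, N) gradients of the loss w.r.t. the pre-activations
        N = A.shape[1]
        grad_W = dL_dZ @ A.T                      # ordinary gradient of the loss w.r.t. W
        input_cov = A @ A.T / N                   # (d_in, d_in) uncentred input covariance
        damped = input_cov + damping * torch.eye(A.shape[0])
        # Precondition the gradient by the damped input covariance; this is the
        # ridge-regression solution for moving each neuron's pre-activations one
        # gradient-descent step.
        return W - lr * torch.linalg.solve(damped, grad_W.T).T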
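The "Monte Carlo estimate of the Fisher, based on sampling one label per input" mentioned in the Open Source Code row refers to a standard construction: labels are drawn from the model's own predictive distribution, and the resulting gradients are used in place of the true-label gradients. A minimal sketch of that sampling step, assuming a classification model that returns logits (the function name is ours):

    import torch
    import torch.nn.functional as F

    def monte_carlo_fisher_grads(model, x):
        logits = model(x)
        # Sample one label per input from the model's own predictive distribution
        sampled_labels = torch.distributions.Categorical(logits=logits).sample()
        loss = F.cross_entropy(logits, sampled_labels)
        # Gradients w.r.t. the parameters for the sampled labels; KFAC-style methods
        # build their curvature statistics from these backward-pass quantities.
        return torch.autograd.grad(loss, list(model.parameters()))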
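The grids described in the Experiment Setup row can be written out explicitly. The exponent range for the learning rates below is illustrative only, since the paper just says "suitable (usually negative) integers i":

    # Learning rates of the form 1 * 10^i and 3 * 10^i (illustrative exponent range)
    lr_grid = [c * 10.0 ** i for i in range(-6, 1) for c in (1, 3)]
    # Damping grid used for Natural Gradients, KFAC and FOOF
    damping_grid = [10.0 ** i for i in (-6, -4, -2, 0, 2, 4, 6)]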