BackPACK: Packing more into Backprop

Authors: Felix Dangel, Frederik Kunstner, Philipp Hennig

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To illustrate the capabilities of BACKPACK, we use it to implement preconditioned gradient descent optimizers with diagonal approximations of the GGN and recent Kronecker factorizations KFAC (Martens & Grosse, 2015), KFLR, and KFRA (Botev et al., 2017). Our results show that the curvature approximations based on Monte-Carlo (MC) estimates of the GGN, the approach used by KFAC, give similar progress per iteration to their more accurate counterparts, while being much cheaper to compute. While the naïve update rule we implement does not surpass first-order baselines such as SGD with momentum and Adam (Kingma & Ba, 2015), its implementation with various curvature approximations is made straightforward." (Section 1.1) and "We benchmark the overhead of BACKPACK on the CIFAR-10 and CIFAR-100 datasets, using the 3C3D network provided by DEEPOBS (Schneider et al., 2019) and the ALL-CNN-C network of Springenberg et al. (2015). The results are shown in Fig. 6." (Section 3)
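For reference, the snippet below is a minimal sketch of such a diagonally preconditioned update, assuming BACKPACK's DiagGGNMC extension and a toy PyTorch model; the model, data, and hyperparameter values are placeholders and this is not the paper's experiment code.

    import torch
    from backpack import backpack, extend
    from backpack.extensions import DiagGGNMC

    # Toy model and batch (placeholders, not the 3C3D / ALL-CNN-C setups).
    model = extend(torch.nn.Linear(784, 10))
    lossfunc = extend(torch.nn.CrossEntropyLoss())
    X, y = torch.randn(128, 784), torch.randint(0, 10, (128,))

    alpha, lam = 1e-2, 1e-2  # learning rate and damping, tuned over a grid in the paper

    loss = lossfunc(model(X), y)
    with backpack(DiagGGNMC()):   # Monte-Carlo estimate of the diagonal GGN
        loss.backward()           # populates p.diag_ggn_mc alongside p.grad

    with torch.no_grad():
        for p in model.parameters():
            # Damped diagonal preconditioning: p <- p - alpha * grad / (diag(G) + lam)
            p.sub_(alpha * p.grad / (p.diag_ggn_mc + lam))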
Researcher Affiliation | Academia | Felix Dangel, University of Tuebingen, fdangel@tue.mpg.de; Frederik Kunstner, University of Tuebingen, kunstner@cs.ubc.ca; Philipp Hennig, University of Tuebingen and MPI for Intelligent Systems, Tuebingen, ph@tue.mpg.de
Pseudocode | No | No formally labeled pseudocode or algorithm block found. Figure 1 shows code snippets but is not a pseudocode block.
Open Source Code | Yes | "we provide an implementation on top of PYTORCH, coined BACKPACK, available at https://f-dangel.github.io/backpack/."
Open Datasets | Yes | "We benchmark the overhead of BACKPACK on the CIFAR-10 and CIFAR-100 datasets, using the 3C3D network provided by DEEPOBS (Schneider et al., 2019) and the ALL-CNN-C network of Springenberg et al. (2015)."
Dataset Splits | Yes | "The results shown in this work were obtained with the default strategy, favoring highest final accuracy on the validation set." (Section C.1) and "The best hyperparameter setting is chosen according to the final accuracy on a validation set." (Section 4)
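The paper relies on DEEPOBS to manage data loading and splits; purely as an illustration of validation-based selection, and assuming torchvision with an arbitrary 5,000-image hold-out, such a split could be built as follows.

    import torch
    from torchvision import datasets, transforms

    # Illustrative hold-out size; DEEPOBS handles the actual CIFAR-10 splits internally.
    full_train = datasets.CIFAR10("./data", train=True, download=True,
                                  transform=transforms.ToTensor())
    val_size = 5_000
    train_set, val_set = torch.utils.data.random_split(
        full_train, [len(full_train) - val_size, val_size],
        generator=torch.Generator().manual_seed(0),  # make the split reproducible
    )
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
    val_loader = torch.utils.data.DataLoader(val_set, batch_size=128)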
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models) were mentioned for running experiments.
Software Dependencies | No | No specific version numbers were provided for software dependencies; only names like PYTORCH are mentioned.
Experiment Setup | Yes | "Both the learning rate α and damping λ are tuned over the grid α ∈ {10^-4, 10^-3, 10^-2, 10^-1, 1}, λ ∈ {10^-4, 10^-3, 10^-2, 10^-1, 1, 10}." (Section C.2) and "We use the same batch size (N = 128 for all problems, except N = 256 for ALL-CNN-C on CIFAR-100) as the baselines and the optimizers run for the identical number of epochs." (Section C.2)
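A small sketch of the quoted tuning protocol, assuming a hypothetical run_experiment helper that trains for the fixed number of epochs and returns the final validation accuracy; the helper is a stand-in, not code from the paper.

    from itertools import product

    # Grid from Section C.2 for learning rate (alpha) and damping (lambda).
    ALPHAS = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
    LAMBDAS = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]

    def run_experiment(alpha, lam, batch_size=128):
        """Hypothetical stand-in: train the problem for the fixed number of
        epochs and return the final validation accuracy."""
        return 0.0  # replace with an actual training run

    # Evaluate every (alpha, lambda) pair and keep the best by validation accuracy.
    results = {(a, l): run_experiment(a, l) for a, l in product(ALPHAS, LAMBDAS)}
    best_alpha, best_lam = max(results, key=results.get)
    print(f"best: alpha={best_alpha}, lambda={best_lam}")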