BackPACK: Packing more into Backprop
Authors: Felix Dangel, Frederik Kunstner, Philipp Hennig
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To illustrate the capabilities of BACKPACK, we use it to implement preconditioned gradient descent optimizers with diagonal approximations of the GGN and recent Kronecker factorizations KFAC (Martens & Grosse, 2015), KFLR, and KFRA (Botev et al., 2017). Our results show that the curvature approximations based on Monte-Carlo (MC) estimates of the GGN, the approach used by KFAC, give similar progress per iteration to their more accurate counterparts while being much cheaper to compute. While the naïve update rule we implement does not surpass first-order baselines such as SGD with momentum and Adam (Kingma & Ba, 2015), its implementation with various curvature approximations is made straightforward. (Section 1.1) and We benchmark the overhead of BACKPACK on the CIFAR-10 and CIFAR-100 datasets, using the 3C3D network provided by DEEPOBS (Schneider et al., 2019) and the ALL-CNN-C network of Springenberg et al. (2015). The results are shown in Fig. 6. (Section 3) (A minimal sketch of such a preconditioned update with BACKPACK appears after this table.) |
| Researcher Affiliation | Academia | Felix Dangel (University of Tuebingen, fdangel@tue.mpg.de); Frederik Kunstner (University of Tuebingen, kunstner@cs.ubc.ca); Philipp Hennig (University of Tuebingen and MPI for Intelligent Systems, Tuebingen, ph@tue.mpg.de) |
| Pseudocode | No | No formally labeled pseudocode or algorithm block found. Figure 1 shows code snippets but is not a pseudocode block. |
| Open Source Code | Yes | we provide an implementation on top of PYTORCH, coined BACKPACK, available at https://f-dangel.github.io/backpack/. |
| Open Datasets | Yes | We benchmark the overhead of BACKPACK on the CIFAR-10 and CIFAR-100 datasets, using the 3C3D network provided by DEEPOBS (Schneider et al., 2019) and the ALL-CNN-C network of Springenberg et al. (2015). |
| Dataset Splits | Yes | The results shown in this work were obtained with the default strategy, favoring highest final accuracy on the validation set. (Section C.1) and The best hyperparameter setting is chosen according to the final accuracy on a validation set. (Section 4) |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models) were mentioned for running experiments. |
| Software Dependencies | No | No specific version numbers were provided for software dependencies. Only names like PYTORCH are mentioned. |
| Experiment Setup | Yes | Both the learning rate α and damping λ are tuned over the grid α ∈ {10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1}, λ ∈ {10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1, 10}. (Section C.2) and We use the same batch size (N = 128 for all problems, except N = 256 for ALL-CNN-C on CIFAR-100) as the baselines, and the optimizers run for the identical number of epochs. (Section C.2) (See the grid-search sketch following this table.) |
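
The "Research Type" row above quotes the paper's use of BACKPACK to build preconditioned gradient descent optimizers from curvature approximations. The following is a minimal sketch of that idea, not the authors' exact implementation: it uses BACKPACK's public API (`extend`, `backpack`, and the `DiagGGNMC` extension, which populates `p.diag_ggn_mc` during the backward pass). The toy model, data, learning rate, and damping value are illustrative only.

```python
import torch
from torch import nn
from backpack import backpack, extend
from backpack.extensions import DiagGGNMC

# Toy setup; the paper benchmarks DeepOBS problems such as 3C3D on CIFAR-10.
model = extend(nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)))
loss_func = extend(nn.CrossEntropyLoss())

X, y = torch.randn(128, 784), torch.randint(0, 10, (128,))
lr, damping = 1e-2, 1e-2  # illustrative values from the tuning grid above

loss = loss_func(model(X), y)
with backpack(DiagGGNMC()):
    # BackPACK computes the MC-sampled diagonal GGN alongside the gradient
    # and stores it in p.diag_ggn_mc for each parameter p.
    loss.backward()

with torch.no_grad():
    for p in model.parameters():
        # Diagonally preconditioned step: p <- p - lr * g / (diag(G) + damping)
        p.sub_(lr * p.grad / (p.diag_ggn_mc + damping))
```

Swapping `DiagGGNMC` for `DiagGGNExact`, `KFAC`, `KFLR`, or `KFRA` changes only the extension passed to `backpack(...)` and the parameter attribute read in the update, which is the point the quoted passage makes about implementation being straightforward.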
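The "Experiment Setup" row describes tuning the learning rate and damping over a fixed grid and selecting by final validation accuracy. Below is a hedged sketch of that protocol under stated assumptions: `train_and_validate` is a hypothetical placeholder for the DeepOBS training loop, not a function from BACKPACK or DeepOBS.

```python
from itertools import product

def train_and_validate(alpha: float, lamb: float) -> float:
    """Hypothetical placeholder: train with learning rate `alpha` and damping
    `lamb` for the full epoch budget, then return final validation accuracy."""
    return 0.0  # replace with the actual training/validation loop

# Grid from Section C.2 of the paper.
alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
dampings = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]

best = None
for alpha, lamb in product(alphas, dampings):
    val_acc = train_and_validate(alpha, lamb)
    if best is None or val_acc > best[0]:
        best = (val_acc, alpha, lamb)

print(f"Best validation accuracy {best[0]:.3f} at alpha={best[1]}, lambda={best[2]}")
```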