Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures
Authors: Runa Eschenhagen, Alexander Immer, Richard Turner, Frank Schneider, Philipp Hennig
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify this speed difference with a Wide ResNet on CIFAR-10. Moreover, we show that the two K-FAC variations applied to a graph neural network and vision transformer can reach a fixed validation metric target in 50-75% of the number of steps of a first-order reference run, which translates into a comparable improvement in wall-clock time. |
| Researcher Affiliation | Academia | University of Cambridge; Department of Computer Science, ETH Zurich; Max Planck Institute for Intelligent Systems; University of Tübingen; Tübingen AI Center |
| Pseudocode | Yes | Listing 1: Illustration of K-FAC-expand and K-FAC-reduce with code. (A hedged sketch of the two variants is given below the table.) |
| Open Source Code | No | The paper states: "Our K-FAC implementation leverages the ASDL package (Osawa et al., 2023)." This indicates they used an existing, third-party package for their implementation, but it does not explicitly state that *their specific modifications, or the code for the new K-FAC-expand and K-FAC-reduce flavours,* are open-source or provided with a link. |
| Open Datasets | Yes | We empirically verify this speed difference with a Wide ResNet on CIFAR-10. For this, we use a GNN and a vision transformer (ViT), two of the examples in Section 2.2. We use a basic Graph Network (Battaglia et al., 2018) with multi-layer perceptrons as the update functions (cf. Appendix A) and train it on the ogbg-molpcba dataset (Hu et al., 2020). We train a ViT (Dosovitskiy et al., 2021) on ILSVRC-2012 ImageNet (Russakovsky et al., 2015). |
| Dataset Splits | Yes | For training, we have about 350k examples and almost 44k for validation and testing (ogbg-molpcba). There are about 1.3 million training, 50k validation, and 10k test examples (ImageNet). |
| Hardware Specification | Yes | To empirically validate the smaller computational complexity of K-FAC-reduce compared to K-FAC-expand, we time a single preconditioned gradient update step for a Wide ResNet on CIFAR-10 with both approximations and five different batch sizes on an NVIDIA V100 GPU. We use a training batch size of 512 and a single NVIDIA V100 32GB GPU for each run. We use a training batch size of 1,024, and 4 NVIDIA V100 32GB GPUs for all runs on this workload. The results are averaged over three seeds and the timings are obtained by running all runs on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions software frameworks like "JAX (Bradbury et al., 2018)" and "PyTorch (Paszke et al., 2019)" and packages like "ASDL (Osawa et al., 2023)", but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | The reference run algorithm (Nesterov) uses a learning rate of 2.4917728606918423, β1 equal to 0.9449369031171744, weight decay set to 1.2859640541025928e-7, and linear warmup for 3,000 steps and a polynomial schedule with a decay steps factor of 0.861509027839639 and an end factor of 1e-3. The two K-FAC variations use the exact same hyperparameters, but the warmup and expected number of steps (60,000) are multiplied by 0.75, and the learning rate and the damping are tuned via random search. (A hedged sketch of this warmup-plus-polynomial-decay schedule is given below the table.) |
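As context for the pseudocode row, the following is a minimal sketch of the distinction that Listing 1 in the paper illustrates: K-FAC-expand folds the weight-sharing dimension (e.g. tokens of a transformer or nodes of a graph) into the batch before forming the two Kronecker factors, while K-FAC-reduce aggregates over that dimension first, which is why its per-step cost is lower in the timing comparison above. The function names, shapes, aggregation choices (mean vs. sum), and scaling constants here are illustrative assumptions and do not reproduce the authors' listing or the ASDL implementation.

```python
import torch

def kfac_factors_expand(a, g):
    # a: layer inputs under weight sharing, shape [N, R, d_in]
    #    (R = shared dimension, e.g. sequence positions)
    # g: loss gradients w.r.t. the layer outputs, shape [N, R, d_out]
    # Expand treats each of the N*R shared positions as its own example.
    n, r, d_in = a.shape
    a_flat = a.reshape(n * r, d_in)
    g_flat = g.reshape(n * r, -1)
    A = a_flat.T @ a_flat / (n * r)   # input (activation) Kronecker factor
    B = g_flat.T @ g_flat / (n * r)   # output-gradient Kronecker factor
    return A, B

def kfac_factors_reduce(a, g):
    # Reduce first aggregates over the shared dimension R (mean over inputs,
    # sum over output gradients in this sketch), then builds the factors from
    # only N rows -- the source of its lower cost for large R.
    n = a.shape[0]
    a_red = a.mean(dim=1)             # [N, d_in]
    g_red = g.sum(dim=1)              # [N, d_out]
    A = a_red.T @ a_red / n
    B = g_red.T @ g_red / n
    return A, B

def preconditioned_gradient(grad_w, A, B, damping=1e-3):
    # Apply the Kronecker-factored preconditioner (B + damping*I)^{-1} grad_W
    # (A + damping*I)^{-1} to a weight gradient of shape [d_out, d_in].
    d_out, d_in = grad_w.shape
    A_inv = torch.linalg.inv(A + damping * torch.eye(d_in))
    B_inv = torch.linalg.inv(B + damping * torch.eye(d_out))
    return B_inv @ grad_w @ A_inv
```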
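The experiment-setup row describes a linear warmup followed by a polynomial decay with a decay-steps factor and an end factor. One plausible reading of that schedule is sketched below; the polynomial power, the handling of the warmup offset, and the clamping at the end value are assumptions rather than details taken from the paper.

```python
def lr_schedule(step,
                base_lr=2.4917728606918423,
                warmup_steps=3_000,
                total_steps=60_000,
                decay_steps_factor=0.861509027839639,
                end_factor=1e-3,
                power=1.0):
    # Linear warmup from 0 to base_lr over warmup_steps, then a polynomial
    # decay toward end_factor * base_lr over roughly decay_steps_factor *
    # total_steps steps, held constant afterwards.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    decay_steps = max(int(decay_steps_factor * total_steps) - warmup_steps, 1)
    t = min(step - warmup_steps, decay_steps) / decay_steps
    return base_lr * ((1.0 - end_factor) * (1.0 - t) ** power + end_factor)
```

For the K-FAC runs described above, the same shape would apply with warmup_steps and the expected number of steps scaled by 0.75.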