Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures
Authors: Runa Eschenhagen, Alexander Immer, Richard Turner, Frank Schneider, Philipp Hennig
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify this speed difference with a Wide ResNet on CIFAR-10. Moreover, we show that the two K-FAC variations applied to a graph neural network and vision transformer can reach a fixed validation metric target in 50-75% of the number of steps of a first-order reference run, which translates into a comparable improvement in wall-clock time. |
| Researcher Affiliation | Academia | University of Cambridge; Department of Computer Science, ETH Zurich; Max Planck Institute for Intelligent Systems; University of Tübingen; Tübingen AI Center |
| Pseudocode | Yes | Listing 1: Illustration of K-FAC-expand and K-FAC-reduce with code. (A hedged sketch of the two variants is given below the table.) |
| Open Source Code | No | The paper states: "Our K-FAC implementation leverages the ASDL package (Osawa et al., 2023)." This indicates they used an existing, third-party package for their implementation, but it does not explicitly state that *their specific modifications, or the code for the new K-FAC-expand and K-FAC-reduce flavours,* are open-source or provided with a link. |
| Open Datasets | Yes | We empirically verify this speed difference with a Wide ResNet on CIFAR-10. For this, we use a GNN and a vision transformer (ViT), two of the examples in Section 2.2. We use a basic Graph Network (Battaglia et al., 2018) with multi-layer perceptrons as the update functions (cf. Appendix A) and train it on the ogbg-molpcba dataset (Hu et al., 2020). We train a ViT (Dosovitskiy et al., 2021) on ILSVRC-2012 ImageNet (Russakovsky et al., 2015). |
| Dataset Splits | Yes | For training, we have about 350k examples and almost 44k for validation and testing (ogbg-molpcba). There are about 1.3 million training, 50k validation, and 10k test examples (ImageNet). |
| Hardware Specification | Yes | To empirically validate the smaller computational complexity of K-FAC-reduce compared to K-FAC-expand, we time a single preconditioned gradient update step for a Wide ResNet on CIFAR-10 with both approximations and five different batch sizes on an NVIDIA V100 GPU. We use a training batch size of 512 and a single NVIDIA V100 32GB GPU for each run. We use a training batch size of 1,024, and 4 NVIDIA V100 32GB GPUs for all runs on this workload. The results are averaged over three seeds and the timings are obtained by running all runs on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions software frameworks like "JAX (Bradbury et al., 2018)" and "PyTorch (Paszke et al., 2019)" and packages like "ASDL (Osawa et al., 2023)", but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | The reference run algorithm (Nesterov) uses a learning rate of 2.4917728606918423, β1 equal to 0.9449369031171744, weight decay set to 1.2859640541025928e-7, and linear warmup for 3,000 steps and a polynomial schedule with a decay steps factor of 0.861509027839639 and an end factor of 1e-3. The two K-FAC variations use the exact same hyperparameters, but the warmup and expected number of steps (60,000) are multiplied by 0.75, and the learning rate and the damping are tuned via random search. (A hedged sketch of this warmup-plus-polynomial-decay schedule is given below the table.) |
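As context for the pseudocode row, the following is a minimal sketch of the distinction that Listing 1 in the paper illustrates: K-FAC-expand folds the weight-sharing dimension (e.g. tokens of a transformer or nodes of a graph) into the batch before forming the two Kronecker factors, while K-FAC-reduce aggregates over that dimension first, which is why its per-step cost is lower in the timing comparison above. The function names, shapes, aggregation choices (mean vs. sum), and scaling constants here are illustrative assumptions and do not reproduce the authors' listing or the ASDL implementation.

```python
import torch

def kfac_factors_expand(a, g):
    # a: layer inputs under weight sharing, shape [N, R, d_in]
    #    (R = shared dimension, e.g. sequence positions)
    # g: loss gradients w.r.t. the layer outputs, shape [N, R, d_out]
    # Expand treats each of the N*R shared positions as its own example.
    n, r, d_in = a.shape
    a_flat = a.reshape(n * r, d_in)
    g_flat = g.reshape(n * r, -1)
    A = a_flat.T @ a_flat / (n * r)   # input (activation) Kronecker factor
    B = g_flat.T @ g_flat / (n * r)   # output-gradient Kronecker factor
    return A, B

def kfac_factors_reduce(a, g):
    # Reduce first aggregates over the shared dimension R (mean over inputs,
    # sum over output gradients in this sketch), then builds the factors from
    # only N rows -- the source of its lower cost for large R.
    n = a.shape[0]
    a_red = a.mean(dim=1)             # [N, d_in]
    g_red = g.sum(dim=1)              # [N, d_out]
    A = a_red.T @ a_red / n
    B = g_red.T @ g_red / n
    return A, B

def preconditioned_gradient(grad_w, A, B, damping=1e-3):
    # Apply the Kronecker-factored preconditioner (B + damping*I)^{-1} grad_W
    # (A + damping*I)^{-1} to a weight gradient of shape [d_out, d_in].
    d_out, d_in = grad_w.shape
    A_inv = torch.linalg.inv(A + damping * torch.eye(d_in))
    B_inv = torch.linalg.inv(B + damping * torch.eye(d_out))
    return B_inv @ grad_w @ A_inv
```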
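The experiment-setup row describes a linear warmup followed by a polynomial decay with a decay-steps factor and an end factor. One plausible reading of that schedule is sketched below; the polynomial power, the handling of the warmup offset, and the clamping at the end value are assumptions rather than details taken from the paper.

```python
def lr_schedule(step,
                base_lr=2.4917728606918423,
                warmup_steps=3_000,
                total_steps=60_000,
                decay_steps_factor=0.861509027839639,
                end_factor=1e-3,
                power=1.0):
    # Linear warmup from 0 to base_lr over warmup_steps, then a polynomial
    # decay toward end_factor * base_lr over roughly decay_steps_factor *
    # total_steps steps, held constant afterwards.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    decay_steps = max(int(decay_steps_factor * total_steps) - warmup_steps, 1)
    t = min(step - warmup_steps, decay_steps) / decay_steps
    return base_lr * ((1.0 - end_factor) * (1.0 - t) ** power + end_factor)
```

For the K-FAC runs described above, the same shape would apply with warmup_steps and the expected number of steps scaled by 0.75.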