Structured Inverse-Free Natural Gradient Descent: Memory-Efficient & Numerically-Stable KFAC

Authors: Wu Lin, Felix Dangel, Runa Eschenhagen, Kirill Neklyudov, Agustinus Kristiadi, Richard E. Turner, Alireza Makhzani

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On modern neural networks, we show that SINGD is memory-efficient and numerically robust, in contrast to KFAC, and often outperforms AdamW even in half precision. Our work closes a gap between first- and second-order methods in modern low-precision training.
Researcher Affiliation | Academia | 1. Vector Institute; 2. University of Cambridge; 3. University of Toronto.
Pseudocode | Yes | Figure 3: Comparison between KFAC and IKFAC update for one weight matrix $\mathrm{vec}^{-1}(\mu) \in \mathbb{R}^{d_o \times d_i}$. Figure 4: Comparison of a single weight matrix's update between INGD and our extension SINGD via structured Kronecker factors.
Open Source Code | Yes | PyTorch implementation: github.com/f-dangel/singd (a hedged usage sketch is given after the table).
Open Datasets | Yes | We consider image classification tasks with transformer-based models such as Compact-ViT (Hassani et al., 2021), Swin-ViT (Liu et al., 2021), GC-ViT (Hatamizadeh et al., 2023), and HDVT (Lu et al., 2022). We also consider convolution-based models such as VGG (Simonyan & Zisserman, 2014), ConvMixer (Trockman & Kolter, 2023), and Rep-ViT (Wang et al., 2023). We train these models on datasets CIFAR-100 and ImageWoof-10. Note that Rep-ViT is a CNN model inspired by transformers while Compact-ViT is a data-efficient transformer using convolutional tokenization. We also consider a graph convolution model (Kipf & Welling, 2016) denoted by GNN for node classification on dataset Cora. We also train a ViT model on ImageNet-100 (https://www.kaggle.com/datasets/ambityga/imagenet100) to demonstrate the performance of SINGD in large-scale settings (see Fig. 9).
Dataset Splits | No | The paper states: "tune other hyper-parameters of each optimizer using random search" in Section 4. This implies the use of a validation set, but it does not specify the exact split percentages or the methodology for creating the validation set.
Hardware Specification | No | The paper mentions general concepts like "computational power" and "low precision data types" but does not specify any particular hardware (CPU, GPU, or TPU models, or cloud computing instances) used for the experiments.
Software Dependencies | No | The paper mentions "PyTorch" and "JAX" as implementation frameworks. It also provides a GitHub link to a "PyTorch implementation." However, it does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | We fix momentum to 0.9 and tune other hyper-parameters of each optimizer using random search. For VGG and ConvMixer, we decrease the learning rate $\beta_2$ every 40 epochs. For GNN, we use a constant learning rate; all other models use a cosine learning rate schedule. We consider KFAC as a strong baseline for the GNN as suggested by Izadi et al. (2020). We train the GNN in FP-32 so that KFAC performs stably. The search space for the random search can be found in Table 5 in Appendix B. (A minimal sketch of this setup follows the table.)
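
The open-source row above only links the repository, so here is a minimal sketch of how an optimizer like the linked PyTorch implementation would typically be dropped into a training script. The `singd` package name, the `singd.optim.optimizer.SINGD` import path, and the `SINGD(model)` constructor are assumptions inferred from the repository, not verified API; consult the repository README at github.com/f-dangel/singd for the actual interface.

```python
# Hypothetical usage sketch of the SINGD optimizer from github.com/f-dangel/singd.
# The import path and constructor signature below are assumptions, not verified API.
import torch
from singd.optim.optimizer import SINGD  # assumed import path

# Small stand-in model; the paper trains ViT/CNN/GNN architectures.
model = torch.nn.Sequential(
    torch.nn.Linear(32, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
optimizer = SINGD(model)  # assumed: the optimizer hooks into the model's layers

# One optimization step on synthetic data, using the standard PyTorch optimizer loop.
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```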
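
The experiment-setup row reports a fixed momentum of 0.9 and a cosine learning-rate schedule for most models. The following sketch reproduces that scaffolding in plain PyTorch with SGD as a stand-in optimizer (the paper's experiments would use SINGD or baselines such as AdamW/KFAC); the learning rate, epoch count, and synthetic data are placeholders, not the values obtained by the paper's random search (Table 5 in Appendix B).

```python
# Minimal sketch of the reported optimization scaffolding: fixed momentum 0.9 and a
# cosine learning-rate schedule. SGD is a stand-in for the optimizers compared in the
# paper; lr, epochs, and the synthetic data below are placeholders, not tuned values.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for CIFAR-100-style inputs (100 classes, flattened 32x32 RGB).
inputs = torch.randn(256, 3 * 32 * 32)
targets = torch.randint(0, 100, (256,))
train_loader = DataLoader(TensorDataset(inputs, targets), batch_size=64, shuffle=True)

# Stand-in model; the paper trains ViT/CNN/GNN architectures instead.
model = torch.nn.Sequential(
    torch.nn.Linear(3 * 32 * 32, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 100),
)

# Fixed momentum of 0.9 as reported; lr=0.1 is a placeholder for the tuned value.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = CosineAnnealingLR(optimizer, T_max=100)  # cosine schedule over epochs

for epoch in range(100):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # advance the cosine learning-rate schedule once per epoch
```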