Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Structured Inverse-Free Natural Gradient Descent: Memory-Efficient & Numerically-Stable KFAC
Authors: Wu Lin, Felix Dangel, Runa Eschenhagen, Kirill Neklyudov, Agustinus Kristiadi, Richard E. Turner, Alireza Makhzani
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On modern neural networks, we show that SINGD is memory-efficient and numerically robust, in contrast to KFAC, and often outperforms AdamW even in half precision. Our work closes a gap between first- and second-order methods in modern low-precision training. |
| Researcher Affiliation | Academia | 1Vector Institute, 2University of Cambridge, 3University of Toronto. |
| Pseudocode | Yes | Figure 3: Comparison between KFAC and IKFAC update for one weight matrix vec⁻¹(µ) ∈ ℝ^(d_o × d_i). Figure 4: Comparison of a single weight matrix's update between INGD and our extension SINGD via structured Kronecker factors. |
| Open Source Code | Yes | PyTorch implementation: github.com/f-dangel/singd |
| Open Datasets | Yes | We consider image classification tasks with transformer-based models such as CompactViT (Hassani et al., 2021), SwinViT (Liu et al., 2021), GC-ViT (Hatamizadeh et al., 2023), and HDVT (Lu et al., 2022). We also consider convolution-based models such as VGG (Simonyan & Zisserman, 2014), ConvMixer (Trockman & Kolter, 2023), and Rep-ViT (Wang et al., 2023). We train these models on the datasets CIFAR-100 and ImageWoof-10. Note that Rep-ViT is a CNN model inspired by transformers, while CompactViT is a data-efficient transformer using convolutional tokenization. We also consider a graph convolution model (Kipf & Welling, 2016), denoted by GNN, for node classification on the dataset Cora. We also train a ViT model on ImageNet-100 (https://www.kaggle.com/datasets/ambityga/imagenet100) to demonstrate the performance of SINGD in large-scale settings (see Fig. 9). |
| Dataset Splits | No | The paper states: "tune other hyper-parameters of each optimizer using random search" in Section 4. This implies the use of a validation set, but it does not specify the exact split percentages or the methodology for creating the validation set. |
| Hardware Specification | No | The paper mentions general concepts like "computational power" and "low precision data types" but does not specify any particular hardware (CPU, GPU, or TPU models, or cloud computing instances) used for the experiments. |
| Software Dependencies | No | The paper mentions "PyTorch" and "JAX" as implementation frameworks. It also provides a GitHub link to a "PyTorch implementation." However, it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We fix momentum to 0.9 and tune other hyper-parameters of each optimizer using random search. For VGG and ConvMixer, we decrease the learning rate β2 every 40 epochs. For GNN, we use a constant learning rate; all other models use a cosine learning rate schedule. We consider KFAC as a strong baseline for the GNN, as suggested by Izadi et al. (2020). We train the GNN in FP-32 so that KFAC performs stably. The search space for the random search can be found in Table 5 in Appendix B. |
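The two learning-rate schedules quoted in the Experiment Setup row can be sketched as plain functions of the epoch index. This is a minimal illustration, not the authors' code: the step-decay factor of 0.1 and the cosine floor of 0 are assumptions, since the paper only states that β2 is decreased every 40 epochs and that other models use a cosine schedule.

```python
import math

def step_decay(base_lr, epoch, drop_every=40, factor=0.1):
    # Step schedule (used for VGG and ConvMixer in the paper):
    # the learning rate is decreased every `drop_every` epochs.
    # The multiplicative factor 0.1 is an assumption; the paper
    # does not state the decay factor.
    return base_lr * factor ** (epoch // drop_every)

def cosine_schedule(base_lr, epoch, total_epochs):
    # Cosine annealing (used for the remaining models), decaying
    # from `base_lr` at epoch 0 to an assumed floor of 0 at the
    # final epoch.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

if __name__ == "__main__":
    base_lr = 0.1  # hypothetical initial value of beta2
    for epoch in (0, 40, 80):
        print(f"step    epoch {epoch:3d}: {step_decay(base_lr, epoch):.6f}")
    for epoch in (0, 60, 120):
        print(f"cosine  epoch {epoch:3d}: {cosine_schedule(base_lr, epoch, 120):.6f}")
```

Either function can be wrapped in a framework scheduler (e.g. a PyTorch `LambdaLR`) by dividing out `base_lr` to obtain a multiplicative factor per epoch.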