Studying K-FAC Heuristics by Viewing Adam through a Second-Order Lens

Authors: Ross M Clarke, José Miguel Hernández-Lobato

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate AdamQLR on a range of regression and classification tasks at various scales and hyperparameter tuning methodologies, concluding K-FAC's adaptive heuristics are of variable standalone general effectiveness, and finding an untuned AdamQLR setting can achieve comparable performance vs runtime to tuned benchmarks." (Section 4: Experiments)
Researcher Affiliation | Academia | Ross M. Clarke and José Miguel Hernández-Lobato, University of Cambridge. Correspondence to: Ross M. Clarke <rmc78@cam.ac.uk>.
Pseudocode | Yes | Algorithm 1: Adam (Kingma & Ba, 2015); Algorithm 2: AdamQLR
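For context on the pseudocode entry above: Algorithm 1 is the standard Adam update of Kingma & Ba (2015). A minimal NumPy sketch of one bias-corrected Adam step follows; the function name and signature are illustrative, not the paper's code.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015): exponential moving averages of
    the gradient and its square, with bias correction at step t (1-indexed)."""
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On the very first step the bias correction cancels the moment decay exactly, so the update magnitude is approximately the learning rate regardless of gradient scale.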
Open Source Code | Yes | "Code for all our experiments is available at https://github.com/rmclarke/AdamThroughASecondOrderLens. We describe our algorithm fully in Section 3, provide full source code to the reviewers and will publish this code to the community after deanonymisation."
Open Datasets | Yes | Rosenbrock (1960) function, UCI Energy (Tsanas & Xifara, 2012), UCI Protein (Rana, 2013), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), CIFAR-10 (Krizhevsky, 2009), Penn Treebank (Marcus et al., 1999). Table 4: Licences under which we use datasets in this work.
Dataset Splits | Yes | "…otherwise, we separate the standard test set, randomly choose 1/6 (Fashion-MNIST and SVHN) or 1/10 (CIFAR-10) of the remaining data to form a validation set, and use cross-entropy loss. All hyperparameter tuning uses ASHA (Li et al., 2020) over 200 random initialisations, targeting a fixed number of training epochs, subject to a maximum runtime of 15 minutes (only reached for CIFAR-10; see Appendix B.1.4 for experiments using runtime as the primary constraint)."
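The split described above (hold out a random 1/6 or 1/10 of the non-test data as validation) can be sketched as follows; this is an illustrative helper, not the paper's actual data pipeline.

```python
import numpy as np

def make_validation_split(data, val_fraction, seed=0):
    """Randomly hold out `val_fraction` of `data` as a validation set
    (e.g. 1/6 for Fashion-MNIST/SVHN, 1/10 for CIFAR-10), returning
    (train, validation). Hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_val = int(len(data) * val_fraction)
    return data[idx[n_val:]], data[idx[:n_val]]
```

Shuffling indices rather than the data itself keeps the split reproducible from a single seed and applies unchanged to paired inputs and labels.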
Hardware Specification | Yes | "Our experiments were performed on one of the two sets of hardware shown in Table 3. All runtime comparisons were performed on like-for-like hardware. We make use of GPU acceleration throughout, with the JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020) and KFAC-JAX (Botev & Martens, 2022) libraries, along with various related components of the DeepMind JAX Ecosystem (Babuschkin et al., 2020)." Table 3: System configurations used to run our experiments.

Type             | CPU                  | GPU (NVIDIA) | Python  | JAX    | CUDA | cuDNN
Consumer Desktop | Intel Core i7-3930K  | RTX 2080     | 3.10.11 | 0.3.25 | 11.4 | 8.0.5
Local Cluster    | Intel Core i9-10900X | RTX 2080     | 3.10.11 | 0.3.25 | 11.8 | 8.0.5
Software Dependencies | Yes | "We make use of GPU acceleration throughout, with the JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020) and KFAC-JAX (Botev & Martens, 2022) libraries, along with various related components of the DeepMind JAX Ecosystem (Babuschkin et al., 2020)." Version details are given in Table 3 (Hardware Specification row above).
Experiment Setup | Yes | "Except for the Rosenbrock Function and (Untuned) variants, we also tune a batch size over {50, 100, 200, 400, 800, 1600, 3200}. All hyperparameter tuning uses ASHA (Li et al., 2020) over 200 random initialisations, targeting a fixed number of training epochs, subject to a maximum runtime of 15 minutes (only reached for CIFAR-10; see Appendix B.1.4 for experiments using runtime as the primary constraint)." Table 1: Hyperparameter search spaces for Section 4. Table 2: Optimal hyperparameters used to produce the results of Section 4, Appendix B.1.2 and Appendix B.3.
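The tuning procedure quoted above samples 200 random configurations and prunes them under a budget. A minimal synchronous successive-halving sketch, loosely in the spirit of ASHA (Li et al., 2020), is shown below; real ASHA promotes trials asynchronously and the paper uses an existing tuning library, not this helper. All names here are illustrative, and `evaluate` is assumed to return a loss (lower is better) for a config at a given training budget.

```python
import random

def successive_halving_search(sample_config, evaluate,
                              n_configs=200, eta=3, n_rungs=3):
    """Sample random configurations, evaluate each at increasing budgets,
    and keep only the top 1/eta at every rung. Illustrative sketch of the
    successive-halving idea underlying ASHA, not the actual algorithm."""
    configs = [sample_config() for _ in range(n_configs)]
    budget = 1
    for _ in range(n_rungs):
        # Rank survivors by loss at the current budget (lower is better).
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = scored[:max(1, len(scored) // eta)]
        budget *= eta  # survivors earn a larger training budget
    return configs[0]
```

Because poor configurations are discarded after cheap low-budget evaluations, most of the compute is spent on the handful of promising settings.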