Studying K-FAC Heuristics by Viewing Adam through a Second-Order Lens
Authors: Ross M Clarke, José Miguel Hernández-Lobato
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate AdamQLR on a range of regression and classification tasks at various scales and hyperparameter tuning methodologies, concluding K-FAC's adaptive heuristics are of variable standalone general effectiveness, and finding an untuned AdamQLR setting can achieve comparable performance vs runtime to tuned benchmarks. (Section 4: Experiments) |
| Researcher Affiliation | Academia | Ross M. Clarke 1 José Miguel Hernández-Lobato 1 1University of Cambridge. Correspondence to: Ross M. Clarke <rmc78@cam.ac.uk>. |
| Pseudocode | Yes | Algorithm 1: Adam (Kingma & Ba, 2015); Algorithm 2: AdamQLR |
| Open Source Code | Yes | Code for all our experiments is available at https://github.com/rmclarke/AdamThroughASecondOrderLens. We describe our algorithm fully in Section 3, provide full source code to the reviewers and will publish this code to the community after deanonymisation. |
| Open Datasets | Yes | Rosenbrock (1960) Function, UCI Energy (Tsanas & Xifara, 2012), UCI Protein (Rana, 2013), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), CIFAR-10 (Krizhevsky, 2009), Penn Treebank (Marcus et al., 1999). Table 4: Licences under which we use datasets in this work |
| Dataset Splits | Yes | otherwise, we separate the standard test set, randomly choose 1/6 (Fashion-MNIST and SVHN) or 1/10 (CIFAR-10) of the remaining data to form a validation set, and use cross-entropy loss. All hyperparameter tuning uses ASHA (Li et al., 2020) over 200 random initialisations, targeting a fixed number of training epochs, subject to a maximum runtime of 15 minutes (only reached for CIFAR-10; see Appendix B.1.4 for experiments using runtime as the primary constraint). |
| Hardware Specification | Yes | Our experiments were performed on one of the two sets of hardware shown in Table 3. All runtime comparisons were performed on like-for-like hardware. We make use of GPU acceleration throughout, with the JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020) and KFAC-JAX (Botev & Martens, 2022) libraries, along with various related components of the DeepMind JAX Ecosystem (Babuschkin et al., 2020). Table 3: System configurations used to run our experiments. Consumer Desktop: Intel Core i7-3930K CPU, NVIDIA RTX 2080GTX GPU, Python 3.10.11, JAX 0.3.25, CUDA 11.4, cuDNN 8.05. Local Cluster: Intel Core i9-10900X CPU, NVIDIA RTX 2080GTX GPU, Python 3.10.11, JAX 0.3.25, CUDA 11.8, cuDNN 8.05. |
| Software Dependencies | Yes | We make use of GPU acceleration throughout, with the JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020) and KFAC-JAX (Botev & Martens, 2022) libraries, along with various related components of the DeepMind JAX Ecosystem (Babuschkin et al., 2020). Table 3: System configurations used to run our experiments. Consumer Desktop: Python 3.10.11, JAX 0.3.25, CUDA 11.4, cuDNN 8.05. Local Cluster: Python 3.10.11, JAX 0.3.25, CUDA 11.8, cuDNN 8.05. |
| Experiment Setup | Yes | Except for the Rosenbrock Function and (Untuned) variants, we also tune a batch size over {50, 100, 200, 400, 800, 1600, 3200}. All hyperparameter tuning uses ASHA (Li et al., 2020) over 200 random initialisations, targeting a fixed number of training epochs, subject to a maximum runtime of 15 minutes (only reached for CIFAR-10; see Appendix B.1.4 for experiments using runtime as the primary constraint). Table 1: Hyperparameter search spaces for Section 4. Table 2: Optimal hyperparameters used to produce the results of Section 4, Appendix B.1.2 and Appendix B.3 |
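The Pseudocode row above cites Algorithm 1, the standard Adam optimiser of Kingma & Ba (2015), as the baseline the paper's AdamQLR modifies. For reference, a minimal NumPy sketch of one standard Adam update (not the paper's AdamQLR variant, whose learning-rate selection is described in Section 3 of the paper) is:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One update of standard Adam (Kingma & Ba, 2015).

    theta: parameters; grad: gradient at theta;
    m, v: first/second-moment EMA state; t: 1-indexed step count.
    Returns the updated parameters and moment estimates.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment EMA
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise f(x) = x^2 starting from x = 2.
theta = np.array([2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):
    grad = 2.0 * theta                        # gradient of x^2
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Note the per-step displacement is roughly `lr` in magnitude regardless of gradient scale, since `m_hat / sqrt(v_hat)` is approximately ±1 for a consistently-signed gradient; this scale-invariance is the behaviour the paper reinterprets through a second-order lens.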