Studying K-FAC Heuristics by Viewing Adam through a Second-Order Lens

Authors: Ross M Clarke, José Miguel Hernández-Lobato

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate AdamQLR on a range of regression and classification tasks at various scales and hyperparameter tuning methodologies, concluding K-FAC's adaptive heuristics are of variable standalone general effectiveness, and finding an untuned AdamQLR setting can achieve comparable performance vs runtime to tuned benchmarks." (Section 4: Experiments)
Researcher Affiliation | Academia | Ross M. Clarke and José Miguel Hernández-Lobato, University of Cambridge. Correspondence to: Ross M. Clarke <rmc78@cam.ac.uk>.
Pseudocode | Yes | Algorithm 1: Adam (Kingma & Ba, 2015); Algorithm 2: AdamQLR
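For context on the pseudocode entry above: Algorithm 1 is the standard Adam update of Kingma & Ba (2015). A minimal NumPy sketch of one bias-corrected Adam step follows; the function name and signature are illustrative, not the paper's code.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015): exponential moving averages of
    the gradient and its square, with bias correction at step t (1-indexed)."""
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On the very first step the bias correction cancels the moment decay exactly, so the update magnitude is approximately the learning rate regardless of gradient scale.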
Open Source Code | Yes | "Code for all our experiments is available at https://github.com/rmclarke/AdamThroughASecondOrderLens. We describe our algorithm fully in Section 3, provide full source code to the reviewers and will publish this code to the community after deanonymisation."
Open Datasets | Yes | Rosenbrock (1960) function, UCI Energy (Tsanas & Xifara, 2012), UCI Protein (Rana, 2013), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), CIFAR-10 (Krizhevsky, 2009), Penn Treebank (Marcus et al., 1999). Table 4: Licences under which we use datasets in this work.
Dataset Splits | Yes | "…otherwise, we separate the standard test set, randomly choose 1/6 (Fashion-MNIST and SVHN) or 1/10 (CIFAR-10) of the remaining data to form a validation set, and use cross-entropy loss. All hyperparameter tuning uses ASHA (Li et al., 2020) over 200 random initialisations, targeting a fixed number of training epochs, subject to a maximum runtime of 15 minutes (only reached for CIFAR-10; see Appendix B.1.4 for experiments using runtime as the primary constraint)."
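The split described above (hold out a random 1/6 or 1/10 of the non-test data as validation) can be sketched as follows; this is an illustrative helper, not the paper's actual data pipeline.

```python
import numpy as np

def make_validation_split(data, val_fraction, seed=0):
    """Randomly hold out `val_fraction` of `data` as a validation set
    (e.g. 1/6 for Fashion-MNIST/SVHN, 1/10 for CIFAR-10), returning
    (train, validation). Hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_val = int(len(data) * val_fraction)
    return data[idx[n_val:]], data[idx[:n_val]]
```

Shuffling indices rather than the data itself keeps the split reproducible from a single seed and applies unchanged to paired inputs and labels.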
Hardware Specification | Yes | "Our experiments were performed on one of the two sets of hardware shown in Table 3. All runtime comparisons were performed on like-for-like hardware. We make use of GPU acceleration throughout, with the JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020) and KFAC-JAX (Botev & Martens, 2022) libraries, along with various related components of the DeepMind JAX Ecosystem (Babuschkin et al., 2020)." Table 3: System configurations used to run our experiments.

Type             | CPU                  | GPU (NVIDIA) | Python  | JAX    | CUDA | cuDNN
Consumer Desktop | Intel Core i7-3930K  | RTX 2080     | 3.10.11 | 0.3.25 | 11.4 | 8.0.5
Local Cluster    | Intel Core i9-10900X | RTX 2080     | 3.10.11 | 0.3.25 | 11.8 | 8.0.5
Software Dependencies | Yes | "We make use of GPU acceleration throughout, with the JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020) and KFAC-JAX (Botev & Martens, 2022) libraries, along with various related components of the DeepMind JAX Ecosystem (Babuschkin et al., 2020)." Version details are given in Table 3 (Hardware Specification row above).
Experiment Setup | Yes | "Except for the Rosenbrock Function and (Untuned) variants, we also tune a batch size over {50, 100, 200, 400, 800, 1600, 3200}. All hyperparameter tuning uses ASHA (Li et al., 2020) over 200 random initialisations, targeting a fixed number of training epochs, subject to a maximum runtime of 15 minutes (only reached for CIFAR-10; see Appendix B.1.4 for experiments using runtime as the primary constraint)." Table 1: Hyperparameter search spaces for Section 4. Table 2: Optimal hyperparameters used to produce the results of Section 4, Appendix B.1.2 and Appendix B.3.
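The tuning procedure quoted above samples 200 random configurations and prunes them under a budget. A minimal synchronous successive-halving sketch, loosely in the spirit of ASHA (Li et al., 2020), is shown below; real ASHA promotes trials asynchronously and the paper uses an existing tuning library, not this helper. All names here are illustrative, and `evaluate` is assumed to return a loss (lower is better) for a config at a given training budget.

```python
import random

def successive_halving_search(sample_config, evaluate,
                              n_configs=200, eta=3, n_rungs=3):
    """Sample random configurations, evaluate each at increasing budgets,
    and keep only the top 1/eta at every rung. Illustrative sketch of the
    successive-halving idea underlying ASHA, not the actual algorithm."""
    configs = [sample_config() for _ in range(n_configs)]
    budget = 1
    for _ in range(n_rungs):
        # Rank survivors by loss at the current budget (lower is better).
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = scored[:max(1, len(scored) // eta)]
        budget *= eta  # survivors earn a larger training budget
    return configs[0]
```

Because poor configurations are discarded after cheap low-budget evaluations, most of the compute is spent on the handful of promising settings.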