Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels

Authors: Alexander Immer, Tycho F. A. van der Ouderaa, Mark van der Wilk, Gunnar Rätsch, Bernhard Schölkopf

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Experiments: We experimentally validate the proposed estimators on various settings of marginal-likelihood-based hyperparameter optimization for deep learning. Overall, we find that the lower bounds using subsets of data or outputs often provide a better trade-off between performance and computational or memory complexity. In particular, they remain relatively tight even when applied only on small subsets of data and outputs. Therefore, they can greatly accelerate hyperparameter optimization with Laplace approximations, making marginal-likelihood optimization possible at larger scale. (A sketch of the subset-based estimate follows the table.)
Researcher Affiliation | Academia | 1 Department of Computer Science, ETH Zurich, Switzerland; 2 Max Planck Institute for Intelligent Systems, Tübingen, Germany; 3 Imperial College London, UK.
Pseudocode | Yes | Algorithm 1: Stochastic Marginal Likelihood Estimate
Open Source Code | Yes | Code: github.com/AlexImmer/ntk-marglik
Open Datasets | Yes | MNIST (LeCun & Cortes, 2010), CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky et al.), Tiny ImageNet dataset (Le & Yang, 2015)
Dataset Splits | No | Bayesian model selection, where we consider hyperparameters and neural network weights jointly as part of a probabilistic model, is amenable to gradient-based optimization and also does not require any validation data (MacKay, 2003).
Hardware Specification | Yes | The experiments are run on an internal compute cluster with different NVIDIA GPUs. The timing experiments are run on a single A100 sequentially to ensure comparability. We therefore use NVIDIA A100 GPUs with 80 GB memory to make the bounds as tight as possible and improve performance.
Software Dependencies | No | For our implementation, we modify and extend the asdl library (Osawa, 2021) that offers fast computation of KFAC and NTK, as well as laplace-torch (Daxberger et al., 2021) for the marginal likelihood approximations. (A usage sketch follows the table.)
Experiment Setup | Yes | For network parameters we use a learning rate of 10^-3, decayed to 10^-9 with cosine decay, and a batch size of 250. The invariance and prior learning hyperparameters follow the settings of Immer et al. (2022b): 10 epochs of burn-in, after which hyperparameters are updated every epoch with learning rates of 0.1 and 0.05 for the prior precision and invariance parameters, respectively. Both are decayed by a factor of 10 using cosine decay. (A schedule sketch follows the table.)
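
The Research Type and Pseudocode rows above refer to subset-of-data lower bounds on the Laplace-approximate log marginal likelihood (Algorithm 1 in the paper). The following is a minimal sketch of that idea, assuming an isotropic Gaussian prior and a hypothetical `ggn_logdet_fn` helper that stands in for the KFAC/NTK log-determinant machinery; it illustrates the general recipe rather than reproducing the authors' Algorithm 1.

```python
import torch


def subset_laplace_marglik(model, nll_fn, subset_loader, n_total,
                           prior_precision, ggn_logdet_fn):
    """Subset-of-data estimate of the Laplace log marginal likelihood.

    Assumptions: `nll_fn` returns the summed negative log-likelihood of a
    batch, the prior is an isotropic Gaussian N(0, prior_precision^{-1} I),
    and `ggn_logdet_fn` is a hypothetical helper returning
    log det(GGN + prior_precision * I) estimated from the subset and rescaled
    to the full dataset size (the role played by KFAC/NTK structures in the
    paper).
    """
    device = next(model.parameters()).device
    n_subset = len(subset_loader.dataset)
    scale = n_total / n_subset  # rescale subset terms to the full dataset

    # Data fit: summed log-likelihood on the subset, rescaled to n_total points.
    log_lik = torch.zeros((), device=device)
    for x, y in subset_loader:
        log_lik = log_lik - nll_fn(model(x.to(device)), y.to(device))
    log_lik = scale * log_lik

    theta = torch.nn.utils.parameters_to_vector(model.parameters())
    num_params = theta.numel()
    log_prior_prec = torch.log(torch.as_tensor(prior_precision,
                                               dtype=theta.dtype, device=device))

    # Laplace approximation around the current (approximate MAP) parameters:
    # log Z ~= log p(D|w) - 0.5 * delta * ||w||^2 + 0.5 * P * log(delta)
    #          - 0.5 * log det(GGN + delta * I)
    return (log_lik
            - 0.5 * prior_precision * theta.pow(2).sum()
            + 0.5 * num_params * log_prior_prec
            - 0.5 * ggn_logdet_fn(model, subset_loader, prior_precision, scale))
```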
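
The Software Dependencies row names laplace-torch and asdl but no pinned versions. As a usage illustration, here is a minimal, hedged sketch of tuning the prior precision by maximizing the differentiable log marginal likelihood exposed by laplace-torch; argument names may differ across library versions, and the optimizer settings are assumptions.

```python
import torch
from laplace import Laplace  # laplace-torch (Daxberger et al., 2021)


def tune_prior_precision(model, train_loader, n_steps=100):
    """Fit a Kronecker-factored Laplace approximation and optimize the prior
    precision by maximizing the log marginal likelihood (hedged sketch)."""
    la = Laplace(model, likelihood='classification',
                 subset_of_weights='all', hessian_structure='kron')
    la.fit(train_loader)

    log_prior_prec = torch.zeros(1, requires_grad=True)  # log-parameterized for positivity
    optimizer = torch.optim.Adam([log_prior_prec], lr=0.1)
    for _ in range(n_steps):
        optimizer.zero_grad()
        neg_marglik = -la.log_marginal_likelihood(log_prior_prec.exp())
        neg_marglik.backward()
        optimizer.step()
    return log_prior_prec.detach().exp()
```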
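
The Experiment Setup row describes cosine-decayed learning rates for both the network parameters and the hyperparameters. Below is a hedged configuration sketch of that schedule in PyTorch; the choice of Adam for both optimizers and the parameter grouping are assumptions, not a reproduction of the authors' training script.

```python
import math
import torch


def build_optimizers(model, log_prior_prec, invariance_params, n_epochs, burnin=10):
    """Hedged sketch of the quoted schedule: parameter learning rate 1e-3
    cosine-decayed to 1e-9 (the batch size of 250 is handled by the data
    loader); hyperparameter learning rates 0.1 (prior precision) and 0.05
    (invariances), each cosine-decayed by a factor of 10 and stepped once
    per epoch after a 10-epoch burn-in."""
    param_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    param_sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        param_opt, T_max=n_epochs, eta_min=1e-9)

    hyper_opt = torch.optim.Adam([
        {'params': [log_prior_prec], 'lr': 0.1},
        {'params': list(invariance_params), 'lr': 0.05},
    ])
    # Multiplicative cosine factor that decays every group's base lr to 1/10
    # of its initial value over the epochs remaining after burn-in.
    horizon = max(n_epochs - burnin, 1)

    def decay(epoch):
        return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * min(epoch, horizon) / horizon))

    hyper_sched = torch.optim.lr_scheduler.LambdaLR(hyper_opt, lr_lambda=decay)
    return param_opt, param_sched, hyper_opt, hyper_sched
```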