Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning

Authors: Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, Mohammad Emtiyaz Khan

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare our method to model selection by cross-validation and also to manually selected models. Depending on the problem size, we use different variants of the Laplace-GGN approximation. On smaller-scale UCI (Dua & Graff, 2017) and toy examples, we use the full GGN and EF determinants and the Kronecker-factored approximation. On larger problems, we use the Kronecker-factored and the diagonal Laplace-GGN and EF approximations. We benchmark our online model selection against cross-validation on standard image classification datasets (MNIST, FMNIST, CIFAR-10, CIFAR-100) and use the resulting marginal likelihood estimate to compare architectures. We compare fully-connected (MLP), convolutional (CNN), and residual (ResNet) networks. Our algorithm is robust to the selection of its hyperparameters (F, K, B, γ; see Sec. 3 and App. C.6). (A minimal Laplace-GGN marginal likelihood sketch follows the table.)
Researcher Affiliation | Academia | Department of Computer Science, ETH Zurich, Switzerland; Max Planck ETH Center for Learning Systems (CLS); Max Planck Institute for Intelligent Systems, Germany; University of Cambridge, UK; RIKEN Center for Advanced Intelligence Project, Japan
Pseudocode | Yes | Algorithm 1: Marginal likelihood based training. (A training-loop sketch of the algorithm follows the table.)
Open Source Code | No | The paper does not include any explicit statements about releasing code or links to a code repository for their method.
Open Datasets | Yes | On smaller-scale UCI (Dua & Graff, 2017) and toy examples, we use the full GGN and EF determinants and the Kronecker-factored approximation. We benchmark our online model selection against cross-validation on standard image classification datasets (MNIST, FMNIST, CIFAR-10, CIFAR-100) and use the resulting marginal likelihood estimate to compare architectures.
Dataset Splits | No | Marginal-likelihood based model-selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone.
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, memory details).
Software Dependencies | No | The paper mentions software components such as 'ADAM (Kingma & Ba, 2015)', 'SGD', 'batchnorm', and 'fixup', but does not provide specific version numbers for these or for any other libraries or packages.
Experiment Setup | Yes | We run Alg. 1 for two different architectures (1 and 3 hidden layers, 50 neurons per layer) with a step size of 0.01 for parameters and hyperparameters and recompute the marginal likelihood after each epoch (F = 1) with no burn-in (B = 0) for 1000 epochs with K = 1 hyperparameter updates per step. We run Alg. 1 for 10,000 epochs until convergence with the standard learning rate of ADAM for both hyperparameters and parameters, and set frequency F = 1, K = 1 hyperparameter gradient steps, and do not use burn-in. Here, we use our online model selection step every F = 10 epochs for K = 100 hyperparameter steps without burn-in and with step size γ = 1 to keep computational overhead low. We optimize the network parameters using ADAM except for the ResNet experiments, where we follow common practice and use SGD with momentum of 0.9. On the image classification datasets, we train for 300 epochs in total and decay the learning rate by a factor of 0.1 (He et al., 2016) after 150, 225, and 275 epochs, respectively. (A configuration sketch of this setup follows the table.)
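The Research Type row above refers to Laplace-GGN estimates of the log marginal likelihood. As a concrete illustration, here is a minimal sketch of a full-GGN Laplace estimate for a small regression MLP. The Gaussian-likelihood setting and the names `sigma2` (observation noise variance) and `delta` (prior precision) are our assumptions for illustration, not the paper's exact estimator:

```python
# Hypothetical sketch of a full-GGN Laplace log marginal likelihood:
#   log p(D) ~= log p(D | w*) + log p(w*) - 1/2 log det(H / 2*pi),
# where H is the GGN of the negative log joint at the MAP estimate w*.
import math
import torch

torch.manual_seed(0)
X = torch.randn(64, 1)
y = torch.sin(3 * X) + 0.1 * torch.randn(64, 1)

model = torch.nn.Sequential(
    torch.nn.Linear(1, 50), torch.nn.Tanh(), torch.nn.Linear(50, 1))
params = list(model.parameters())
n_params = sum(p.numel() for p in params)

sigma2 = 0.1 ** 2   # assumed observation noise variance (hyperparameter)
delta = 1.0         # assumed prior precision (hyperparameter)

# ... train model to a MAP estimate w* here (omitted for brevity) ...

def flat_jacobian(x):
    """Jacobian of the scalar network output at x w.r.t. all parameters."""
    out = model(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

# Full GGN of the negative log joint: H = (1/sigma2) * sum_n J_n^T J_n + delta * I
H = delta * torch.eye(n_params)
for n in range(X.shape[0]):
    J = flat_jacobian(X[n])
    H += torch.outer(J, J) / sigma2

with torch.no_grad():
    resid = y - model(X)
    log_lik = -0.5 * (resid.pow(2).sum() / sigma2
                      + X.shape[0] * math.log(2 * math.pi * sigma2))
    w = torch.cat([p.reshape(-1) for p in params])
    log_prior = (-0.5 * delta * w.pow(2).sum()
                 + 0.5 * n_params * math.log(delta / (2 * math.pi)))
    # Occam factor: -1/2 log det(H / 2*pi)
    log_marglik = log_lik + log_prior - 0.5 * torch.logdet(H / (2 * math.pi))

print(float(log_marglik))
```

The full GGN costs O(P^2) memory, which is why the row above notes that larger problems fall back to Kronecker-factored or diagonal approximations of the same determinant.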
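Algorithm 1 is only named in the Pseudocode row, but the hyperparameters quoted in the Experiment Setup row (frequency F, steps K, burn-in B, step size γ) imply an interleaved loop. The sketch below is one plausible reading under those assumptions; `log_marglik` stands in for a differentiable Laplace-GGN estimator like the one above, and all names are ours:

```python
# Hedged sketch of "Algorithm 1: Marginal likelihood based training": ordinary
# parameter updates, interleaved every F epochs (after B burn-in epochs) with
# K gradient steps of size gamma on the negative log marginal likelihood
# with respect to the hyperparameters.
import torch

def train_with_online_model_selection(model, loader, log_marglik,
                                      epochs=1000, F=1, K=1, B=0, gamma=0.01):
    # Optimize hyperparameters in log space so they stay positive.
    log_prior_prec = torch.zeros((), requires_grad=True)  # log prior precision
    log_sigma = torch.zeros((), requires_grad=True)       # log observation noise

    param_opt = torch.optim.Adam(model.parameters(), lr=0.01)
    hyper_opt = torch.optim.Adam([log_prior_prec, log_sigma], lr=gamma)

    for epoch in range(epochs):
        for x, y in loader:                       # ordinary parameter updates
            param_opt.zero_grad()
            nll = torch.nn.functional.mse_loss(model(x), y)
            # (a full MAP objective would add the prior term for the current
            #  precision; omitted here for brevity)
            nll.backward()
            param_opt.step()

        if epoch >= B and (epoch + 1) % F == 0:   # online model selection step
            for _ in range(K):
                hyper_opt.zero_grad()
                loss = -log_marglik(model, log_prior_prec.exp(), log_sigma.exp())
                loss.backward()
                hyper_opt.step()

    return log_prior_prec.exp().item(), log_sigma.exp().item()
```

The defaults above mirror the first quoted configuration (F = 1, K = 1, B = 0, step size 0.01); the returned hyperparameters are those selected by the marginal likelihood.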
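Finally, the ResNet schedule quoted in the Experiment Setup row maps directly onto a standard PyTorch optimizer/scheduler configuration. The initial learning rate of 0.1 below is an assumption (the excerpt does not state it); the momentum, epoch count, decay factor, milestones, and F/K/γ values follow the quoted numbers:

```python
# Plausible rendering of the quoted ResNet setup: SGD with momentum 0.9, 300
# epochs, learning rate decayed by 0.1 after epochs 150, 225, and 275, with the
# model selection step every F = 10 epochs for K = 100 hyperparameter steps of
# size gamma = 1. The initial learning rate (0.1) is assumed, not stated.
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # stand-in for the actual ResNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 225, 275], gamma=0.1)

F, K, gamma = 10, 100, 1.0
for epoch in range(300):
    # ... one epoch of SGD over CIFAR-10/100 minibatches goes here ...
    scheduler.step()
    if (epoch + 1) % F == 0:
        for _ in range(K):
            pass  # one step of size gamma on the Laplace-GGN log marginal likelihood
```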