Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning
Authors: Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, Mohammad Emtiyaz Khan
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our method to model selection by cross-validation and also to manually selected models. Depending on the problem size, we use different variants of the Laplace GGN approximation. On smaller scale UCI (Dua & Graff, 2017) and toy examples, we use the full GGN and EF determinants and the Kronecker-factored approximation. On larger problems, we use the Kronecker-factored and the diagonal Laplace-GGN and EF approximations. We benchmark our online model selection against cross-validation on standard image classification datasets (MNIST, FMNIST, CIFAR-10, CIFAR-100) and use the resulting marginal likelihood estimate to compare architectures. We compare fully-connected (MLP), convolutional (CNN), and residual (ResNet) networks. Our algorithm is robust to the selection of its hyperparameters (F, K, B, γ, see Sec. 3 and App. C.6). |
| Researcher Affiliation | Academia | Department of Computer Science, ETH Zurich, Switzerland; Max Planck ETH Center for Learning Systems (CLS); Max Planck Institute for Intelligent Systems, Germany; University of Cambridge, UK; RIKEN Center for Advanced Intelligence Project, Japan |
| Pseudocode | Yes | Algorithm 1 Marginal likelihood based training |
| Open Source Code | No | The paper does not include any explicit statements about releasing code or links to a code repository for their method. |
| Open Datasets | Yes | On smaller scale UCI (Dua & Graff, 2017) and toy examples, we use the full GGN and EF determinants and the Kronecker-factored approximation. We benchmark our online model selection against cross-validation on standard image classification datasets (MNIST, FMNIST, CIFAR-10, CIFAR-100) and use the resulting marginal likelihood estimate to compare architectures. |
| Dataset Splits | No | Marginal-likelihood based model-selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, memory details). |
| Software Dependencies | No | The paper mentions software components like 'ADAM (Kingma & Ba, 2015)', 'SGD', 'batchnorm', and 'fixup', but does not provide specific version numbers for any of these or other libraries/packages. |
| Experiment Setup | Yes | We run Alg. 1 for two different architectures (1 and 3 hidden layers, 50 neurons per layer) with a step size of 0.01 for parameters and hyperparameters and recompute the marginal likelihood after each epoch (F = 1) and with no burn-in (B = 0) for 1000 epochs with K = 1 hyperparameter updates per step. We run Alg. 1 for 10,000 epochs until convergence with the standard learning rate of ADAM for both hyperparameters and parameters, and set frequency F = 1, K = 1 hyperparameter gradient steps, and do not use burn-in. Here, we use our online model selection step every F = 10 epochs for K = 100 hyperparameter steps without burn-in and with step size γ = 1 to keep computational overhead low. We optimize the network parameters using ADAM except for the ResNet experiments, where we follow common practice and use SGD with momentum of 0.9. On the image classification datasets, we train for 300 epochs in total and decay the learning rate by a factor of 0.1 (He et al., 2016) after 150, 225, and 275 epochs, respectively. |
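The Laplace-GGN and EF variants quoted in the table differ only in how the curvature (Hessian approximation) is computed. As an illustration of the cheapest variant mentioned, here is a minimal sketch of the diagonal empirical Fisher; the function name and array layout are my own, not the paper's:

```python
import numpy as np

def diag_empirical_fisher(per_example_grads):
    """Diagonal empirical Fisher (EF): sum of squared per-example
    gradients of the log-likelihood. One of the Hessian surrogates
    used in the Laplace log-determinant (alongside the full,
    Kronecker-factored, and diagonal GGN variants)."""
    g = np.asarray(per_example_grads)  # shape (N, D): N examples, D parameters
    return np.sum(g ** 2, axis=0)      # shape (D,): per-parameter curvature
```

In practice the per-example gradients would come from backpropagation through the network; only the reduction over examples is shown here.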
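The marginal likelihood estimate referenced throughout is a Laplace approximation evaluated at a (approximate) MAP estimate. A minimal sketch with a diagonal curvature and a scalar isotropic Gaussian prior — both simplifying assumptions on my part; the paper also differentiates this quantity with respect to the hyperparameters:

```python
import numpy as np

def log_marginal_likelihood(log_lik, theta_map, prior_prec, curv_diag):
    """Laplace approximation to the log marginal likelihood:

        log p(D) ~= log p(D | theta*) + log p(theta*)
                    + (D/2) log(2 pi) - (1/2) log det(H + delta I)

    where H is a curvature approximation (GGN or EF) of the negative
    log-likelihood and delta is the prior precision. Diagonal case shown.
    """
    d = theta_map.size
    # log density of an isotropic Gaussian prior N(theta* | 0, prior_prec^-1 I)
    log_prior = (0.5 * d * np.log(prior_prec)
                 - 0.5 * d * np.log(2 * np.pi)
                 - 0.5 * prior_prec * np.sum(theta_map ** 2))
    # posterior precision: data curvature plus prior precision
    post_prec_diag = curv_diag + prior_prec
    log_det_post = np.sum(np.log(post_prec_diag))
    # Laplace (Occam-factor) correction around the MAP estimate
    return log_lik + log_prior + 0.5 * d * np.log(2 * np.pi) - 0.5 * log_det_post
```

With zero data curvature and theta* at the prior mode, the prior and Occam terms cancel and the estimate reduces to the log-likelihood, which is a quick sanity check on the bookkeeping.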
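The quoted hyperparameters slot into Alg. 1 roughly as follows: F is the epoch frequency of the model-selection step, K the number of hyperparameter gradient steps per selection step, B a burn-in in epochs, and γ the hyperparameter step size. A schematic sketch — `sgd_step` and `hyper_step` are illustrative placeholders, not the paper's API:

```python
def train_with_online_model_selection(model, data, epochs, F=1, K=1, B=0, gamma=0.01):
    """Sketch of marginal-likelihood based training (Alg. 1 in the paper).

    model.sgd_step(batch)  : one parameter update (e.g. ADAM or SGD w/ momentum)
    model.hyper_step(gamma): one gradient-ascent step on the Laplace-GGN
                             log marginal likelihood w.r.t. hyperparameters
    """
    for epoch in range(epochs):
        for batch in data:
            model.sgd_step(batch)          # usual parameter optimization
        # after B burn-in epochs, every F epochs: K hyperparameter updates
        if epoch >= B and (epoch - B) % F == 0:
            for _ in range(K):
                model.hyper_step(gamma)
    return model
```

For example, the ResNet setup quoted above corresponds to F = 10, K = 100, B = 0, γ = 1, i.e. a burst of hyperparameter steps every tenth epoch to keep overhead low.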