Variational Bayesian Last Layers
Authors: James Harrison, John Willes, Jasper Snoek
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally investigate VBLLs, and show that they improve predictive accuracy, calibration, and out of distribution detection over baselines across both regression and classification. |
| Researcher Affiliation | Collaboration | James Harrison (Google DeepMind), John Willes (Vector Institute), Jasper Snoek (Google DeepMind); jamesharrison@google.com, john.willes@vectorinstitute.ai, jsnoek@google.com |
| Pseudocode | Yes | A detailed procedure for training the regression model with point features is shown in Algorithm 1. |
| Open Source Code | Yes | We release an easy-to-use package providing efficient VBLL implementations in PyTorch. |
| Open Datasets | Yes | We investigate the performance of the regression VBLL models on UCI regression datasets (Dua & Graff, 2017)... To evaluate performance of VBLL models in classification, we train the discriminative (D-VBLL) and generative (G-VBLL) classification models on the CIFAR-10 and CIFAR-100 image classification tasks. |
| Dataset Splits | Yes | For all deterministic feature experiments, we ran 20 seeds. For each seed, we split the data into train/val/test sets (0.72/0.18/0.1 of the data respectively). (See the split sketch after the table.) |
| Hardware Specification | Yes | We compare these models on CIFAR-10, training on an NVIDIA T4 GPU. |
| Software Dependencies | No | The AdamW optimizer is used for all models. We release an easy-to-use package providing efficient VBLL implementations in PyTorch. scikit-learn (Pedregosa et al., 2011). The paper does not provide specific version numbers for software dependencies such as PyTorch or scikit-learn. |
| Experiment Setup | Yes | For VBLLs, we used a N(0, I) last layer mean prior and a W⁻¹(1, 1) noise covariance prior. For all experiments, we use the same MLP used in Watson et al. (2021), consisting of two layers of 50 hidden units each (not counting the last layer). For all datasets we matched Watson et al. (2021) and used a batch size of 32, other than the POWER dataset, for which we used a batch size of 256 to accelerate training. ... All results shown in the body of the paper are for leaky ReLU activations. For all experiments, a fixed learning rate of 0.001 was used with the AdamW optimizer (Loshchilov & Hutter, 2017). A default weight decay of 0.01 was used for all experiments. We clipped gradients with a max magnitude of 1.0. (See the configuration sketch after the table.) |
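
To make the quoted setup concrete, below is a minimal PyTorch sketch of the described configuration: a two-layer, 50-unit leaky-ReLU backbone with a variational Gaussian last layer, trained with AdamW (learning rate 0.001, weight decay 0.01) and gradient clipping at 1.0. It is an illustration under assumptions: the `VariationalLastLayer`, `make_backbone`, and `train` names are hypothetical, the head is a simple mean-field Gaussian with a KL term to the N(0, I) prior rather than the paper's deterministic VBLL objective (which also uses the W⁻¹(1, 1) noise covariance prior), and it does not use the released vbll package's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariationalLastLayer(nn.Module):
    """Mean-field Gaussian posterior over last-layer weights with a N(0, I) prior.

    Generic illustration only -- not the paper's exact deterministic VBLL
    objective, which also places an inverse-Wishart prior on the noise covariance.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.w_mean = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_logvar = nn.Parameter(torch.full((out_features, in_features), -5.0))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Reparameterized sample of the last-layer weights.
        w = self.w_mean + torch.randn_like(self.w_mean) * torch.exp(0.5 * self.w_logvar)
        return features @ w.t()

    def kl_to_prior(self) -> torch.Tensor:
        # KL( N(w_mean, diag(var)) || N(0, I) ), summed over all weights.
        var = torch.exp(self.w_logvar)
        return 0.5 * (var + self.w_mean ** 2 - 1.0 - self.w_logvar).sum()


def make_backbone(in_dim: int, hidden: int = 50) -> nn.Sequential:
    # Two hidden layers of 50 units with leaky ReLU activations, per the quoted setup.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.LeakyReLU(),
        nn.Linear(hidden, hidden), nn.LeakyReLU(),
    )


def train(backbone: nn.Module, head: VariationalLastLayer, loader, n_train: int, epochs: int = 100):
    # Quoted optimizer settings: AdamW, lr 1e-3, weight decay 0.01, clipping at 1.0.
    params = list(backbone.parameters()) + list(head.parameters())
    opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.01)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            pred = head(backbone(x))
            # ELBO-style objective: data fit plus the KL term scaled by dataset size.
            loss = F.mse_loss(pred, y) + head.kl_to_prior() / n_train
            loss.backward()
            torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # norm clipping assumed
            opt.step()
    return backbone, head
```

With a standard `DataLoader` yielding mini-batches of 32 (x, y) pairs, `train(make_backbone(in_dim), VariationalLastLayer(50, 1), loader, n_train)` reproduces the shape of the quoted training recipe, though not the paper's exact objective.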
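The dataset-splits row can be illustrated in the same spirit with scikit-learn (one of the cited dependencies): a 20-seed 0.72/0.18/0.10 train/val/test partition. The `seeded_splits` helper name is hypothetical, and the two-stage split is one reasonable way to realize the quoted fractions.

```python
from sklearn.model_selection import train_test_split


def seeded_splits(X, y, n_seeds: int = 20):
    """Yield per-seed (train, val, test) splits covering 0.72/0.18/0.10 of the data."""
    for seed in range(n_seeds):
        # Hold out 10% for test, then split the remaining 90% into 80/20 train/val,
        # which gives 0.72 / 0.18 / 0.10 of the full dataset.
        X_rest, X_test, y_rest, y_test = train_test_split(
            X, y, test_size=0.10, random_state=seed)
        X_train, X_val, y_train, y_val = train_test_split(
            X_rest, y_rest, test_size=0.20, random_state=seed)
        yield (X_train, y_train), (X_val, y_val), (X_test, y_test)
```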