Variational Bayesian Last Layers
Authors: James Harrison, John Willes, Jasper Snoek
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally investigate VBLLs, and show that they improve predictive accuracy, calibration, and out of distribution detection over baselines across both regression and classification. |
| Researcher Affiliation | Collaboration | James Harrison (Google DeepMind), John Willes (Vector Institute), Jasper Snoek (Google DeepMind); jamesharrison@google.com, john.willes@vectorinstitute.ai, jsnoek@google.com |
| Pseudocode | Yes | A detailed procedure for training the regression model with point features is shown in Algorithm 1. |
| Open Source Code | Yes | We release an easy-to-use package providing efficient VBLL implementations in PyTorch. |
| Open Datasets | Yes | We investigate the performance of the regression VBLL models on UCI regression datasets (Dua & Graff, 2017)... To evaluate performance of VBLL models in classification, we train the discriminative (D-VBLL) and generative (G-VBLL) classification models on the CIFAR-10 and CIFAR-100 image classification tasks. |
| Dataset Splits | Yes | For all deterministic feature experiments, we ran 20 seeds. For each seed, we split the data into train/val/test sets (0.72/0.18/0.1 of the data respectively). (See the split sketch after the table.) |
| Hardware Specification | Yes | We compare these models on CIFAR-10, training on an NVIDIA T4 GPU. |
| Software Dependencies | No | The AdamW optimizer is used for all models. We release an easy-to-use package providing efficient VBLL implementations in PyTorch. scikit-learn (Pedregosa et al., 2011). The paper does not provide specific version numbers for software dependencies such as PyTorch or scikit-learn. |
| Experiment Setup | Yes | For VBLLs, we used a N(0, I) last layer mean prior and a W⁻¹(1, 1) noise covariance prior. For all experiments, we use the same MLP used in Watson et al. (2021), consisting of two layers of 50 hidden units each (not counting the last layer). For all datasets we matched Watson et al. (2021) and used a batch size of 32, other than the POWER dataset, for which we used a batch size of 256 to accelerate training. ... All results shown in the body of the paper are for leaky ReLU activations. For all experiments, a fixed learning rate of 0.001 was used with the AdamW optimizer (Loshchilov & Hutter, 2017). A default weight decay of 0.01 was used for all experiments. We clipped gradients with a max magnitude of 1.0. (See the configuration sketch after the table.) |
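
To make the quoted setup concrete, below is a minimal PyTorch sketch of the described configuration: a two-layer, 50-unit leaky-ReLU backbone with a variational Gaussian last layer, trained with AdamW (learning rate 0.001, weight decay 0.01) and gradient clipping at 1.0. It is an illustration under assumptions: the `VariationalLastLayer`, `make_backbone`, and `train` names are hypothetical, the head is a simple mean-field Gaussian with a KL term to the N(0, I) prior rather than the paper's deterministic VBLL objective (which also uses the W⁻¹(1, 1) noise covariance prior), and it does not use the released vbll package's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariationalLastLayer(nn.Module):
    """Mean-field Gaussian posterior over last-layer weights with a N(0, I) prior.

    Generic illustration only -- not the paper's exact deterministic VBLL
    objective, which also places an inverse-Wishart prior on the noise covariance.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.w_mean = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_logvar = nn.Parameter(torch.full((out_features, in_features), -5.0))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Reparameterized sample of the last-layer weights.
        w = self.w_mean + torch.randn_like(self.w_mean) * torch.exp(0.5 * self.w_logvar)
        return features @ w.t()

    def kl_to_prior(self) -> torch.Tensor:
        # KL( N(w_mean, diag(var)) || N(0, I) ), summed over all weights.
        var = torch.exp(self.w_logvar)
        return 0.5 * (var + self.w_mean ** 2 - 1.0 - self.w_logvar).sum()


def make_backbone(in_dim: int, hidden: int = 50) -> nn.Sequential:
    # Two hidden layers of 50 units with leaky ReLU activations, per the quoted setup.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.LeakyReLU(),
        nn.Linear(hidden, hidden), nn.LeakyReLU(),
    )


def train(backbone: nn.Module, head: VariationalLastLayer, loader, n_train: int, epochs: int = 100):
    # Quoted optimizer settings: AdamW, lr 1e-3, weight decay 0.01, clipping at 1.0.
    params = list(backbone.parameters()) + list(head.parameters())
    opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.01)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            pred = head(backbone(x))
            # ELBO-style objective: data fit plus the KL term scaled by dataset size.
            loss = F.mse_loss(pred, y) + head.kl_to_prior() / n_train
            loss.backward()
            torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # norm clipping assumed
            opt.step()
    return backbone, head
```

With a standard `DataLoader` yielding mini-batches of 32 (x, y) pairs, `train(make_backbone(in_dim), VariationalLastLayer(50, 1), loader, n_train)` reproduces the shape of the quoted training recipe, though not the paper's exact objective.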
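The dataset-splits row can be illustrated in the same spirit with scikit-learn (one of the cited dependencies): a 20-seed 0.72/0.18/0.10 train/val/test partition. The `seeded_splits` helper name is hypothetical, and the two-stage split is one reasonable way to realize the quoted fractions.

```python
from sklearn.model_selection import train_test_split


def seeded_splits(X, y, n_seeds: int = 20):
    """Yield per-seed (train, val, test) splits covering 0.72/0.18/0.10 of the data."""
    for seed in range(n_seeds):
        # Hold out 10% for test, then split the remaining 90% into 80/20 train/val,
        # which gives 0.72 / 0.18 / 0.10 of the full dataset.
        X_rest, X_test, y_rest, y_test = train_test_split(
            X, y, test_size=0.10, random_state=seed)
        X_train, X_val, y_train, y_val = train_test_split(
            X_rest, y_rest, test_size=0.20, random_state=seed)
        yield (X_train, y_train), (X_val, y_val), (X_test, y_test)
```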