Bayesian Deep Ensembles via the Neural Tangent Kernel

Authors: Bobby He, Balaji Lakshminarayanan, Yee Whye Teh

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, using finite width NNs we demonstrate that our Bayesian deep ensembles faithfully emulate the analytic posterior predictive when available, and can outperform standard deep ensembles in various out-of-distribution settings, for both regression and classification tasks." (Section 4, Experiments)
Researcher Affiliation | Collaboration | Bobby He (Department of Statistics, University of Oxford, bobby.he@stats.ox.ac.uk); Balaji Lakshminarayanan (Google Research, Brain Team, balajiln@google.com); Yee Whye Teh (Department of Statistics, University of Oxford, y.w.teh@stats.ox.ac.uk)
Pseudocode | Yes | "Algorithm 1 NTKGP-param ensemble"
Open Source Code | Yes | "Code for this experiment is available at: https://github.com/bobby-he/bayesian-ntk."
Open Datasets | Yes | Flight Delays dataset [43], MNIST vs. NotMNIST, CIFAR-10 vs. SVHN
Dataset Splits | No | "In order to obtain probabilistic predictions, we temperature scale our trained ensemble predictions with cross-entropy loss on a held-out validation set" and "tuned using the validation accuracy on a small set of values around the He initialisation". No specific split percentages or counts are provided for the validation set (a generic temperature-scaling sketch follows the table).
Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, or cloud instances) are mentioned for running the experiments.
Software Dependencies | No | "init() will be standard parameterisation initialisation in the JAX library Neural Tangents [38] unless stated otherwise." No specific version numbers for JAX or Neural Tangents are provided.
Experiment Setup | Yes | "For each ensemble method, we use MLP baselearners with two hidden layers of width 512, and erf activation."; "The weight parameter initialisation variance σ²_W is tuned using the validation accuracy on a small set of values around the He initialisation, σ²_W = 2 [44], for all classification experiments."; "baselearners taking the Myrtle-10 CNN architecture [40] of channel-width 100." A sketch of the analytic NTK-GP posterior predictive for the MLP setup follows the table.
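
The "Software Dependencies" and "Experiment Setup" rows reference the JAX library Neural Tangents and MLP baselearners with two hidden layers of width 512 and erf activation. As a rough illustration of how the analytic NTK-GP posterior predictive (the quantity the paper's Bayesian deep ensembles are meant to emulate) can be computed for such an architecture, here is a minimal sketch using the public neural-tangents API. This is not the authors' code: the toy regression data, the diag_reg value, and the W_std/b_std choices are placeholder assumptions, and function signatures may differ slightly across neural-tangents versions.

```python
# Minimal sketch (not the authors' code): analytic NTK-GP posterior predictive
# for a 2-hidden-layer, width-512, erf MLP via the neural-tangents library.
# Toy data, diag_reg, and W_std/b_std values are illustrative assumptions.
import jax.numpy as jnp
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# Architecture matching the "Experiment Setup" row: two hidden layers of
# width 512 with erf activation (W_std here is an assumed He-like value).
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512, W_std=2.0 ** 0.5, b_std=0.05), stax.Erf(),
    stax.Dense(512, W_std=2.0 ** 0.5, b_std=0.05), stax.Erf(),
    stax.Dense(1, W_std=2.0 ** 0.5, b_std=0.05),
)

# Placeholder 1-D regression data (the paper's experiments use real datasets).
data_key, noise_key = random.split(random.PRNGKey(0))
x_train = random.uniform(data_key, (20, 1), minval=-1.0, maxval=1.0)
y_train = jnp.sin(3.0 * x_train) + 0.1 * random.normal(noise_key, (20, 1))
x_test = jnp.linspace(-1.5, 1.5, 100).reshape(-1, 1)

# Closed-form predictive of the infinitely wide network trained to convergence
# with gradient descent on MSE; get='ntk' selects the NTK-GP posterior
# (as opposed to the NNGP posterior, get='nngp').
predict_fn = nt.predict.gradient_descent_mse_ensemble(
    kernel_fn, x_train, y_train, diag_reg=1e-3)
ntk_mean, ntk_cov = predict_fn(x_test=x_test, get='ntk', compute_cov=True)

# Posterior mean and covariance over the test inputs.
print(ntk_mean.shape, ntk_cov.shape)
```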
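
The "Dataset Splits" row quotes the paper's use of temperature scaling on a held-out validation set for the classification experiments. The split sizes are not reported, so the snippet below is only a generic sketch of temperature scaling with cross-entropy on validation data; the logits, label count, and candidate temperature grid are made up, and no claim is made that this matches the authors' tuning procedure.

```python
# Generic temperature-scaling sketch (not the authors' code): pick the
# temperature T that minimises cross-entropy on held-out validation logits.
# Validation data, label count, and the candidate grid are illustrative.
import jax.numpy as jnp
from jax import nn, random

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of integer labels under softmax(logits)."""
    log_probs = nn.log_softmax(logits, axis=-1)
    return -jnp.mean(jnp.take_along_axis(log_probs, labels[:, None], axis=-1))

logit_key, label_key = random.split(random.PRNGKey(0))
val_logits = random.normal(logit_key, (256, 10))        # placeholder ensemble logits
val_labels = random.randint(label_key, (256,), 0, 10)   # placeholder labels

# Grid-search the temperature; scaled predictions are softmax(logits / T).
temperatures = jnp.linspace(0.5, 5.0, 46)
losses = jnp.array([cross_entropy(val_logits / t, val_labels) for t in temperatures])
best_t = temperatures[jnp.argmin(losses)]
print("selected temperature:", float(best_t))
```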