A Simple Baseline for Bayesian Uncertainty in Deep Learning

Authors: Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, Andrew Gordon Wilson

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, SGLD, and temperature scaling.
Researcher Affiliation | Collaboration | 1. New York University; 2. Samsung AI Center Moscow; 3. Samsung-HSE Laboratory, National Research University Higher School of Economics
Pseudocode | Yes | Algorithm 1: Bayesian Model Averaging with SWAG (a code sketch of this procedure follows the table)
Open Source Code | Yes | We release PyTorch code at https://github.com/wjmaddox/swa_gaussian.
Open Datasets | Yes | We conduct a thorough empirical evaluation of SWAG... on CIFAR-10, CIFAR-100 and ImageNet ILSVRC-2012 [45]. We next apply SWAG to an LSTM network on language modeling tasks on Penn Treebank and WikiText-2 datasets.
Dataset Splits | Yes | We report test and validation perplexities for different methods and datasets in Table 1.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or memory specifications) used for running its experiments.
Software Dependencies | No | The paper states 'For all the methods we use our implementations in PyTorch (see Appendix 8),' but it does not specify any version numbers for PyTorch or other software dependencies.
Experiment Setup | Yes | We train all networks for 300 epochs, starting to collect models for SWA and SWAG approximations once per epoch after epoch 160. For SWAG, K-FAC Laplace, and Dropout we use 30 samples at test time. Appendix 8.1 states: 'For all methods, we train models using SGD with momentum for 300 epochs, batch size of 128 and weight decay of 5e-4. We use a learning rate of 0.01 for the last 140 epochs of training.' (A training-schedule sketch follows the table.)
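
The Pseudocode row refers to the paper's Algorithm 1, Bayesian Model Averaging with SWAG. Below is a minimal PyTorch-style sketch of that procedure, not the authors' released implementation: it keeps a running mean and second moment of the weights plus a low-rank deviation matrix, and samples weights as theta_SWA + (1/sqrt(2)) * Sigma_diag^(1/2) * z1 + (1/sqrt(2(K-1))) * D * z2. The class name SWAG, the default max_rank, and the variance clamp are illustrative choices made here, not values quoted in the table.

```python
import torch


class SWAG:
    """Minimal SWAG sketch: running first/second weight moments plus low-rank deviations."""

    def __init__(self, num_params, max_rank=20):
        self.n = 0                               # number of snapshots collected so far
        self.max_rank = max_rank                 # K: deviation columns kept for the low-rank term
        self.mean = torch.zeros(num_params)      # running mean of the weights (theta_SWA)
        self.sq_mean = torch.zeros(num_params)   # running mean of the squared weights
        self.deviations = []                     # columns of D: (theta_i - running mean)

    def collect(self, theta):
        """Update the moments with one SGD iterate (e.g. once per epoch)."""
        self.mean = (self.n * self.mean + theta) / (self.n + 1)
        self.sq_mean = (self.n * self.sq_mean + theta ** 2) / (self.n + 1)
        self.n += 1
        self.deviations.append(theta - self.mean)
        if len(self.deviations) > self.max_rank:  # keep only the most recent K deviations
            self.deviations.pop(0)

    def sample(self):
        """Draw one weight vector from the SWAG Gaussian approximation."""
        diag_var = torch.clamp(self.sq_mean - self.mean ** 2, min=1e-30)
        z1 = torch.randn_like(self.mean)
        theta = self.mean + diag_var.sqrt() * z1 / (2.0 ** 0.5)
        D = torch.stack(self.deviations, dim=1)   # shape: (num_params, K)
        k = D.shape[1]
        z2 = torch.randn(k)
        theta = theta + (D @ z2) / ((2.0 * max(k - 1, 1)) ** 0.5)
        return theta


def bma_predict(model, swag, inputs, num_samples=30):
    """Bayesian model averaging: average softmax outputs over SWAG weight samples."""
    probs = 0.0
    with torch.no_grad():
        for _ in range(num_samples):
            torch.nn.utils.vector_to_parameters(swag.sample(), model.parameters())
            # In practice, batch-norm statistics are recomputed for each sampled model
            # (as in SWA); that step is omitted in this sketch.
            probs = probs + torch.softmax(model(inputs), dim=-1)
    return probs / num_samples
```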
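
The Experiment Setup row quotes the schedule for the image-classification experiments: SGD with momentum, 300 epochs, batch size 128, weight decay 5e-4, a learning rate of 0.01 over the last 140 epochs, snapshot collection once per epoch after epoch 160, and 30 samples at test time. The rough sketch below shows how those numbers fit together, reusing the SWAG helper from the previous sketch. `model`, `train_loader`, `initial_lr`, and `test_inputs` are placeholders, the momentum value of 0.9 is an assumption not given in the quote, and the paper's gradual decay from the initial rate down to 0.01 is simplified here to a single step.

```python
import torch

# Hyperparameters taken from the quoted setup; momentum value assumed.
optimizer = torch.optim.SGD(model.parameters(), lr=initial_lr,
                            momentum=0.9, weight_decay=5e-4)
swag = SWAG(num_params=sum(p.numel() for p in model.parameters()))

for epoch in range(300):                          # 300 epochs total
    lr = initial_lr if epoch < 160 else 0.01      # 0.01 for the last 140 epochs
    for group in optimizer.param_groups:
        group["lr"] = lr
    for x, y in train_loader:                     # batch size 128
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    if epoch >= 160:                              # collect once per epoch after epoch 160
        theta = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
        swag.collect(theta)

# Test time: average predictions over 30 SWAG samples, as reported for SWAG in the paper.
# probs = bma_predict(model, swag, test_inputs, num_samples=30)
```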