A Simple Baseline for Bayesian Uncertainty in Deep Learning
Authors: Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, Andrew Gordon Wilson
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, SGLD, and temperature scaling. |
| Researcher Affiliation | Collaboration | (1) New York University; (2) Samsung AI Center Moscow; (3) Samsung-HSE Laboratory, National Research University Higher School of Economics |
| Pseudocode | Yes | Algorithm 1 Bayesian Model Averaging with SWAG (a minimal sketch of this procedure appears after the table). |
| Open Source Code | Yes | We release PyTorch code at https://github.com/wjmaddox/swa_gaussian. |
| Open Datasets | Yes | We conduct a thorough empirical evaluation of SWAG... on CIFAR-10, CIFAR-100 and ImageNet ILSVRC-2012 [45]. We next apply SWAG to an LSTM network on language modeling tasks on Penn Treebank and WikiText-2 datasets. |
| Dataset Splits | Yes | We report test and validation perplexities for different methods and datasets in Table 1. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or memory specifications) used for running its experiments. |
| Software Dependencies | No | The paper states 'For all the methods we use our implementations in PyTorch (see Appendix 8),' but it does not specify any version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | We train all networks for 300 epochs, starting to collect models for SWA and SWAG approximations once per epoch after epoch 160. For SWAG, K-FAC Laplace, and Dropout we use 30 samples at test time. Appendix 8.1 states: 'For all methods, we train models using SGD with momentum for 300 epochs, batch size of 128 and weight decay of 5e-4. We use a learning rate of 0.01 for the last 140 epochs of training.' (A training-schedule sketch appears after the table.) |
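
The pseudocode row refers to Algorithm 1, Bayesian Model Averaging with SWAG. The sketch below illustrates the idea in PyTorch, restricted to the diagonal-covariance variant for brevity (the paper also maintains a low-rank deviation matrix) and omitting the batch-norm statistics update performed in the released code. The class `DiagonalSWAG` and its method names are illustrative, not the authors' API.

```python
import torch


class DiagonalSWAG:
    """Minimal sketch of SWAG with a diagonal covariance only (assumption:
    the full method also keeps a low-rank deviation matrix)."""

    def __init__(self, model):
        self.model = model
        self.n_collected = 0
        # Running first and second moments of the flattened weights.
        flat = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
        self.mean = torch.zeros_like(flat)
        self.sq_mean = torch.zeros_like(flat)

    def collect_model(self):
        # Called once per epoch after the SWA start epoch (epoch 160 in the paper).
        flat = torch.nn.utils.parameters_to_vector(self.model.parameters()).detach()
        n = self.n_collected
        self.mean = (self.mean * n + flat) / (n + 1)
        self.sq_mean = (self.sq_mean * n + flat ** 2) / (n + 1)
        self.n_collected += 1

    def sample(self):
        # Draw weights from N(mean, diag(sq_mean - mean^2)) and load them into the model.
        var = torch.clamp(self.sq_mean - self.mean ** 2, min=1e-30)
        sampled = self.mean + var.sqrt() * torch.randn_like(self.mean)
        torch.nn.utils.vector_to_parameters(sampled, self.model.parameters())

    @torch.no_grad()
    def bma_predict(self, x, n_samples=30):
        # Bayesian model averaging: average softmax predictions over posterior
        # samples (the paper uses 30 samples at test time). Note that sampling
        # overwrites the model's current weights; the released code also
        # recomputes batch-norm statistics after each sample, omitted here.
        probs = 0.0
        for _ in range(n_samples):
            self.sample()
            probs = probs + torch.softmax(self.model(x), dim=-1)
        return probs / n_samples
```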
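The experiment-setup row lists the training hyperparameters from Appendix 8.1. The loop below is a rough sketch of that schedule, assuming a generic `model` and a `train_loader` built with batch size 128 and using the `DiagonalSWAG` helper above. The initial learning rate of 0.1 and the behavior of the schedule before epoch 160 are assumptions; only the constant rate of 0.01 over the last 140 epochs, the 300-epoch budget, the weight decay of 5e-4, and the once-per-epoch collection after epoch 160 come from the paper.

```python
import torch


def train_with_swag(model, train_loader, swag, device="cpu"):
    # train_loader is assumed to use batch size 128, as in Appendix 8.1.
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
    )
    for epoch in range(300):
        # Constant learning rate of 0.01 for the last 140 epochs; the schedule
        # before epoch 160 is simplified to a constant 0.1 here (assumption).
        lr = 0.1 if epoch < 160 else 0.01
        for group in optimizer.param_groups:
            group["lr"] = lr

        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        if epoch >= 160:
            # Collect one SWAG snapshot per epoch after epoch 160.
            swag.collect_model()
```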