Bayesian Deep Learning and a Probabilistic Perspective of Generalization
Authors: Andrew G. Wilson, Pavel Izmailov
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose MultiSWAG, a method inspired by deep ensembles, which marginalizes within basins of attraction, achieving improved performance with a similar training time. We then investigate the properties of priors over functions induced by priors over the weights of neural networks, showing that they have reasonable inductive biases, and connect these results to tempering. We also show that the mysterious generalization properties recently presented in Zhang et al. [51] can be understood by reasoning about prior distributions over functions, and are not specific to neural networks. Indeed, we show Gaussian processes can also perfectly fit images with random labels, yet generalize on the noise-free problem. These results are a consequence of large support but reasonable inductive biases for common problem settings. We further show that while Bayesian neural networks can fit the noisy datasets, the marginal likelihood has much better support for the noise-free datasets, in line with Figure 2. We additionally show that the multimodal marginalization in MultiSWAG alleviates double descent, so as to achieve monotonic improvements in performance with model flexibility, in line with our perspective of generalization. MultiSWAG also provides significant improvements in both accuracy and NLL over SGD training and unimodal marginalization. (A minimal sketch of this multi-basin averaging appears after the table.) |
| Researcher Affiliation | Academia | Andrew Gordon Wilson, New York University; Pavel Izmailov, New York University |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | Yes | We provide code at https://github.com/izmailovpavel/understandingbdl. |
| Open Datasets | Yes | We next evaluate MultiSWAG under distribution shift on the CIFAR-10 dataset [21]... We also study the properties of the induced distribution over functions... on objects of different MNIST classes. |
| Dataset Splits | Yes | Next, we evaluate MultiSWAG under distribution shift on the CIFAR-10 dataset [21], replicating the setup in Ovadia et al. [38]. |
| Hardware Specification | No | No specific hardware details such as GPU models, CPU models, or memory specifications used for the experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions the use of the `hamiltorch` package [6] but does not provide specific version numbers for this or any other software dependencies. |
| Experiment Setup | Yes | In Figure 4 we show the negative log-likelihood as a function of the number of independently trained models for a Preactivation ResNet-20 on CIFAR-10 corrupted with Gaussian blur with varying levels of intensity... To test this hypothesis, we evaluate MultiSWAG, SWAG and standard SGD with ResNet-18 models of varying width, following Nakkiran et al. [33], measuring both error and negative log-likelihood (NLL). (A sketch of the NLL-versus-ensemble-size evaluation appears after the table.) |
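
The Research Type row describes MultiSWAG as marginalizing within several basins of attraction by combining independently trained SWAG approximations. Below is a minimal sketch of that multi-basin Bayesian model average; it assumes the posterior weight samples (e.g., a few draws from each fitted SWAG posterior) are already available as PyTorch state dicts, and the function name `multiswag_predict` is illustrative rather than the authors' API.

```python
import torch
import torch.nn.functional as F

def multiswag_predict(model, weight_samples, x):
    """Bayesian model average over weight samples drawn from several SWAG
    posteriors (one posterior per independently trained SGD solution),
    i.e. marginalization within multiple basins of attraction.

    weight_samples: list of state_dicts, e.g. 20 samples from each of 3 SWAG fits.
    x:              batch of test inputs.
    """
    probs = []
    with torch.no_grad():
        for state_dict in weight_samples:
            model.load_state_dict(state_dict)        # plug in one posterior sample
            probs.append(F.softmax(model(x), dim=-1))  # its predictive distribution
    return torch.stack(probs).mean(dim=0)            # averaged (BMA) predictive distribution
```

The averaged softmax outputs, rather than averaged logits, are what a Monte Carlo approximation to the posterior predictive calls for, which is why the sketch averages probabilities.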
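
The Experiment Setup row quotes the paper's evaluation of negative log-likelihood as a function of the number of independently trained models. The sketch below shows one way to compute such a curve from stored per-model softmax outputs; the function name and tensor layout are assumptions for illustration, not taken from the released code.

```python
import torch

def nll_vs_num_models(per_model_probs, labels):
    """NLL of the ensemble-averaged predictive distribution as the number of
    independently trained models grows.

    per_model_probs: tensor of shape (num_models, num_examples, num_classes)
                     holding each model's softmax outputs on the test set.
    labels:          tensor of shape (num_examples,) with integer class labels.
    """
    nlls = []
    for k in range(1, per_model_probs.shape[0] + 1):
        avg_probs = per_model_probs[:k].mean(dim=0)              # average first k models
        nll = -torch.log(avg_probs[torch.arange(len(labels)), labels]).mean()
        nlls.append(nll.item())
    return nlls
```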