Practical Deep Learning with Bayesian Principles

Authors: Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz Khan, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, Rio Yokota

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar performance in about the same number of epochs as the Adam optimiser, even on large datasets such as ImageNet. Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated, uncertainties on out-of-distribution data are improved, and continual-learning performance is boosted." and "4 Experiments: In this section, we present experiments on fitting several deep networks on CIFAR-10 and ImageNet."
Researcher Affiliation | Academia | 1 Tokyo Institute of Technology, Tokyo, Japan; 2 University of Cambridge, Cambridge, UK; 3 Indian Institute of Technology (ISM), Dhanbad, India; 4 University of Osnabrück, Osnabrück, Germany; 5 RIKEN Center for AI Project, Tokyo, Japan
Pseudocode | Yes | "Algorithm 1: Variational Online Gauss-Newton (VOGN)" and "Figure 2: A pseudo-code for our distributed VOGN algorithm is shown in Algorithm 1" (a hedged sketch of a VOGN-style update follows this table)
Open Source Code | Yes | "A PyTorch implementation is available as a plug-and-play optimiser." and "The code is available at https://github.com/team-approx-bayes/dl-with-bayes." (a usage sketch follows this table)
Open Datasets | Yes | "CIFAR-10 [28] contains 10 classes with 50,000 images for training and 10,000 images for validation. For ImageNet, we train with 1.28 million training examples and validate on 50,000 examples, classifying between 1,000 classes."
Dataset Splits | Yes | "CIFAR-10 [28] contains 10 classes with 50,000 images for training and 10,000 images for validation. For ImageNet, we train with 1.28 million training examples and validate on 50,000 examples, classifying between 1,000 classes."
Hardware Specification | Yes | "We used a large minibatch size M = 4,096 and parallelise them across 128 GPUs (NVIDIA Tesla P100)." (a distributed data-loading sketch follows this table)
Software Dependencies | No | The paper mentions a PyTorch implementation but does not provide specific version numbers for PyTorch or other software dependencies.
Experiment Setup | Yes | "Batch normalisation: BatchNorm layers are inserted between neural network layers. They help stabilise each layer's input distribution by normalising the running average of the inputs' mean and variance. In our VOGN implementation, we simply use the existing implementation with default hyperparameter settings." and "We set the effective dataset-size factor by considering the specific DA techniques used. When training on CIFAR-10, the random cropping DA step involves first padding the 32x32 images to become of size 40x40, and then taking randomly selected 28x28 cropped images. We consider this as effectively increasing the dataset size by a factor of 5 (4 images for each corner, and one central image). The horizontal flipping DA step doubles the dataset size (one dataset of unflipped images, one for flipped images). Combined, this gives a factor of 10." and "The full set of hyperparameters is in Appendix D." (an augmentation sketch follows this table)
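
The VOGN update referenced in Algorithm 1 maintains a mean and a diagonal curvature estimate and preconditions a sampled-weight gradient, much like Adam but with per-example squared gradients and a prior term. The NumPy sketch below is a simplified paraphrase of that structure, not a copy of Algorithm 1: the learning-rate and moving-average defaults, the scaling of the prior precision, and the effective dataset size are assumptions, and `per_example_grad_fn` is a hypothetical user-supplied function.

```python
# Minimal NumPy sketch of one diagonal VOGN-style update.
# Paraphrased structure only: exact scalings of the prior term, the moving
# averages, and n_data are assumptions, not a faithful copy of Algorithm 1.
import numpy as np

def vogn_step(mu, m, s, per_example_grad_fn, minibatch,
              lr=1e-3, beta1=0.9, beta2=0.999, prior_prec=1.0, n_data=50_000):
    delta = prior_prec / n_data                      # scaled prior precision
    sigma = 1.0 / np.sqrt(n_data * (s + delta))      # posterior std (diagonal)
    w = mu + sigma * np.random.randn(*mu.shape)      # sample weights from q
    grads = per_example_grad_fn(w, minibatch)        # shape: (M, num_params)
    g_hat = grads.mean(axis=0)                       # minibatch gradient
    h_hat = (grads ** 2).mean(axis=0)                # per-example squared grads
    m = beta1 * m + (1 - beta1) * (g_hat + delta * mu)   # momentum with prior
    s = (1 - beta2) * s + beta2 * h_hat                  # curvature estimate
    mu = mu - lr * m / (s + delta)                       # preconditioned step
    return mu, m, s
```

The learned mean and diagonal variance define the Gaussian posterior from which weights are sampled at prediction time to obtain the calibrated predictive probabilities the paper reports.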
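
Because the released implementation is described as a plug-and-play optimiser, swapping it into an existing PyTorch training loop should look roughly like the sketch below. The closure-based `optimizer.step(closure)` call is an assumption about how such an optimiser re-evaluates the loss on sampled weights, and any optimiser class name or constructor arguments are placeholders; consult https://github.com/team-approx-bayes/dl-with-bayes for the real interface.

```python
# Hedged sketch: using a VOGN-style optimiser in place of Adam in a standard
# PyTorch loop. The optimiser object is assumed to come from the released
# code; only the loop structure is standard PyTorch.
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)

        def closure():
            # Variational optimisers typically re-evaluate the loss on weights
            # sampled from the current posterior, hence the closure.
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), targets)
            loss.backward()
            return loss

        optimizer.step(closure)
```

At evaluation time, the well-calibrated predictions described in the abstract come from averaging softmax outputs over several Monte Carlo samples of the weights rather than from a single deterministic forward pass.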
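
The hardware row implies a data-parallel layout in which the global minibatch of M = 4,096 is split evenly over 128 workers, i.e. 32 examples per GPU. The sketch below shows that arithmetic in a standard torch.distributed setup; the NCCL backend, sampler, and loader options are assumptions, `train_dataset` is a placeholder, and the paper's distributed VOGN additionally synchronises its gradient statistics across workers, which is not shown here.

```python
# Hedged sketch of the data-parallel split implied by "M = 4,096 across 128
# GPUs": each worker sees global_batch / world_size = 32 examples per step.
# Backend, sampler, and loader options are assumptions; `train_dataset` is a
# placeholder for a dataset built as in the augmentation sketch below.
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")       # one process per GPU
world_size = dist.get_world_size()            # 128 in the ImageNet runs
global_batch = 4096
local_batch = global_batch // world_size      # 4096 // 128 = 32 per GPU

sampler = DistributedSampler(train_dataset)   # shards the data across workers
loader = DataLoader(train_dataset, batch_size=local_batch,
                    sampler=sampler, num_workers=4, pin_memory=True)
```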
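
The dataset and experiment-setup rows together pin down the CIFAR-10 splits (50,000 train / 10,000 validation) and the augmentation pipeline whose effective dataset-size factor is 10. The torchvision sketch below reproduces that pipeline and the factor arithmetic; the exact transform calls and the absence of normalisation are assumptions, and Appendix D of the paper holds the full hyperparameters.

```python
# Hedged sketch of the CIFAR-10 augmentation described above and the
# resulting effective dataset-size factor. Transform details are assumptions.
import torchvision
import torchvision.transforms as T

train_tf = T.Compose([
    T.Pad(4),                      # 32x32 -> 40x40
    T.RandomCrop(28),              # randomly selected 28x28 crop
    T.RandomHorizontalFlip(),      # doubles the effective dataset
    T.ToTensor(),
])
val_tf = T.Compose([T.ToTensor()])

train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=train_tf)
val_set = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=val_tf)

crop_factor, flip_factor = 5, 2           # 4 corners + centre; flipped/unflipped
da_factor = crop_factor * flip_factor     # = 10
effective_n = da_factor * len(train_set)  # 10 * 50,000 = 500,000 examples
```

The resulting effective dataset size of 500,000 is what the quote means by "effectively increasing the dataset size" through data augmentation.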