Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam

Authors: Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, Akash Srivastava

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results confirm this and further suggest that the weight-perturbation in our algorithm could be useful for exploration in reinforcement learning and stochastic optimization.
Researcher Affiliation | Academia | (1) RIKEN Center for Advanced Intelligence Project, Tokyo, Japan; (2) University of British Columbia, Vancouver, Canada; (3) University of Oxford, Oxford, UK; (4) University of Edinburgh, Edinburgh, UK.
Pseudocode | Yes | Figure 1. Comparison of Adam (left) and one of our proposed method Vadam (right). Adam performs maximum-likelihood estimation while Vadam performs variational inference, yet the two pseudocodes differ only slightly (differences highlighted in red).
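
The Figure 1 comparison quoted above is the core of the method: Vadam keeps Adam's moment estimates but evaluates gradients at Gaussian-perturbed weights and folds a prior-precision term into the update. Below is a minimal NumPy sketch of that idea as we read the caption; the exact handling of the prior term, bias correction, and minibatch scaling are assumptions and should be checked against Figure 1 of the paper and the released code at https://github.com/emtiyaz/vadam.

```python
# Minimal NumPy sketch of a Vadam-style step (weight perturbation + Adam-like
# update), contrasted with Adam only through the perturbation and prior terms.
# Hyperparameter placement (lam, N) is an assumption, not a verified transcription.
import numpy as np

def vadam_step(mu, m, s, t, grad_fn, alpha=1e-3, beta1=0.9, beta2=0.999,
               lam=1.0, N=1000, rng=np.random.default_rng(0)):
    """One Vadam-style iteration on the variational mean mu."""
    sigma = 1.0 / np.sqrt(N * (s + lam))                 # per-weight posterior std
    theta = mu + sigma * rng.standard_normal(mu.shape)   # weight perturbation
    g = grad_fn(theta)                                    # stochastic gradient at perturbed weights
    m = beta1 * m + (1 - beta1) * (g + lam * mu / N)      # first moment, with prior term
    s = beta2 * s + (1 - beta2) * g**2                    # second moment, as in Adam
    m_hat = m / (1 - beta1**t)                            # bias correction
    s_hat = s / (1 - beta2**t)
    mu = mu - alpha * m_hat / (np.sqrt(s_hat) + lam / N)  # Adam-like mean update
    return mu, m, s

# Toy usage on a quadratic objective, just to check the update runs.
mu, m, s = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 101):
    mu, m, s = vadam_step(mu, m, s, t, grad_fn=lambda th: 2 * (th - 1.0))
```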
Open Source Code | Yes | The code to reproduce our results is available at https://github.com/emtiyaz/vadam.
Open Datasets | Yes | We use three datasets: a toy dataset (N = 60, D = 2), USPS-3vs5 (N = 1781, D = 256) and Breast-Cancer (N = 683, D = 10). Details are in Appendix I. We show results on the standard UCI benchmark. We repeat the experimental setup used in Gal & Ghahramani (2016).
Dataset Splits | No | We use the 20 splits of the data provided by Gal & Ghahramani (2016) for training and testing. The paper mentions training and testing splits but does not explicitly detail a validation split or its methodology.
Hardware Specification | No | Finally, we are thankful for the RAIDEN computing system at the RIKEN Center for AI Project, which we extensively used for our experiments. While a computing system is mentioned, no specific hardware components such as GPU/CPU models or memory details are provided.
Software Dependencies | No | The paper mentions various methods and tools like the Adam optimizer, RMSprop, AdaGrad, and OpenAI Gym, but it does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Following their work, we use a neural network with one hidden layer, 50 hidden units, and ReLU activation functions. We use the 20 splits of the data provided by Gal & Ghahramani (2016) for training and testing. We use Bayesian optimization to select the prior precision λ and noise precision of the Gaussian likelihood. We consider the deep deterministic policy gradient (DDPG) method for the Half-Cheetah task using a two-layer neural network with 400 and 300 ReLU hidden units (Lillicrap et al., 2015).
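
To make the quoted architectures concrete, here is a minimal PyTorch sketch of the two networks mentioned: the single-hidden-layer, 50-unit ReLU regressor used in the UCI setup, and a 400/300-unit ReLU network of the kind used for DDPG on Half-Cheetah. The input and output dimensions, batch size, and any training details are placeholders, not values taken from the paper.

```python
# Sketch of the network shapes described in the Experiment Setup row.
# Dimensions below are illustrative placeholders.
import torch
import torch.nn as nn

def make_uci_net(input_dim: int) -> nn.Module:
    # One hidden layer with 50 ReLU units, scalar regression output.
    return nn.Sequential(
        nn.Linear(input_dim, 50),
        nn.ReLU(),
        nn.Linear(50, 1),
    )

def make_ddpg_net(input_dim: int, output_dim: int) -> nn.Module:
    # Two hidden layers with 400 and 300 ReLU units, as in the quoted DDPG setup.
    return nn.Sequential(
        nn.Linear(input_dim, 400),
        nn.ReLU(),
        nn.Linear(400, 300),
        nn.ReLU(),
        nn.Linear(300, output_dim),
    )

net = make_uci_net(input_dim=13)     # e.g., 13 features for the Boston Housing split
x = torch.randn(32, 13)              # placeholder batch
print(net(x).shape)                  # torch.Size([32, 1])
```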