Adversarial Distillation of Bayesian Neural Network Posteriors
Authors: Kuan-Chieh Wang, Paul Vicol, James Lucas, Li Gu, Roger Grosse, Richard Zemel
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a framework, Adversarial Posterior Distillation, to distill the SGLD samples using a Generative Adversarial Network (GAN). At test-time, samples are generated by the GAN. We show that this distillation framework incurs no loss in performance on recent BNN applications including anomaly detection, active learning, and defense against adversarial attacks. By construction, our framework distills not only the Bayesian predictive distribution, but the posterior itself. This allows one to compute quantities such as the approximate model variance, which is useful in downstream tasks. To our knowledge, these are the first results applying MCMC-based BNNs to the aforementioned applications. (Hedged sketches of this SGLD-to-GAN pipeline appear after the table.) |
| Researcher Affiliation | Academia | (1) University of Toronto, Toronto, Ontario, Canada; (2) Vector Institute, Toronto, Ontario, Canada. |
| Pseudocode | Yes | Algorithm 1 Offline APD; Algorithm 2 Online APD (a hedged sketch of the offline variant appears after the table) |
| Open Source Code | Yes | Implementation details can be found at https://github.com/wangkua1/apd_public |
| Open Datasets | Yes | We used MNIST for our classification and anomaly detection experiments. |
| Dataset Splits | Yes | We trained on 50,000 examples, and reserved 10,000 from the standard training set as a fixed validation set. |
| Hardware Specification | No | The paper does not specify any hardware components such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'foolbox library (Rauber et al., 2017)' but does not provide a specific version number for this or any other software dependency. |
| Experiment Setup | Yes | When training with SGD, we tuned the learning rate and weight decay on the validation set: we found the best values to be 0.05 and 0.001, respectively. (...) For SGLD, we did not use dropout, and the number of burn-in iterations and sampling interval were 500 and 20, respectively. The batch size for training was fixed at 100 for all methods. (...) We experimented with two fc NN architectures: fc NN1, with architecture 784-100-10 (79,510 parameters), and fc NN2, with architecture 784-400-400-10 (478,410 parameters). For APD, we used a 3-layer fc NN with 100 hidden units per layer for both our generator and discriminator, for all tasks. (The SGLD sketch after the table reuses these quoted hyperparameters.) |
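The quoted setup pins down the SGLD schedule (500 burn-in iterations, a sampling interval of 20, batch size 100) and the fc NN1 architecture (784-100-10, 79,510 parameters). Below is a minimal PyTorch sketch of such an SGLD sampling loop, not the authors' implementation; the learning rate, the Gaussian prior precision, and the MNIST flattening are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged SGLD sketch. Burn-in (500), sampling interval (20), and batch
# size (100) follow the quoted setup; the learning rate and prior
# precision are illustrative assumptions, not values from the paper.

def sgld_samples(model, loader, n_train, lr=1e-4, prior_prec=1.0,
                 burn_in=500, thin=20, n_samples=100):
    """Collect flattened weight vectors from an SGLD chain."""
    samples, step = [], 0
    while len(samples) < n_samples:
        for x, y in loader:
            x = x.view(x.size(0), -1)  # flatten MNIST digits for the fc net
            model.zero_grad()
            # Minibatch estimate of the full-data log-likelihood.
            log_lik = -F.cross_entropy(model(x), y, reduction="mean") * n_train
            # Gaussian prior: log p(theta) = -0.5 * prior_prec * ||theta||^2 + const.
            log_prior = -0.5 * prior_prec * sum((p ** 2).sum()
                                                for p in model.parameters())
            (log_lik + log_prior).backward()
            with torch.no_grad():
                for p in model.parameters():
                    # SGLD: ascend the log posterior, then inject Gaussian
                    # noise whose variance equals the step size.
                    p.add_(0.5 * lr * p.grad + lr ** 0.5 * torch.randn_like(p))
            step += 1
            # Keep one sample every `thin` steps once burn-in has passed.
            if step > burn_in and (step - burn_in) % thin == 0:
                samples.append(torch.cat([p.detach().flatten()
                                          for p in model.parameters()]))
            if len(samples) == n_samples:
                break
    return torch.stack(samples)  # shape: (n_samples, n_params)
```

With fc NN1 (784-100-10), each row of the returned tensor has the 79,510 entries quoted above.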
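The Pseudocode row names Algorithm 1 (Offline APD), which per the abstract quote first collects SGLD weight samples and then fits a GAN to them. The sketch below shows that offline pattern under stated assumptions: the 3-layer, 100-unit fc generator and discriminator follow the setup quote, but the non-saturating GAN loss, Adam optimizer, and latent dimension are guesses; the authors' actual objective is in the linked repository.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    # 3-layer fully connected network with 100 hidden units per layer,
    # matching the generator/discriminator description in the setup row.
    return nn.Sequential(
        nn.Linear(d_in, 100), nn.ReLU(),
        nn.Linear(100, 100), nn.ReLU(),
        nn.Linear(100, d_out),
    )

def distill_offline(theta_samples, z_dim=100, steps=10_000, batch=100, lr=2e-4):
    """Fit a GAN to a bank of flattened SGLD weight samples (Offline APD sketch).

    `theta_samples` is the (n_samples, n_params) tensor produced by the
    SGLD loop above. The non-saturating GAN loss used here is an
    assumption, not the paper's stated objective.
    """
    n, d = theta_samples.shape
    G, D = mlp(z_dim, d), mlp(d, 1)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()

    for _ in range(steps):
        real = theta_samples[torch.randint(n, (batch,))]
        fake = G(torch.randn(batch, z_dim))

        # Discriminator: real SGLD samples vs. generated parameter vectors.
        opt_d.zero_grad()
        loss_d = (bce(D(real), torch.ones(batch, 1)) +
                  bce(D(fake.detach()), torch.zeros(batch, 1)))
        loss_d.backward()
        opt_d.step()

        # Generator: produce parameter vectors the discriminator accepts.
        opt_g.zero_grad()
        loss_g = bce(D(fake), torch.ones(batch, 1))
        loss_g.backward()
        opt_g.step()
    return G
```

At test time, posterior samples are drawn as G(z) with z ~ N(0, I) and reshaped back into network weights, matching the abstract quote's "at test-time, samples are generated by the GAN".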
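Finally, the abstract quote notes that distilling the posterior itself allows computing "quantities such as the approximate model variance". A hedged Monte Carlo sketch of that computation follows; `net_fn` is a hypothetical helper (not from the paper) that reshapes a flat parameter vector into the classifier and returns softmax probabilities.

```python
import torch

def predictive_stats(G, x, net_fn, n_draws=100, z_dim=100):
    """Monte Carlo predictive mean and per-class variance from the distilled posterior.

    `net_fn(theta, x)` is an assumed helper: it loads the flat parameter
    vector `theta` into the classifier and returns softmax probabilities.
    """
    probs = torch.stack([net_fn(G(torch.randn(1, z_dim))[0], x)
                         for _ in range(n_draws)])
    # Mean over weight draws is the predictive distribution; the variance
    # over draws is the approximate model variance from the abstract quote.
    return probs.mean(0), probs.var(0)
```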