Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning
Authors: Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, Andrew Gordon Wilson
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide extensive experimental results to demonstrate the advantages of cSG-MCMC in sampling from multimodal distributions, including Bayesian neural networks and uncertainty estimation on several large and challenging datasets such as ImageNet. |
| Researcher Affiliation | Collaboration | Ruqi Zhang (Cornell University, rz297@cornell.edu); Chunyuan Li (Microsoft Research, Redmond, chunyl@microsoft.com); Jianyi Zhang (Duke University, jz318@duke.edu); Changyou Chen (University at Buffalo, SUNY, changyou@buffalo.edu); Andrew Gordon Wilson (New York University, andrewgw@cims.nyu.edu) |
| Pseudocode | Yes | Algorithm 1 Cyclical SG-MCMC. |
| Open Source Code | Yes | We release code at https://github.com/ruqizhang/csgmcmc. |
| Open Datasets | Yes | We demonstrate the effectiveness of cSG-MCMC on Bayesian neural networks for classification on CIFAR-10 and CIFAR-100. We consider Bayesian logistic regression (BLR) on three real-world datasets from the UCI repository: Australian (15 covariates, 690 data points), German (25 covariates, 1000 data points) and Heart (14 covariates, 270 data points). We further study different learning algorithms on a large-scale dataset, ImageNet. We train a three-layer MLP model on the standard MNIST train dataset until convergence using different algorithms, and estimate the entropy of the predictive distribution on the notMNIST dataset (Bulatov, 2011). |
| Dataset Splits | No | The paper mentions training and testing on datasets but does not specify a validation dataset split or provide exact percentages/counts for such splits. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running experiments, such as specific CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions methods and algorithms (e.g., SGLD, SGHMC, Snapshot) but does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x). |
| Experiment Setup | Yes | We set M = 4 and α0 = 0.5 for cSGLD, cSGHMC and Snapshot. The proportion hyper-parameter β = 0.8 and 0.94 for CIFAR-10 and CIFAR-100, respectively. We collect 3 samples per cycle. We use a ResNet-18 (He et al., 2016) and run all algorithms for 200 epochs. For the traditional SG-MCMC methods, we thus avoid noise injection for the first 150 epochs of training (corresponding to the zero temperature limit of SGLD and SGHMC), and resume SG-MCMC as usual (with noise) for the last 50 epochs. We collect 20 samples for the MCMC methods and average their predictions in testing. For both cSGLD and cSGHMC, M = 100, β = 0.01. For cSGLD, α0N = 1.2, 0.5, 1.5 for Australian, German and Heart respectively. For cSGHMC, α0N = 0.5, 0.3, 1.0 for Australian, German and Heart respectively. For SG-MCMC, the stepsize is a for the first 5000 iterations and then switches to the decay schedule (2) with b = 0, γ = 0.55. aN = 1.2, 0.5, 1.5 for Australian, German and Heart respectively for SGLD, and aN = 0.5, 0.3, 1.0 for Australian, German and Heart respectively for SGHMC. η = 0.5 in cSGHMC and SGHMC. For SG-MCMC, the stepsize decays from 0.1 to 0.001 for the first 150 epochs and then switches to the decay schedule (2) with a = 0.01, b = 0 and γ = 0.5005. η = 0.9 in cSGHMC, Snapshot-SGDM and SGHMC. For both cSG-MCMC and Snapshot, M = 4. β = 0.8 in cSG-MCMC. α0N = 0.01 and 0.008 for cSGLD and cSGHMC respectively. For SG-MCMC, the stepsize is a for the first 50 iterations and then switches to the decay schedule (2) with b = 0, γ = 0.5005. aN = 0.01 for SGLD and aN = 0.008 for SGHMC. η = 0.5 in cSGHMC, Snapshot-SGDM and SGHMC. |
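The experiment setup above is governed by the paper's cyclical cosine stepsize schedule, which restarts the stepsize at α0 at the beginning of each of the M cycles and anneals it toward zero within the cycle. A minimal Python sketch of that schedule, assuming k is the 1-indexed iteration and K the total number of iterations (the function name `cyclical_stepsize` is our own, not from the released code):

```python
import math

def cyclical_stepsize(k, K, M, alpha0):
    """Cyclical cosine stepsize from the cSG-MCMC paper:
    alpha_k = (alpha0 / 2) * [cos(pi * mod(k-1, ceil(K/M)) / ceil(K/M)) + 1].

    k      -- current iteration, 1-indexed
    K      -- total number of iterations
    M      -- number of cycles (e.g. M = 4 in the CIFAR experiments)
    alpha0 -- initial stepsize (e.g. alpha0 = 0.5)
    """
    cycle_len = math.ceil(K / M)
    # Position within the current cycle, in [0, 1); the cosine maps the
    # start of a cycle to alpha0 and the end of a cycle to ~0.
    frac = ((k - 1) % cycle_len) / cycle_len
    return (alpha0 / 2) * (math.cos(math.pi * frac) + 1)

# Example: M = 4 cycles of 50 steps each; the stepsize warm-restarts to
# alpha0 = 0.5 at k = 1, 51, 101, 151 and decays within each cycle.
schedule = [cyclical_stepsize(k, K=200, M=4, alpha0=0.5) for k in range(1, 201)]
```

With the paper's proportion hyper-parameter β, the first β fraction of each cycle would be treated as the exploration stage (plain SGD steps) and the remainder as the sampling stage where samples are collected; that split is omitted here for brevity.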