Stochastic Gradient MCMC with Stale Gradients

Authors: Changyou Chen, Nan Ding, Chunyuan Li, Yizhe Zhang, Lawrence Carin

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on synthetic data and deep neural networks validate our theory, demonstrating the effectiveness and scalability of SG-MCMC with stale gradients.
Researcher Affiliation | Collaboration | Dept. of Electrical and Computer Engineering, Duke University, Durham, NC, USA; Google Inc., Venice, CA, USA; {cc448,cl319,yz196,lcarin}@duke.edu; dingnan@google.com
Pseudocode | Yes | Algorithm 1: State update of SGHMC with the stale stochastic gradient ∇θŨτℓ(θ) (a sketch of this update appears below the table)
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for their specific S2G-MCMC implementation or a link to a code repository. It mentions using 'an MPI (message passing interface) extension of the popular Caffe package for deep learning [32]' but no specific code release for their contributions.
Open Datasets | Yes | We use the Adult dataset, a9a, with 32,561 training samples and 16,281 test samples. [...] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html [...] LeNet for MNIST: We modify the standard LeNet to a Bayesian setting for the MNIST dataset. [...] Cifar10-Quick net for CIFAR10 [...] The CIFAR-10 dataset consists of 60,000 color images of size 32×32 in 10 classes, with 50,000 for training and 10,000 for testing. (see the data-loading sketch below the table)
Dataset Splits | No | The paper provides training and test split sizes for the Adult dataset ('32,561 training samples and 16,281 test samples') and CIFAR-10 ('50,000 for training and 10,000 for testing'), but does not explicitly mention a validation set split or methodology for it.
Hardware Specification | Yes | The algorithm is run on a cluster of five machines. Each machine is equipped with eight 3.60GHz Intel(R) Core(TM) i7-4790 CPU cores.
Software Dependencies | No | The paper mentions 'Caffe' and 'MPICH library' as software used, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | In all these models, zero mean and unit variance Gaussian priors are employed for the weights to capture weight uncertainties, an effective way to deal with overfitting [33]. We vary the number of servers S among {1, 3, 5, 7}, and the number of workers for each server from 1 to 9. [...] For simplicity, we use the default parameter setting specified in Caffe, with the additional parameter B in SGHMC (Algorithm 1) set to (1 − m), where m is the momentum variable defined in the SGD algorithm in Caffe. (the B = 1 − m mapping is illustrated below the table)
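
The Algorithm 1 quoted in the Pseudocode row is the standard SGHMC state update with the fresh stochastic gradient replaced by a stale one, ∇θŨτℓ(θ), evaluated on a parameter copy from an earlier iteration. A minimal Python sketch, assuming a common SGHMC discretization with stepsize h and friction B (the paper's exact parameterization may differ):

```python
import numpy as np

def sghmc_stale_step(theta, v, stale_grad, h, B, rng):
    """One SGHMC state update driven by a stale stochastic gradient.

    `stale_grad` plays the role of the paper's stale gradient: it was
    computed on a parameter copy from an earlier iteration, but the
    update itself has the same form as ordinary SGHMC.
    """
    noise = rng.standard_normal(theta.shape) * np.sqrt(2.0 * B * h)
    v = v - h * stale_grad - h * B * v + noise  # momentum update with friction B
    theta = theta + h * v                       # position update
    return theta, v

# Toy usage with U(theta) = 0.5 * ||theta||^2, so grad U(theta) = theta.
rng = np.random.default_rng(0)
theta, v = np.zeros(2), np.zeros(2)
stale_grad = theta.copy()  # stand-in for a gradient delivered late by a worker
for _ in range(1000):
    theta, v = sghmc_stale_step(theta, v, stale_grad, h=0.01, B=0.1, rng=rng)
    stale_grad = theta.copy()  # in the paper, workers refresh this with a delay
```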
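
For the Adult (a9a) experiment, the data is available in LIBSVM format at the URL quoted in the Open Datasets row. A sketch of loading it, assuming the standard a9a/a9a.t file names from that page and scikit-learn's LIBSVM reader:

```python
from sklearn.datasets import load_svmlight_file

# Files assumed downloaded from
# http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
X_train, y_train = load_svmlight_file("a9a")  # 32,561 training samples
X_test, y_test = load_svmlight_file("a9a.t", n_features=X_train.shape[1])  # 16,281 test samples
```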
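
The Experiment Setup row ties SGHMC's additional parameter B to Caffe's SGD momentum m as B = 1 − m. A trivial illustration (the helper name is hypothetical):

```python
def friction_from_caffe_momentum(m: float) -> float:
    """Map Caffe's SGD momentum m to SGHMC's additional parameter B = 1 - m."""
    assert 0.0 <= m < 1.0, "momentum is expected in [0, 1)"
    return 1.0 - m

B = friction_from_caffe_momentum(0.9)  # Caffe's common default momentum 0.9 gives B = 0.1
```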