Bayesian Sampling Using Stochastic Gradient Thermostats

Authors: Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, Hartmut Neven

NeurIPS 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 5 compares our method with previous methods on synthetic and real-world machine learning applications. Figure 2: The samples on ρ(θ) and the mean kinetic energy over iterations K(p) with ξ = 1 (1st), ξ = 10 (2nd), ξ = 0.1 (3rd), and the SGNHT (4th). The first three do not use a thermostat. The fourth column shows that the SGNHT method is able to sample accurately and maintains the mean kinetic energy with unknown noise. In the following machine learning experiments, we used a reformulation of (5) and (6) similar to [5], by letting u = ph, η = h², α = ξh and a = Ah. The resulting Algorithm 2 is provided in Appendix F. In [5], SGHMC has been extensively compared with SGLD, SGD and SGD-momentum. Our experiments will focus on comparing SGHMC and SGNHT. Details of the experiment settings are described below. The test results over various parameters are reported in Figure 3. (A sketch of the reparameterized Algorithm 2 update appears below the table.)
Researcher Affiliation | Collaboration | Nan Ding (Google Inc., dingnan@google.com); Youhan Fang (Purdue University, yfang@cs.purdue.edu); Ryan Babbush (Google Inc., babbush@google.com); Changyou Chen (Duke University, cchangyou@gmail.com); Robert D. Skeel (Purdue University, skeel@cs.purdue.edu); Hartmut Neven (Google Inc., neven@google.com)
Pseudocode | Yes | Algorithm 1: Stochastic Gradient Nosé-Hoover Thermostat. Input: parameters h, A. Initialize θ(0) ∈ ℝⁿ, p(0) ∼ N(0, I), and ξ(0) = A; for t = 1, 2, . . . do: evaluate ∇Ũ(θ(t−1)) from (2); p(t) = p(t−1) − ξ(t−1) p(t−1) h − ∇Ũ(θ(t−1)) h + √(2A) N(0, h); θ(t) = θ(t−1) + p(t) h; ξ(t) = ξ(t−1) + (1/n · p(t)⊤p(t) − 1) h; end. (A runnable NumPy sketch of this algorithm appears below the table.)
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We first evaluate the benchmark MNIST dataset, using the Bayesian Neural Network (BNN) as in [5]. The Movielens ml-1m dataset and the Netflix dataset, using the Bayesian probabilistic matrix factorization (BPMF) model [21]. We evaluate our method on the ICML dataset using Latent Dirichlet Allocation [4].
Dataset Splits | Yes | The MNIST dataset contains 50,000 training examples, 10,000 validation examples, and 10,000 test examples. Each dataset is partitioned into training (80%) and testing (20%), and the training set is further partitioned for 5-fold cross validation. We used 80% of the documents for 5-fold cross validation and the remaining 20% for testing. (A sketch of this split-and-cross-validation protocol appears below the table.)
Hardware Specification | No | The paper describes experiments on various datasets but does not provide specific details on the hardware used, such as GPU or CPU models.
Software Dependencies | No | The paper does not specify the software libraries or version numbers used to implement the experiments.
Experiment Setup | Yes | To show our algorithm being able to handle the large stochastic gradient noise due to a small minibatch, we chose a minibatch size of 20. Each algorithm is run for a total of 50k iterations with burn-in of the first 10k iterations. The hidden layer size is 100, parameter a is from {0.001, 0.01} and η from {2, 4, 6, 8} × 10⁻⁷. The base number is chosen as 10, parameter a is from {0.01, 0.1} and η from {2, 4, 6, 8} × 10⁻⁷. Each minibatch contains 400 ratings for Movielens1M and 40k ratings for Netflix. Each algorithm is run for 100k iterations with burn-in of the first 20k iterations. The Dirichlet prior parameter for the topic distribution of each document is set to 0.1 and the Gaussian prior for θkw is set as N(0.1, 1). Each minibatch contains 100 documents. Each algorithm is run for 50k iterations with the first 10k iterations as burn-in. The topic number is 30, parameter a is from {0.01, 0.1} and η from {2, 4, 6, 8} × 10⁻⁵. (These settings are collected in a configuration sketch below the table.)
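
As referenced in the Pseudocode row, here is a minimal NumPy sketch of the quoted Algorithm 1 (SGNHT). It assumes a user-supplied callback stochastic_grad_U(theta) returning the minibatch estimate of ∇Ũ(θ) from equation (2); the function name, default parameter values, and the returned sample array are illustrative choices, not the paper's own implementation.

```python
import numpy as np

def sgnht(stochastic_grad_U, theta0, h=1e-3, A=1.0, num_iters=1000, rng=None):
    """Sketch of Algorithm 1: Stochastic Gradient Nose-Hoover Thermostat."""
    rng = np.random.default_rng() if rng is None else rng
    n = theta0.size
    theta = theta0.copy()
    p = rng.standard_normal(n)          # p(0) ~ N(0, I)
    xi = A                              # xi(0) = A
    samples = []
    for _ in range(num_iters):
        grad = stochastic_grad_U(theta)  # minibatch estimate of grad U(theta)
        # p(t) = p(t-1) - xi*p(t-1)*h - grad*h + sqrt(2A) * N(0, h)
        p = p - xi * p * h - grad * h + np.sqrt(2.0 * A * h) * rng.standard_normal(n)
        theta = theta + p * h            # theta(t) = theta(t-1) + p(t)*h
        xi = xi + (p @ p / n - 1.0) * h  # thermostat: xi tracks the kinetic energy
        samples.append(theta.copy())
    return np.array(samples)
```

Note that N(0, h) denotes Gaussian noise with variance h, which is why the sketch scales a standard normal draw by sqrt(2*A*h).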
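
The Research Type row quotes the substitution u = ph, η = h², α = ξh, a = Ah used to obtain Algorithm 2. Applying that substitution to the Algorithm 1 update gives the reparameterized step sketched below; this is my own rearrangement under the quoted substitution, not a transcription of Appendix F, and the function and argument names are assumptions.

```python
import numpy as np

def sgnht_step(theta, u, alpha, grad, eta, a, rng):
    """One SGNHT step in the u = p*h, eta = h**2, alpha = xi*h, a = A*h variables (sketch)."""
    n = theta.size
    # Multiplying the momentum update of Algorithm 1 by h turns sqrt(2A)*N(0, h)*h
    # into Gaussian noise with variance 2*a*eta.
    u = u - alpha * u - grad * eta + np.sqrt(2.0 * a * eta) * rng.standard_normal(n)
    theta = theta + u                    # position update absorbs the step size into u
    alpha = alpha + (u @ u / n - eta)    # thermostat update, scaled by h**2
    return theta, u, alpha
```

Absorbing the step size into u is the SGHMC-style parameterization of [5] that the quoted evidence says the experiments follow.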
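
For the Dataset Splits row, a minimal scikit-learn sketch of the described protocol: an 80%/20% train/test partition with 5-fold cross validation on the training portion. The placeholder index array and random seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

indices = np.arange(10_000)  # placeholder example/document indices
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, va) in enumerate(kf.split(train_idx)):
    fold_train, fold_val = train_idx[tr], train_idx[va]
    print(f"fold {fold}: {len(fold_train)} train / {len(fold_val)} validation examples")
```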
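
Finally, the grids quoted in the Experiment Setup row, collected into plain dictionaries for quick reference. The key names and grouping are my own; the values follow the quoted text.

```python
# Hyperparameter settings quoted from the paper's experiment setup (sketch).
bnn_mnist = {
    "minibatch_size": 20,
    "iterations": 50_000,
    "burn_in": 10_000,
    "hidden_units": 100,
    "a": [0.001, 0.01],
    "eta": [2e-7, 4e-7, 6e-7, 8e-7],
}
bpmf = {
    "base_number": 10,
    "a": [0.01, 0.1],
    "eta": [2e-7, 4e-7, 6e-7, 8e-7],
    "minibatch_ratings": {"Movielens1M": 400, "Netflix": 40_000},
    "iterations": 100_000,
    "burn_in": 20_000,
}
lda_icml = {
    "dirichlet_alpha": 0.1,
    "theta_kw_prior": ("Normal", 0.1, 1.0),
    "minibatch_docs": 100,
    "iterations": 50_000,
    "burn_in": 10_000,
    "topics": 30,
    "a": [0.01, 0.1],
    "eta": [2e-5, 4e-5, 6e-5, 8e-5],
}
```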