Bayesian Sampling Using Stochastic Gradient Thermostats
Authors: Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, Hartmut Neven
NeurIPS 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 compares our method with previous methods on synthetic and real-world machine learning applications. Figure 2: the samples on ρ(θ) and the mean kinetic energy over iterations K(p) with ξ = 1 (1st), ξ = 10 (2nd), ξ = 0.1 (3rd), and the SGNHT (4th); the first three do not use a thermostat, while the fourth column shows that the SGNHT method is able to sample accurately and maintain the mean kinetic energy with unknown noise. In the following machine learning experiments, we used a reformulation of (5) and (6) similar to [5], by letting u = p h, η = h², α = ξ h, and a = A h. The resulting Algorithm 2 is provided in Appendix F. In [5], SGHMC has been extensively compared with SGLD, SGD, and SGD-momentum. Our experiments will focus on comparing SGHMC and SGNHT. Details of the experiment settings are described below. The test results over various parameters are reported in Figure 3. (A sketch of the reformulated update appears after the table.) |
| Researcher Affiliation | Collaboration | Nan Ding (Google Inc., dingnan@google.com); Youhan Fang (Purdue University, yfang@cs.purdue.edu); Ryan Babbush (Google Inc., babbush@google.com); Changyou Chen (Duke University, cchangyou@gmail.com); Robert D. Skeel (Purdue University, skeel@cs.purdue.edu); Hartmut Neven (Google Inc., neven@google.com) |
| Pseudocode | Yes | Algorithm 1: Stochastic Gradient Nosé-Hoover Thermostat. Input: parameters h, A. Initialize θ^(0) ∈ ℝⁿ, p^(0) ∼ N(0, I), and ξ^(0) = A; for t = 1, 2, . . . do: evaluate ∇Ũ(θ^(t−1)) from (2); p^(t) = p^(t−1) − ξ^(t−1) p^(t−1) h − ∇Ũ(θ^(t−1)) h + √(2A) N(0, h); θ^(t) = θ^(t−1) + p^(t) h; ξ^(t) = ξ^(t−1) + (1/n · p^(t)⊤ p^(t) − 1) h; end. (A runnable sketch of this update appears after the table.) |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We first evaluate the benchmark MNIST dataset, using the Bayesian Neural Network (BNN) as in [5]. The Movielens ml-1m dataset and the Netflix dataset, using the Bayesian probabilistic matrix factorization (BPMF) model [21]. We evaluate our method on the ICML dataset using Latent Dirichlet Allocation [4]. |
| Dataset Splits | Yes | The MNIST dataset contains 50,000 training examples, 10,000 validation examples, and 10,000 test examples. Each dataset is partitioned into training (80%) and testing (20%), and the training set is further partitioned for 5-fold cross validation. We used 80% of the documents for 5-fold cross validation and the remaining 20% for testing. (A sketch of this split protocol appears after the table.) |
| Hardware Specification | No | The paper describes experiments on various datasets but does not provide specific details on the hardware used, such as GPU or CPU models. |
| Software Dependencies | No | The paper does not name the software framework used to implement the method, nor any library or version numbers; Appendix F gives only the reformulated update (Algorithm 2). |
| Experiment Setup | Yes | To show our algorithm being able to handle large stochastic gradient noise due to a small minibatch, we chose a minibatch of size 20. Each algorithm is run for a total of 50k iterations with burn-in of the first 10k iterations. The hidden layer size is 100, parameter a is from {0.001, 0.01}, and η is from {2, 4, 6, 8} × 10⁻⁷. The base number is chosen as 10, parameter a is from {0.01, 0.1}, and η is from {2, 4, 6, 8} × 10⁻⁷. Each minibatch contains 400 ratings for Movielens1M and 40k ratings for Netflix. Each algorithm is run for 100k iterations with burn-in of the first 20k iterations. The Dirichlet prior parameter for the topic distribution of each document is set to 0.1, and the Gaussian prior for θ_kw is set as N(0.1, 1). Each minibatch contains 100 documents. Each algorithm is run for 50k iterations with the first 10k iterations as burn-in. The topic number is 30, parameter a is from {0.01, 0.1}, and η is from {2, 4, 6, 8} × 10⁻⁵. (A sketch of the BNN hyperparameter grid appears after the table.) |
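
The Pseudocode row quotes the Algorithm 1 update verbatim. Below is a minimal NumPy sketch of that loop for reference; `h`, `A`, θ, p, and ξ follow the paper's notation, while the function name, the `grad_U_tilde` callable, and the sample-collection structure are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sgnht(grad_U_tilde, theta0, h=1e-3, A=1.0, num_iters=1000):
    """Sketch of SGNHT (Algorithm 1): the thermostat variable xi adaptively
    absorbs unknown stochastic-gradient noise so that the mean kinetic
    energy stays near its target value of 1."""
    n = theta0.size
    theta = theta0.copy()
    p = np.random.randn(n)            # p^(0) ~ N(0, I)
    xi = A                            # xi^(0) = A
    samples = []
    for _ in range(num_iters):
        g = grad_U_tilde(theta)       # stochastic gradient of U from a minibatch, eq. (2)
        # p^(t) = p^(t-1) - xi p^(t-1) h - grad(U~) h + sqrt(2A) N(0, h)
        p = p - xi * p * h - g * h + np.sqrt(2 * A) * np.sqrt(h) * np.random.randn(n)
        theta = theta + p * h
        # thermostat step: xi^(t) = xi^(t-1) + (p^T p / n - 1) h
        xi = xi + (p @ p / n - 1.0) * h
        samples.append(theta.copy())
    return samples
```

In use, `grad_U_tilde` would return the minibatch estimate of ∇U(θ); discarding an initial burn-in portion of `samples` mirrors the protocol quoted in the Experiment Setup row.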
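The Research Type row quotes the substitutions u = p h, η = h², α = ξ h, and a = A h used for the machine-learning experiments (Algorithm 2 of Appendix F). The sketch below is a reconstruction derived from those substitutions applied to Algorithm 1, not a copy of the appendix; the function name and defaults are assumptions.

```python
import numpy as np

def sgnht_reformulated(grad_U_tilde, theta0, eta=2e-7, a=0.01, num_iters=50_000):
    """Sketch of the reformulated SGNHT update with u = p*h, eta = h^2,
    alpha = xi*h, a = A*h (derived, assumed to match Algorithm 2)."""
    n = theta0.size
    theta = theta0.copy()
    u = np.sqrt(eta) * np.random.randn(n)   # u^(0) = p^(0) * h with p^(0) ~ N(0, I)
    alpha = a                                # alpha^(0) = xi^(0) * h = A * h = a
    for _ in range(num_iters):
        g = grad_U_tilde(theta)
        # u^(t) = u^(t-1) - alpha u^(t-1) - eta * grad(U~) + N(0, 2*a*eta)
        u = u - alpha * u - eta * g + np.sqrt(2 * a * eta) * np.random.randn(n)
        theta = theta + u
        # alpha^(t) = alpha^(t-1) + (u^T u / n - eta)
        alpha = alpha + (u @ u / n - eta)
    return theta
```

Here η plays the role of a per-step learning rate and a that of a momentum-decay/noise constant, which is why the Experiment Setup row sweeps both.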
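The Dataset Splits row describes an 80%/20% train/test partition with 5-fold cross validation on the training portion. The following is a minimal sketch of that protocol; the use of scikit-learn, the placeholder `ratings` array, and the fixed random seeds are assumptions for illustration, since the paper does not say how the splits were implemented.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Placeholder index array standing in for Movielens-1M / Netflix ratings or ICML documents.
ratings = np.arange(1_000_000)

# 80% training, 20% testing, as quoted above.
train, test = train_test_split(ratings, test_size=0.2, random_state=0)

# 5-fold cross validation on the training portion for hyperparameter selection.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (fit_idx, val_idx) in enumerate(kf.split(train)):
    fit_set, val_set = train[fit_idx], train[val_idx]
    # ... fit BPMF / LDA / BNN on fit_set, validate on val_set ...
```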
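The Experiment Setup row quotes the BNN/MNIST grid: a ∈ {0.001, 0.01}, η ∈ {2, 4, 6, 8} × 10⁻⁷, minibatch size 20, 50k iterations with 10k burn-in, hidden layer of 100 units. A small sketch of sweeping that grid is below; the dictionary layout and the hypothetical `run_sgnht_bnn` call are illustrative, not taken from the paper.

```python
from itertools import product

# Hyperparameter grid quoted for the BNN/MNIST experiment.
a_values = [0.001, 0.01]
eta_values = [2e-7, 4e-7, 6e-7, 8e-7]

base_settings = dict(minibatch_size=20, num_iters=50_000, burn_in=10_000, hidden_units=100)

for a, eta in product(a_values, eta_values):
    config = dict(base_settings, a=a, eta=eta)
    # run_sgnht_bnn(config)  # hypothetical training call, one run per (a, eta) pair
    print(config)
```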