High-Order Stochastic Gradient Thermostats for Bayesian Learning of Deep Models

Authors: Chunyuan Li, Changyou Chen, Kai Fan, Lawrence Carin

AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on two canonical models and their deep extensions demonstrate that the proposed scheme improves general Bayesian posterior sampling, particularly for deep models. |
| Researcher Affiliation | Academia | Chunyuan Li (1), Changyou Chen (1), Kai Fan (2), and Lawrence Carin (1); (1) Department of Electrical and Computer Engineering, Duke University; (2) Computational Biology and Bioinformatics, Duke University |
| Pseudocode | Yes | From the update equations, SSI performs almost as efficiently as the Euler integrator. Furthermore, the splitting scheme for (2) is not unique. However, all of the schemes can be shown to have the same order of accuracy. In the following subsection, we show quantitatively that the SSI is more accurate than the Euler integrator in terms of approximation errors. To get an impression of how the SSI works, we illustrate it with a simple synthetic experiment. (A sketch of one SSI update step appears after the table.) |
| Open Source Code | No | Appendix is at https://sites.google.com/site/chunyuan24. The provided link is for the appendix, not a code repository, and currently leads to a dead page. |
| Open Datasets | Yes | We first evaluate our method on the ICML dataset (Chen et al. 2015) using LDA. This dataset contains 765 documents from the abstracts of ICML proceedings from 2007 to 2011. ... We examine logistic regression (LR) on the a9a dataset (Lin, Weng, and Keerthi 2008). ... We evaluate FNN on the MNIST dataset for classification. ... We test the DPFA on a large dataset, Wikipedia, from which 10M randomly downloaded documents are used, using scripts provided in (Hoffman, Bach, and Blei 2010). |
| Dataset Splits | Yes | We used 80% of the documents (selected at random) for training and the remaining 20% for testing. ... The training and testing data consist of 32561 and 16281 data points, respectively. ... The data contains 60000 training examples and 10000 testing examples. ... Wikipedia, from which 10M randomly downloaded documents are used ... 1K documents are randomly selected for testing and validation, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details needed to replicate the experiments. |
| Experiment Setup | Yes | We use the Rectified Linear Unit (ReLU) (Glorot, Bordes, and Bengio 2011) as g_θℓ in each layer. The number of hidden units for each layer is 100, D is set to 5, stepsize h is set to 10^-4, and 40 epochs are used. To reduce bias (Chen, Ding, and Carin 2015), h is decreased by half at epoch 20. We test the FNNs with depth {2, 3, 4}, respectively. ... The minibatch size is set to 10, and the Gaussian prior on the parameters is N(0, 10). A thinning interval of 50 is used, with burn-in 300, and 3×10^3 total iterations. ... The Dirichlet prior parameter for the topic distribution of each document is set to 0.1. The number of topics is set to 30. We use perplexity (Blei, Ng, and Jordan 2003) to measure the quality of algorithms. To show the robustness of mSGNHT-S to stochastic gradient noise, we chose a minibatch of size 5, and D in mSGNHT-S is fixed as 0.75. ... The vocabulary size is 7702, and the minibatch size is set to 100, with one pass of the whole data in the experiments. We collect 300 posterior samples to calculate test perplexities, with a standard holdout technique. A three-layer DSBN is employed, with dimensions 128-64-32 (128 topics right above the data layer). Step sizes are chosen as 10^-4 and 10^-5, and parameter D = 40. (An illustrative grouping of the FNN/MNIST settings appears below the table.) |
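
The Pseudocode row refers to the paper's symmetric splitting integrator (SSI) for the multivariate stochastic gradient Nosé-Hoover thermostat (mSGNHT). Below is a minimal NumPy sketch of one such update, assuming a symmetric splitting in which half steps of the position/thermostat and momentum-decay parts wrap a full stochastic-gradient step; the function name, sub-step grouping, and the toy noisy gradient in the usage lines are illustrative assumptions, not the authors' released code or their exact algorithm listing.

```python
import numpy as np

def msgnht_s_step(theta, p, xi, grad_tilde, h, D, rng):
    """One mSGNHT update with a symmetric splitting integrator (illustrative sketch).

    theta, p, xi : parameter, momentum, and per-dimension thermostat vectors
    grad_tilde   : callable returning a stochastic estimate of grad U(theta)
    h, D         : step size and injected-noise (diffusion) constant
    """
    # Half step: advance position and thermostat
    theta = theta + 0.5 * h * p
    xi = xi + 0.5 * h * (p * p - 1.0)
    # Half step: momentum decay by the thermostat, solved in closed form
    p = np.exp(-0.5 * h * xi) * p
    # Full step: stochastic-gradient force plus injected Gaussian noise
    p = p - h * grad_tilde(theta) + np.sqrt(2.0 * D * h) * rng.standard_normal(p.shape)
    # Half step: momentum decay again
    p = np.exp(-0.5 * h * xi) * p
    # Half step: position and thermostat again
    theta = theta + 0.5 * h * p
    xi = xi + 0.5 * h * (p * p - 1.0)
    return theta, p, xi

# Toy usage: sample from N(0, 1), i.e. U(theta) = theta^2 / 2, with a noisy gradient.
rng = np.random.default_rng(0)
noisy_grad = lambda th: th + 0.1 * rng.standard_normal(th.shape)
theta, p, xi = np.zeros(1), np.zeros(1), np.ones(1)
samples = []
for _ in range(5000):
    theta, p, xi = msgnht_s_step(theta, p, xi, noisy_grad, h=1e-2, D=1.0, rng=rng)
    samples.append(theta.copy())
```

A first-order Euler/SGNHT step would instead apply the gradient, friction, and noise in a single momentum update followed by full-step updates of theta and xi; the symmetric arrangement sketched above is the kind of scheme the paper argues is more accurate at essentially the same cost.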
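
The Experiment Setup row mixes settings from several experiments; for the FNN-on-MNIST portion, the quoted values can be collected as below. The key names and grouping are mine; only the values come from the excerpt.

```python
# Hypothetical grouping of the FNN / MNIST hyperparameters quoted above;
# key names are illustrative, values are taken from the excerpt.
fnn_mnist_setup = {
    "activation": "ReLU",             # Glorot, Bordes, and Bengio 2011
    "hidden_units_per_layer": 100,
    "depths_tested": [2, 3, 4],
    "D": 5,                           # thermostat diffusion constant
    "step_size_h": 1e-4,
    "epochs": 40,
    "halve_step_size_at_epoch": 20,   # to reduce bias (Chen, Ding, and Carin 2015)
}
```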