A Variational Analysis of Stochastic Gradient Algorithms

Authors: Stephan Mandt, Matthew Hoffman, David Blei

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4. Experiments. We test our theoretical assumptions in section 4.1 and find good experimental evidence that they are correct. In this section, we compare against other approximate inference algorithms. In section 4.2 we show that constant SGD lets us optimize hyperparameters in a Bayesian model. ... Table 1: KL divergences between the posterior and stationary sampling distributions applied to the data sets discussed in Section 4.1. We compared constant SGD without preconditioning and with diagonal (-d) and full rank (-f) preconditioning against Stochastic Gradient Langevin Dynamics and Stochastic Gradient Fisher Scoring (SGFS) with diagonal (-d) and full rank (-f) preconditioning, and BBVI. (A hedged sketch of a Gaussian-to-Gaussian KL computation appears after the table.)
Researcher Affiliation | Collaboration | Stephan Mandt (SM3976@COLUMBIA.EDU), Columbia University, Data Science Institute, New York, USA; Matthew D. Hoffman (MATHOFFM@ADOBE.COM), Adobe Research, San Francisco, USA; David M. Blei (DAVID.BLEI@COLUMBIA.EDU), Columbia University, Departments of CS and Statistics, New York, USA
Pseudocode | No | The paper describes algorithms and updates (e.g., Eq. 5, Eq. 23) but does not provide any structured pseudocode or algorithm blocks (e.g., an "Algorithm 1" box). (A minimal sketch of such a constant-step-size update follows the table.)
Open Source Code | No | The paper does not contain any explicit statement about releasing code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | Data. We first considered the following data sets. (1) The Wine Quality Data Set, ... (2) A data set of Protein Tertiary Structure, ... (3) The Skin Segmentation Data Set, ... We applied linear regression on data sets 1 and 2 and applied logistic regression on data set 3. ... Data. In all experiments, we applied this model to the MNIST dataset (60000 training examples, 10000 test examples, 784 features) and the cover type dataset (500000 training examples, 81012 testing examples, 54 features).
Dataset Splits | No | The paper mentions 'training examples' and 'test examples' for the MNIST and cover type datasets, and it references 'validation loss' in Figure 3 and its discussion, implying the use of a validation set. However, it does not provide specific details of the validation split (e.g., exact percentages or sample counts) for any of the datasets used, nor does it cite predefined validation splits.
Hardware Specification | No | The paper discusses the algorithms and experiments performed but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run these experiments.
Software Dependencies | No | The paper does not provide specific software dependency details, such as library or solver names with their corresponding version numbers (e.g., 'Python 3.8', 'PyTorch 1.9').
Experiment Setup | Yes | The quadratic regularizer was 1. The constant learning rate was adjusted according to Eq. 17. We rescaled the features to unit length and used mini-batches of size S = 100, S = 100 and S = 10000, respectively. For SG Fisher Scoring, we set the learning rate to ϵ of Eq. 17, while for Langevin dynamics we chose the largest rate that yielded stable results (ϵ = {10⁻³, 10⁻⁶, 10⁻⁵} for data sets 1, 2 and 3, respectively). (A hedged configuration sketch follows the table.)
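
A note on the missing pseudocode: the update the paper analyzes (its Eq. 5) is a plain constant-learning-rate SGD iteration. The sketch below is ours, not the paper's; `grad_fn`, the step-size value, and the other names are placeholders, and the paper's own learning-rate rule (Eq. 17) is deliberately not reproduced here.

```python
import numpy as np

def constant_sgd(grad_fn, theta0, data, epsilon=0.01, batch_size=100, num_steps=1000, seed=0):
    """Minimal constant-step-size SGD: theta <- theta - epsilon * g_S(theta).

    grad_fn(theta, batch) should return the mini-batch gradient of the (regularized) loss.
    epsilon is a placeholder constant; the paper derives a specific choice (its Eq. 17),
    which is not reproduced here.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    iterates = []
    n = len(data)
    for _ in range(num_steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # draw a mini-batch of size S
        theta = theta - epsilon * grad_fn(theta, data[idx])   # constant-step SGD update (Eq. 5-style)
        iterates.append(theta.copy())
    return np.array(iterates)  # late iterates approximate samples from the stationary distribution
```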
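
Table 1 (quoted in the Research Type row) reports KL divergences between the posterior and the stationary sampling distributions. The quoted text does not spell out how those divergences were evaluated; the sketch below is one standard way to do it if both distributions are summarized as multivariate Gaussians, using the closed-form Gaussian KL divergence. It is an illustration under that assumption, not the paper's evaluation code.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) via the standard closed form:
    0.5 * [ tr(cov1^-1 cov0) + (mu1 - mu0)^T cov1^-1 (mu1 - mu0) - d + ln(det cov1 / det cov0) ]
    """
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff - d + logdet1 - logdet0)

# Hypothetical usage: fit a Gaussian to late SGD iterates and compare it with a
# Gaussian approximation of the posterior (mu_post, cov_post are assumed given).
#   samples = constant_sgd(...)[-5000:]
#   kl = gaussian_kl(samples.mean(axis=0), np.cov(samples.T), mu_post, cov_post)
```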
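
The Experiment Setup row quotes the regularizer, mini-batch sizes, and Langevin learning rates. The configuration sketch below restates those quoted values in code; the dataset keys and the column-wise interpretation of "rescaled the features to unit length" are our assumptions, and the paper's learning-rate formula (Eq. 17) is again left as a placeholder.

```python
import numpy as np

L2_REGULARIZER = 1.0  # "The quadratic regularizer was 1."

# Quoted per-dataset settings; the dictionary keys are our shorthand, not the paper's.
DATASETS = {
    "wine_quality": {"batch_size": 100,   "sgld_rate": 1e-3},
    "protein":      {"batch_size": 100,   "sgld_rate": 1e-6},
    "skin":         {"batch_size": 10000, "sgld_rate": 1e-5},
}

def rescale_features(X):
    """Rescale features to unit length.

    Interpreted here as normalizing each feature column to unit Euclidean norm;
    the quoted text does not say whether rows or columns were normalized.
    """
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # guard against all-zero columns
    return X / norms
```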