Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference
Authors: Geoffrey Roeder, Yuhuai Wu, David K. Duvenaud
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of this trick through experimental results on MNIST and Omniglot datasets using variational and importance-weighted autoencoders. [Section 6, Experimental Setup] Because we follow the experimental setup of Burda et al. [2015], we review it briefly here. Both benchmark datasets are composed of 28 × 28 binarized images. The MNIST dataset was split into 60,000 training and 10,000 test examples. The Omniglot dataset was split into 24,345 training and 8,070 test examples. |
| Researcher Affiliation | Academia | Geoffrey Roeder University of Toronto roeder@cs.toronto.edu Yuhuai Wu University of Toronto ywu@cs.toronto.edu David Duvenaud University of Toronto duvenaud@cs.toronto.edu |
| Pseudocode | Yes | Alg. 1 Standard ELBO Gradient, Alg. 2 Path Derivative ELBO Gradient, Alg. 3 Path Derivative Mixture ELBO Gradient, Alg. 4 IWAE ELBO Gradient (see the path-derivative sketch after this table) |
| Open Source Code | Yes | See https://github.com/geoffroeder/iwae |
| Open Datasets | Yes | Both benchmark datasets are composed of 28 × 28 binarized images. The MNIST dataset was split into 60,000 training and 10,000 test examples. The Omniglot dataset was split into 24,345 training and 8,070 test examples. MNIST, a dataset of handwritten digits [Le Cun et al., 1998], and Omniglot, a dataset of handwritten characters from many different alphabets [Lake, 2014]. |
| Dataset Splits | No | The paper specifies training and test splits for MNIST (60,000 training, 10,000 test) and Omniglot (24,345 training, 8070 test) but does not explicitly mention a separate validation dataset split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like TensorFlow, Theano, Autograd, and Torch by name and publication year, but does not specify concrete version numbers for these or other libraries used in the experiments. |
| Experiment Setup | Yes | Each model used Xavier initialization [Glorot and Bengio, 2010] and trained using Adam with parameters β1 = 0.9, β2 = 0.999, and ϵ = 1e-4, with 20 observations per minibatch [Kingma and Ba, 2015]. We compared against both architectures reported in Burda et al. [2015]. The first has one stochastic layer with 50 hidden units, encoded using two fully-connected layers of 200 neurons each, using a tanh nonlinearity throughout. The second architecture has two stochastic layers: the first stochastic layer encodes the observations, with two fully-connected layers of 200 hidden units each, into 100-dimensional outputs. The output is used as the parameters of a diagonal Gaussian. The second layer takes samples from this Gaussian and passes them through two fully-connected layers of 100 hidden units each into 50 dimensions. (A configuration sketch follows the table.) |
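
The pseudocode row contrasts the standard ELBO gradient (Alg. 1) with the path-derivative estimator (Alg. 2). The sketch below illustrates that contrast in PyTorch; it is not the authors' released code, and the `TinyVAE` module, its layer sizes, and the Bernoulli decoder are illustrative assumptions. The only difference between the two estimators is the `detach()` on the variational parameters inside log q, which drops the score-function term from the gradient.

```python
# Minimal sketch, assuming PyTorch; not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Illustrative one-stochastic-layer VAE (hypothetical sizes: 784-200-50)."""
    def __init__(self, x_dim=784, h_dim=200, z_dim=50):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_log_sigma = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x, path_derivative=True):
        """One-sample ELBO; its gradient follows Alg. 2 if path_derivative, else Alg. 1."""
        h = self.enc(x)
        mu, log_sigma = self.enc_mu(h), self.enc_log_sigma(h)
        z = mu + torch.randn_like(mu) * log_sigma.exp()   # reparameterized sample

        if path_derivative:
            # "Stick the landing": stop gradients through the variational
            # parameters inside log q, removing the score-function term.
            mu_q, log_sigma_q = mu.detach(), log_sigma.detach()
        else:
            # Standard total-derivative estimator.
            mu_q, log_sigma_q = mu, log_sigma

        log_q = torch.distributions.Normal(mu_q, log_sigma_q.exp()).log_prob(z).sum(-1)
        log_p_z = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
        log_p_x = -F.binary_cross_entropy_with_logits(self.dec(z), x,
                                                      reduction="none").sum(-1)
        return (log_p_x + log_p_z - log_q).mean()
```

Training minimizes `-model.elbo(x)`; the forward value of the ELBO is identical under both settings, since the stop-gradient only changes the backward pass.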
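
For the experiment-setup row, the following sketch shows how the quoted initialization and optimizer settings could be expressed in the same PyTorch setting, reusing the hypothetical `TinyVAE` above. The excerpt does not state a learning rate, so the framework default is left unchanged.

```python
# Sketch of the quoted training configuration (assumes the TinyVAE sketch above).
import torch
import torch.nn as nn

model = TinyVAE(x_dim=784, h_dim=200, z_dim=50)   # one-stochastic-layer architecture

# Xavier (Glorot) initialization for every fully-connected layer.
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

# Adam with beta1 = 0.9, beta2 = 0.999, eps = 1e-4; 20 observations per minibatch.
# (Learning rate is not given in the excerpt, so the default is kept.)
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.999), eps=1e-4)
batch_size = 20
```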