Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference
Authors: Geoffrey Roeder, Yuhuai Wu, David K. Duvenaud
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of this trick through experimental results on MNIST and Omniglot datasets using variational and importance-weighted autoencoders. [Section 6, Experimental Setup] Because we follow the experimental setup of Burda et al. [2015], we review it briefly here. Both benchmark datasets are composed of 28 × 28 binarized images. The MNIST dataset was split into 60,000 training and 10,000 test examples. The Omniglot dataset was split into 24,345 training and 8,070 test examples. |
| Researcher Affiliation | Academia | Geoffrey Roeder University of Toronto roeder@cs.toronto.edu Yuhuai Wu University of Toronto ywu@cs.toronto.edu David Duvenaud University of Toronto duvenaud@cs.toronto.edu |
| Pseudocode | Yes | Alg. 1 Standard ELBO Gradient, Alg. 2 Path Derivative ELBO Gradient, Alg. 3 Path Derivative Mixture ELBO Gradient, Alg. 4 IWAE ELBO Gradient (see the path-derivative sketch after this table) |
| Open Source Code | Yes | See https://github.com/geoffroeder/iwae |
| Open Datasets | Yes | Both benchmark datasets are composed of 28 × 28 binarized images. The MNIST dataset was split into 60,000 training and 10,000 test examples. The Omniglot dataset was split into 24,345 training and 8,070 test examples. MNIST, a dataset of handwritten digits [Le Cun et al., 1998], and Omniglot, a dataset of handwritten characters from many different alphabets [Lake, 2014]. |
| Dataset Splits | No | The paper specifies training and test splits for MNIST (60,000 training, 10,000 test) and Omniglot (24,345 training, 8070 test) but does not explicitly mention a separate validation dataset split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like TensorFlow, Theano, Autograd, and Torch by name and publication year, but does not specify concrete version numbers for these or other libraries used in the experiments. |
| Experiment Setup | Yes | Each model used Xavier initialization [Glorot and Bengio, 2010] and trained using Adam with parameters β1 = 0.9, β2 = 0.999, and ϵ = 1e-4, with 20 observations per minibatch [Kingma and Ba, 2015]. We compared against both architectures reported in Burda et al. [2015]. The first has one stochastic layer with 50 hidden units, encoded using two fully-connected layers of 200 neurons each, using a tanh nonlinearity throughout. The second architecture has two stochastic layers: the first stochastic layer encodes the observations, with two fully-connected layers of 200 hidden units each, into 100-dimensional outputs. The output is used as the parameters of a diagonal Gaussian. The second layer takes samples from this Gaussian and passes them through two fully-connected layers of 100 hidden units each into 50 dimensions. (A configuration sketch follows the table.) |
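
The pseudocode row contrasts the standard ELBO gradient (Alg. 1) with the path-derivative estimator (Alg. 2). The sketch below illustrates that contrast in PyTorch; it is not the authors' released code, and the `TinyVAE` module, its layer sizes, and the Bernoulli decoder are illustrative assumptions. The only difference between the two estimators is the `detach()` on the variational parameters inside log q, which drops the score-function term from the gradient.

```python
# Minimal sketch, assuming PyTorch; not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Illustrative one-stochastic-layer VAE (hypothetical sizes: 784-200-50)."""
    def __init__(self, x_dim=784, h_dim=200, z_dim=50):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_log_sigma = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x, path_derivative=True):
        """One-sample ELBO; its gradient follows Alg. 2 if path_derivative, else Alg. 1."""
        h = self.enc(x)
        mu, log_sigma = self.enc_mu(h), self.enc_log_sigma(h)
        z = mu + torch.randn_like(mu) * log_sigma.exp()   # reparameterized sample

        if path_derivative:
            # "Stick the landing": stop gradients through the variational
            # parameters inside log q, removing the score-function term.
            mu_q, log_sigma_q = mu.detach(), log_sigma.detach()
        else:
            # Standard total-derivative estimator.
            mu_q, log_sigma_q = mu, log_sigma

        log_q = torch.distributions.Normal(mu_q, log_sigma_q.exp()).log_prob(z).sum(-1)
        log_p_z = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
        log_p_x = -F.binary_cross_entropy_with_logits(self.dec(z), x,
                                                      reduction="none").sum(-1)
        return (log_p_x + log_p_z - log_q).mean()
```

Training minimizes `-model.elbo(x)`; the forward value of the ELBO is identical under both settings, since the stop-gradient only changes the backward pass.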
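
For the experiment-setup row, the following sketch shows how the quoted initialization and optimizer settings could be expressed in the same PyTorch setting, reusing the hypothetical `TinyVAE` above. The excerpt does not state a learning rate, so the framework default is left unchanged.

```python
# Sketch of the quoted training configuration (assumes the TinyVAE sketch above).
import torch
import torch.nn as nn

model = TinyVAE(x_dim=784, h_dim=200, z_dim=50)   # one-stochastic-layer architecture

# Xavier (Glorot) initialization for every fully-connected layer.
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

# Adam with beta1 = 0.9, beta2 = 0.999, eps = 1e-4; 20 observations per minibatch.
# (Learning rate is not given in the excerpt, so the default is kept.)
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.999), eps=1e-4)
batch_size = 20
```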