Stick-Breaking Variational Autoencoders

Authors: Eric Nalisnick, Padhraic Smyth

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally demonstrate that the SB-VAE, and a semi-supervised variant, learn highly discriminative latent representations that often outperform the Gaussian VAE's. We analyze the behavior of the three parametrizations of the SB-VAE and examine how they compare to the Gaussian VAE. We performed unsupervised and semi-supervised tasks on the following image datasets: Frey Faces, MNIST, MNIST+rot, and Street View House Numbers (SVHN).
Researcher Affiliation | Academia | Eric Nalisnick, Department of Computer Science, University of California, Irvine (enalisni@uci.edu); Padhraic Smyth, Department of Computer Science, University of California, Irvine (smyth@ics.uci.edu)
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper; the methodology is described using mathematical formulas and descriptive text.
Open Source Code | Yes | "Complete implementation and optimization details can be found in the appendix and code repository." Theano implementations are available at https://github.com/enalisnick/stick-breaking_dgms.
Open Datasets | Yes | We performed unsupervised and semi-supervised tasks on the following image datasets: Frey Faces (http://www.cs.nyu.edu/~roweis/data.html), MNIST, MNIST+rot (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations), and Street View House Numbers (SVHN, http://ufldl.stanford.edu/housenumbers/).
Dataset Splits | Yes | Frey Faces was divided into {1500, 100, 300} train/validation/test splits, MNIST into {45000, 5000, 10000}, MNIST+rot into {70000, 10000, 20000}, and SVHN into {65000, 8257, 26032}.
Hardware Specification | Yes | All experiments were run on AWS G2.2XL instances.
Software Dependencies | No | The paper mentions Theano implementations and Adam (Kingma & Ba, 2014) but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | All models were trained with minibatches of size 100, using Adam (Kingma & Ba, 2014) to set the gradient-descent step size; for Adam, α = 0.0003, β1 = 0.95, and β2 = 0.999 in all experiments. Early stopping was used during semi-supervised training with a look-ahead threshold of 30 epochs. For the MNIST datasets λ = 0.375, and for SVHN λ = 0.45. All model architectures used ReLUs exclusively for hidden-unit activations. The dimensionality / truncation level of the latent variables was set to 50 for every experiment except Frey Faces. All weights were initialized by drawing from N(0, 0.001·I), and biases were set to zero. No regularization (dropout, weight decay, etc.) was used, and only one sample was used for each Monte Carlo expectation. The leading ten terms were used to compute the infinite sum in the KL divergence between the Beta and the Kumaraswamy (a sketch of this appears after the table).
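The reported setup leans on two pieces that are easy to get wrong when re-implementing: the stick-breaking construction of the latent weights from Kumaraswamy-distributed stick fractions, and the Kumaraswamy-to-Beta KL term whose infinite series is truncated to the leading ten terms. The NumPy sketch below illustrates both under a truncation level of K = 50. It is not the authors' Theano code; the function names, the concentration value alpha0 = 5.0, and the stand-in variational parameters are illustrative assumptions.

```python
# Minimal NumPy sketch of (1) sampling stick-breaking weights from Kumaraswamy
# stick fractions via the inverse CDF, and (2) KL(Kumaraswamy(a,b) || Beta(alpha,beta))
# with its infinite sum truncated to the leading ten terms, as in the paper.
import numpy as np
from scipy.special import digamma, betaln

EULER_GAMMA = np.euler_gamma  # Euler-Mascheroni constant


def sample_stick_breaking(a, b, rng=None):
    """Draw stick-breaking weights pi from Kumaraswamy(a, b) stick fractions.

    a, b: arrays of shape (K-1,) with the Kumaraswamy parameters of the first
    K-1 stick fractions; the final stick absorbs the remaining mass so the K
    weights sum to one (truncation level K = latent dimensionality).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    u = rng.uniform(size=np.shape(a))
    # Inverse CDF of the Kumaraswamy distribution (differentiable in a and b,
    # which is what makes the reparametrization trick applicable).
    v = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)])
    pi = np.concatenate([v, [1.0]]) * remaining
    return pi  # shape (K,), non-negative, sums to 1


def kl_kumaraswamy_beta(a, b, alpha, beta, n_terms=10):
    """Elementwise KL(Kumaraswamy(a, b) || Beta(alpha, beta)).

    E_q[log(1 - v)] has no closed form; its Taylor expansion is an infinite
    sum, truncated here to the leading `n_terms` terms (ten in the paper).
    """
    kl = ((a - alpha) / a) * (-EULER_GAMMA - digamma(b) - 1.0 / b)
    kl += np.log(a * b) + betaln(alpha, beta) - (b - 1.0) / b
    # Truncated series: sum_{m=1}^{inf} B(m/a, b) / (m + a*b)
    series = sum(np.exp(betaln(m / a, b)) / (m + a * b)
                 for m in range(1, n_terms + 1))
    kl += (beta - 1.0) * b * series
    return kl


if __name__ == "__main__":
    K = 50                       # truncation level used in most experiments
    a = np.full(K - 1, 1.5)      # stand-in variational parameters
    b = np.full(K - 1, 3.0)
    pi = sample_stick_breaking(a, b)
    print(pi.sum())              # ~1.0
    # The prior on stick fractions is Beta(1, alpha0) (a GEM prior);
    # alpha0 = 5.0 here is an illustrative value, not the paper's setting.
    print(kl_kumaraswamy_beta(a, b, alpha=1.0, beta=5.0).sum())
```

The inverse-CDF sampler is the reason the Kumaraswamy is used in place of the Beta: it gives a closed-form, differentiable reparametrization of the stick fractions, so gradients can flow through the sampling step just as in the Gaussian VAE.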