Lagging Inference Networks and Posterior Collapse in Variational Autoencoders

Authors: Junxian He, Daniel Spokoyny, Graham Neubig, Taylor Berg-Kirkpatrick

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, our approach outperforms strong autoregressive baselines on text and image benchmarks in terms of held-out likelihood, and is competitive with more complex techniques for avoiding collapse while being substantially faster." "Our experiments below are designed to (1) examine whether the proposed method indeed prevents posterior collapse, (2) test its efficacy with respect to maximizing predictive log-likelihood compared to other existing approaches, and (3) test its training efficiency."
Researcher Affiliation | Collaboration | Junxian He, Daniel Spokoyny, Graham Neubig (Carnegie Mellon University; {junxianh,dspokoyn,gneubig}@cs.cmu.edu); Taylor Berg-Kirkpatrick (University of California San Diego; tberg@eng.ucsd.edu)
Pseudocode | Yes | "Algorithm 1: VAE training with controlled aggressive inference network optimization." (A minimal sketch of this training loop is given after the table.)
Open Source Code | Yes | "Code and data are available at https://github.com/jxhe/vae-lagging-encoder."
Open Datasets | Yes | "We evaluate our method on density estimation for text on the Yahoo and Yelp corpora (Yang et al., 2017) and images on OMNIGLOT (Lake et al., 2015)." "We generated a dataset with 20,000 examples (train/val/test is 16000/2000/2000) each of length 10 from a vocabulary of size 1000." "We use the same train/val/test splits as provided by Kim et al. (2018)."
Dataset Splits | Yes | "We generated a dataset with 20,000 examples (train/val/test is 16000/2000/2000) each of length 10 from a vocabulary of size 1000." "We use the same train/val/test splits as provided by Kim et al. (2018)." (A sketch of a split of these sizes appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments.
Software Dependencies | No | The paper mentions the SGD and Adam optimizers and model types such as LSTM and PixelCNN, but it does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers.
Experiment Setup | Yes | "For all experiments we use a Gaussian prior N(0, I) and the inference network parametrizes a diagonal Gaussian." "We use 32-dimensional z and optimize ELBO objective with SGD for text and Adam (Kingma & Ba, 2015) for images." "Following Kim et al. (2018), we use a single-layer LSTM with 1024 hidden units and 512-dimensional word embeddings as the encoder and decoder for all of text models." "We use the SGD optimizer and start with a learning rate of 1.0 and decay it by a factor of 2 if the validation loss has not improved in 2 epochs and terminate training once the learning rate has decayed a total of 5 times." "We use the Adam optimizer and start with a learning rate of 0.001 and decay it by a factor of 2 if the validation loss has not improved in 20 epochs." "We run all models with 5 different random restarts, and report mean and standard deviation." "Full details of the setup are in Appendix B.2 and B.3." (A sketch of the learning-rate schedule is given after the table.)
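
The Pseudocode row points to Algorithm 1, the controlled aggressive inference network optimization schedule. Below is a minimal sketch of that schedule, not the authors' implementation: the toy MLP encoder/decoder, the batch sizes, the learning rate, the fixed inner-step budget, and the use of the mean KL term as a stand-in for the paper's mutual-information stopping criterion are all assumptions made here for illustration.

```python
# Hedged sketch of Algorithm 1 (aggressive inference-network training) on a toy
# Gaussian VAE. Model sizes, learning rates, the inner-step budget, and the
# KL-based stopping proxy are illustrative choices, not the paper's settings.
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    """Gaussian-prior VAE with MLP encoder/decoder standing in for the paper's
    LSTM / PixelCNN models."""
    def __init__(self, x_dim=20, z_dim=32, h_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterization
        recon = -((self.dec(z) - x) ** 2).sum(-1)                   # Gaussian log-likelihood (up to constants)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)  # KL(q(z|x) || N(0, I))
        return (recon - kl).mean(), kl.mean()

def train(model, train_batches, val_batches, epochs=20, inner_steps=30, lr=0.1):
    enc_opt = torch.optim.SGD(model.enc.parameters(), lr=lr)
    dec_opt = torch.optim.SGD(model.dec.parameters(), lr=lr)
    aggressive, best_proxy = True, float("-inf")
    for _ in range(epochs):
        for x in train_batches:
            if aggressive:
                # Aggressive phase: update only the inference network until the
                # ELBO (approximately) converges; a fixed step budget replaces
                # the convergence check here.
                for _ in range(inner_steps):
                    loss, _ = model.elbo(x)
                    enc_opt.zero_grad()
                    (-loss).backward()
                    enc_opt.step()
            # One generator update; once the aggressive phase ends, this becomes
            # standard joint VAE training of encoder and decoder.
            loss, _ = model.elbo(x)
            enc_opt.zero_grad()
            dec_opt.zero_grad()
            (-loss).backward()
            dec_opt.step()
            if not aggressive:
                enc_opt.step()
        if aggressive:
            # End the aggressive phase once a mutual-information proxy on the
            # validation set stops improving (the mean KL term is a crude
            # stand-in for the paper's Monte Carlo MI estimate).
            with torch.no_grad():
                proxy = sum(model.elbo(x)[1].item() for x in val_batches) / len(val_batches)
            if proxy <= best_proxy:
                aggressive = False
            best_proxy = max(best_proxy, proxy)
    return model

# Toy usage with random vectors standing in for a real corpus.
train_batches = list(torch.randn(256, 20).split(32))
val_batches = list(torch.randn(64, 20).split(32))
trained = train(ToyVAE(), train_batches, val_batches)
```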
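
The Experiment Setup row describes a validation-driven schedule for the text models: SGD starting at a learning rate of 1.0, halved whenever validation loss has not improved for 2 epochs, with training terminated after 5 decays. The sketch below follows that description under stated assumptions: `train_one_epoch` and `evaluate` are hypothetical callables supplied by the caller, the `max_epochs` guard is added for safety, and stopping exactly at the fifth decay is one reading of the termination rule.

```python
# Sketch of the halve-on-plateau learning-rate schedule described for the text
# models. Only the decay/termination logic follows the paper's description;
# `train_one_epoch` and `evaluate` are hypothetical placeholders.
import torch

def fit_with_plateau_decay(model, train_loader, val_loader, train_one_epoch, evaluate,
                           start_lr=1.0, patience=2, decay_factor=0.5,
                           max_decays=5, max_epochs=200):
    opt = torch.optim.SGD(model.parameters(), lr=start_lr)
    best_val, stale_epochs, num_decays = float("inf"), 0, 0
    for _ in range(max_epochs):
        train_one_epoch(model, train_loader, opt)
        val_loss = evaluate(model, val_loader)
        if val_loss < best_val:
            best_val, stale_epochs = val_loss, 0
        else:
            stale_epochs += 1
        if stale_epochs >= patience:
            # No improvement for `patience` epochs: halve the learning rate and
            # stop once it has been decayed `max_decays` times in total.
            num_decays += 1
            stale_epochs = 0
            for group in opt.param_groups:
                group["lr"] *= decay_factor
            if num_decays >= max_decays:
                break
    return model
```

PyTorch's built-in `torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min", factor=0.5, patience=2)` implements the same halve-on-plateau behaviour; the manual loop above is used only to make the five-decay termination rule explicit.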
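
The Dataset Splits row quotes a synthetic corpus of 20,000 length-10 sequences over a 1,000-word vocabulary with a 16000/2000/2000 train/val/test split. The snippet below only illustrates materializing a split of those sizes; the uniform random token sampling is a placeholder and not the paper's actual generation procedure.

```python
# Illustration of a 16000/2000/2000 split over 20,000 synthetic length-10
# sequences from a 1,000-token vocabulary. Uniform random sampling is a
# placeholder for the generation procedure, which is not detailed here.
import torch

vocab_size, seq_len, n_total = 1000, 10, 20_000
data = torch.randint(0, vocab_size, (n_total, seq_len))

perm = torch.randperm(n_total)
train_ids, val_ids, test_ids = perm[:16_000], perm[16_000:18_000], perm[18_000:]
train, val, test = data[train_ids], data[val_ids], data[test_ids]
assert len(train) == 16_000 and len(val) == 2_000 and len(test) == 2_000
```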