Learning to Drop Out: An Adversarial Approach to Training Sequence VAEs

Authors: Djordje Miladinovic, Kumar Shridhar, Kushal Jain, Max B. Paulus, Joachim M. Buhmann, Mrinmaya Sachan, Carl Allen

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here we present the results of our experiments that: (i) demonstrate that a sequence VAE trained with adversarial word dropout (AWD) outperforms other sequence VAEs; it achieves improved sentence modeling performance and/or improved informativeness of the latent space; (ii) examine the contributions and behaviour of the adversarial network's components and hyperparameters, and (iii) qualitatively study the trained adversary and VAE. (A hedged sketch of the adversarial word-dropout mechanism is given after the table.)
Researcher Affiliation | Collaboration | Đorđe Miladinović, Kumar Shridhar, Kushal Jain, Max B. Paulus, Joachim M. Buhmann, Mrinmaya Sachan, Carl Allen; affiliations: GSK.ai; ETH Zürich; University of California, San Diego
Pseudocode | No | The paper describes mathematical formulations and processes but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] (Supplementary)
Open Datasets | Yes | We conducted experiments on 4 different datasets: Yahoo questions and answers [Yang et al., 2017], Yelp reviews [Yang et al., 2017], Penn Tree Bank (PTB) [Marcus et al., 1993] and downsampled Stanford Natural Language Inference (SNLI) corpus [Bowman et al., 2015, Li et al., 2019].
Dataset Splits | Yes | Yahoo, Yelp and SNLI datasets contain 100K sentences in the training set, 10K in the validation set, and 10K in the test set, while PTB is much smaller with a total of 42K sentences.
Hardware Specification | Yes | All experiments are performed on a 12GB Nvidia Titan XP GPU with an average run time of 4 hours for Yelp and Yahoo and 1 hour for SNLI.
Software Dependencies | No | The paper mentions several techniques and models (e.g., LSTM, reparametrization trick, stochastic softmax trick, gradient reversal layer) and cites their original papers, but it does not specify version numbers for any software libraries or dependencies (e.g., PyTorch, TensorFlow, Python version). (A hedged gradient-reversal sketch follows the table.)
Experiment Setup | Yes | On each dataset, we performed the same grid search over both learning rate (from {0.0001, 0.001, 0.1, 1}) and dropout rate R (from {0.2, 0.3, 0.4, 0.5}) for both the word dropout baseline and our method. This gives 16 different hyperparameter configurations for each method on each dataset. For training, we also use an exponential learning rate decay of 0.96 as in [Li and Arora, 2019], increased the hidden state size of the decoder LSTM from 1024 to 2048 (except on SNLI), applied Polyak averaging [Polyak and Juditsky, 1992] with a coefficient of 0.9995 and used KL annealing [Bowman et al., 2016]. (The grid and schedule are summarised in the sketch after the table.)
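
Below is a minimal sketch of what the adversarial word-dropout idea quoted in the Research Type row could look like in code. The paper's supplementary implementation is not reproduced here and the framework is not stated, so this assumes a PyTorch-style setup; DropoutAdversary, corrupt_decoder_inputs, and all layer sizes are illustrative names and values, and the constraint that ties the adversary's average drop rate to the target rate R is omitted for brevity. The adversary scores each decoder input token and produces a relaxed (differentiable) keep/drop decision, standing in for the stochastic softmax trick mentioned in the paper; dropped positions are softly replaced by the <unk> embedding before the decoder reconstructs the sentence.

```python
import torch
import torch.nn as nn


class DropoutAdversary(nn.Module):
    """Scores each decoder input token; a score near 1 means 'drop this word'.

    Hypothetical sketch: the real architecture and hyperparameters are not given here.
    """

    def __init__(self, emb_dim, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, token_embs, temperature=1.0):
        h, _ = self.rnn(token_embs)              # (B, T, H)
        logits = self.score(h).squeeze(-1)       # (B, T)
        # Relaxed Bernoulli (binary concrete) sample: a differentiable
        # stand-in for a hard per-token keep/drop decision.
        u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
        noise = torch.log(u) - torch.log(1 - u)
        drop_prob = torch.sigmoid((logits + noise) / temperature)
        return drop_prob                         # ~1 => drop, ~0 => keep


def corrupt_decoder_inputs(token_embs, unk_emb, drop_prob):
    """Soft-replace dropped positions with the <unk> embedding.

    token_embs: (B, T, emb_dim), unk_emb: (emb_dim,), drop_prob: (B, T).
    """
    drop = drop_prob.unsqueeze(-1)               # (B, T, 1)
    return (1 - drop) * token_embs + drop * unk_emb
```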
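
The adversary is trained to make reconstruction hard while the VAE is trained to make it easy; the Software Dependencies row notes that the paper cites a gradient reversal layer for this. A standard PyTorch-style gradient reversal layer looks roughly as follows (again an illustrative sketch, not the authors' code):

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the adversary,
        # so one shared loss is minimised by the VAE and maximised by the adversary.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

A hypothetical wiring would be `corrupted = corrupt_decoder_inputs(token_embs, unk_emb, grad_reverse(drop_prob))`: the single reconstruction loss is then minimised with respect to the VAE's parameters but, because of the reversed gradients, maximised with respect to the adversary's.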
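
Finally, the quoted experiment setup translates directly into a small search grid and a handful of fixed training settings. The numeric values below are copied from the quote; the variable names and dictionary layout are ours:

```python
from itertools import product

# 4 learning rates x 4 target dropout rates R = 16 configurations
# per method (word dropout baseline vs. AWD) per dataset.
LEARNING_RATES = [0.0001, 0.001, 0.1, 1]
DROPOUT_RATES = [0.2, 0.3, 0.4, 0.5]

GRID = list(product(LEARNING_RATES, DROPOUT_RATES))
assert len(GRID) == 16

# Other reported training settings (field names are illustrative).
TRAIN_CONFIG = {
    "lr_decay": 0.96,             # exponential learning-rate decay
    "decoder_hidden_size": 2048,  # kept at 1024 on SNLI
    "polyak_coefficient": 0.9995,
    "kl_annealing": True,
}
```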