Learning to Drop Out: An Adversarial Approach to Training Sequence VAEs
Authors: Đorđe Miladinović, Kumar Shridhar, Kushal Jain, Max B. Paulus, Joachim M. Buhmann, Mrinmaya Sachan, Carl Allen
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here we present the results of our experiments that: (i) demonstrate that a sequence VAE trained with adversarial word dropout (AWD) outperforms other sequence VAEs; it achieves improved sentence modeling performance and/or improved informativeness of the latent space; (ii) examine the contributions and behaviour of the adversarial network's components and hyperparameters, and (iii) qualitatively study the trained adversary and VAE. A minimal sketch of the adversarial word-dropout wiring appears after this table. |
| Researcher Affiliation | Collaboration | Đorđe Miladinović, Kumar Shridhar, Kushal Jain, Max B. Paulus, Joachim M. Buhmann, Mrinmaya Sachan, Carl Allen; GSK.ai, ETH Zürich, University of California, San Diego |
| Pseudocode | No | The paper describes mathematical formulations and processes but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] (Supplementary) |
| Open Datasets | Yes | We conducted experiments on 4 different datasets: Yahoo questions and answers [Yang et al., 2017], Yelp reviews [Yang et al., 2017], Penn Tree Bank (PTB) [Marcus et al., 1993] and the downsampled Stanford Natural Language Inference (SNLI) corpus [Bowman et al., 2015, Li et al., 2019]. |
| Dataset Splits | Yes | Yahoo, Yelp and SNLI datasets contain 100K sentences in the training set, 10K in the validation set, and 10K in the test set, while PTB is much smaller with a total of 42K sentences. |
| Hardware Specification | Yes | All experiments are performed on a 12GB Nvidia Titan XP GPU with an average run time of 4 hours for Yelp and Yahoo and 1 hour for SNLI. |
| Software Dependencies | No | The paper mentions several techniques and models (e.g., LSTM, reparametrization trick, stochastic softmax trick, gradient reversal layer) and cites their original papers, but it does not specify version numbers for any software libraries or dependencies (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | On each dataset, we performed the same grid search over both learning rate (from {0.0001, 0.001, 0.1, 1}) and dropout rate R (from {0.2, 0.3, 0.4, 0.5}) for both the word dropout baseline and our method. This gives 16 different hyperparameter configurations for each method on each dataset. For training, we also use an exponential learning rate decay of 0.96 as in [Li and Arora, 2019], increased the hidden state size of the decoder LSTM from 1024 to 2048 (except on SNLI), applied Polyak averaging [Polyak and Juditsky, 1992] with a coefficient of 0.9995 and used KL annealing [Bowman et al., 2016]. A sketch of this training configuration appears after this table. |
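The paper's core mechanism, adversarial word dropout trained through a gradient reversal layer and a stochastic softmax trick, is only described in prose in the excerpts above. The PyTorch sketch below shows one plausible wiring, not the authors' implementation: the `DropoutAdversary` module, the relaxed Bernoulli sample standing in for the stochastic softmax trick, and the omission of the dropout-rate budget R are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of adversarial word dropout with a
# gradient reversal layer, both of which are referenced in the paper.
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class DropoutAdversary(nn.Module):
    """Scores decoder input tokens and samples a relaxed (differentiable) drop mask."""

    def __init__(self, emb_dim: int, hidden_dim: int = 128, temperature: float = 1.0):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        self.temperature = temperature

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, emb_dim) -> drop logits: (batch, seq_len)
        logits = self.scorer(token_embeddings).squeeze(-1)
        # Relaxed Bernoulli stands in for the paper's stochastic softmax trick;
        # rsample() keeps the mask differentiable w.r.t. the adversary.
        mask = torch.distributions.RelaxedBernoulli(
            logits.new_tensor(self.temperature), logits=logits
        ).rsample()
        return mask  # values near 1 mean "drop this token"


def apply_adversarial_dropout(token_embeddings, adversary, unk_embedding):
    """Replace adversarially chosen tokens with the <unk> embedding.

    The budget constraint tying the expected drop fraction to the rate R
    is omitted here for brevity.
    """
    # Detach so the adversary's scoring path does not update the embeddings directly.
    drop_mask = adversary(token_embeddings.detach())
    drop_mask = GradientReversal.apply(drop_mask).unsqueeze(-1)  # (batch, seq_len, 1)
    return (1.0 - drop_mask) * token_embeddings + drop_mask * unk_embedding
```

The gradient reversal is placed on the adversary's output mask so that a single backward pass lets the decoder descend the reconstruction loss while the adversary, sitting behind the reversal, effectively ascends it.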
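The quoted training setup (a 4 × 4 grid over learning rate and dropout rate R, exponential learning-rate decay of 0.96, Polyak averaging with coefficient 0.9995, and KL annealing) can be assembled from standard PyTorch utilities. The sketch below is a hedged reconstruction: the optimizer choice (Adam), the scheduling granularity, and the KL annealing length are assumptions not stated in the excerpts.

```python
import itertools
import torch

# Grid reported in the paper: 4 learning rates x 4 dropout rates = 16 configs
# per method and dataset.
LEARNING_RATES = [0.0001, 0.001, 0.1, 1]
DROPOUT_RATES = [0.2, 0.3, 0.4, 0.5]
GRID = list(itertools.product(LEARNING_RATES, DROPOUT_RATES))


def kl_weight(step: int, anneal_steps: int = 10_000) -> float:
    """Linear KL annealing from 0 to 1 (Bowman et al., 2016); the length is assumed."""
    return min(1.0, step / anneal_steps)


def make_training_objects(model: torch.nn.Module, lr: float):
    # Optimizer choice (Adam) is an assumption; the paper excerpt does not name one.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Exponential learning-rate decay with factor 0.96, as quoted above.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)
    # Polyak (exponential moving) weight averaging with coefficient 0.9995.
    polyak = torch.optim.swa_utils.AveragedModel(
        model, avg_fn=lambda avg, new, num: 0.9995 * avg + (1 - 0.9995) * new
    )
    return optimizer, scheduler, polyak
```

In use, the annealed weight would multiply the KL term of the ELBO (loss = reconstruction + kl_weight(step) * kl), `polyak.update_parameters(model)` would be called after each optimizer step, and the averaged weights would be used for evaluation.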