ControlVAE: Controllable Variational Autoencoder
Authors: Huajie Shao, Shuochao Yao, Dachun Sun, Aston Zhang, Shengzhong Liu, Dongxin Liu, Jun Wang, Tarek Abdelzaher
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The framework is evaluated on three applications, namely language modeling, disentangled representation learning, and image generation. The results show that ControlVAE can achieve much better reconstruction quality than competitive methods for comparable disentanglement performance. For language modeling, it not only averts KL vanishing but also improves the diversity of generated text. Finally, we also demonstrate that ControlVAE improves reconstruction quality for image generation compared to the original VAE. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA. (2) AWS Deep Learning, CA, USA. (3) Alibaba Group, Seattle, WA, USA. |
| Pseudocode | Yes | We summarize the proposed PI control algorithm in Algorithm 1. Our PI algorithm updates the hyperparameter β(t) with feedback from the sampled KL-divergence at training step t. (A Python sketch of this controller follows the table.) |
| Open Source Code | Yes | Source code is publicly available at https://github.com/shj1987/ControlVAE-ICML2020.git |
| Open Datasets | Yes | Language modeling: 1) Penn Tree Bank (PTB) (Marcus et al., 1993): it consists of 42,068 training sentences, 3,370 validation sentences, and 3,761 testing sentences. 2) Switchboard (SW) (Godfrey & Holliman, 1997): it has 2,400 two-sided telephone conversations with manually transcribed speech and alignment. The data is randomly split into 2,316, 60, and 62 dialogs for training, validation, and testing. Disentangling: 1) 2D Shapes (Matthey et al., 2017): it has 737,280 binary 64×64 images of 2D shapes with five ground-truth factors (number of values): shape (3), scale (6), orientation (40), x-position (32), y-position (32) (Kim & Mnih, 2018). Image generation: 1) CelebA (cropped version) (Liu et al., 2015): it has 202,599 RGB 128×128×3 images of celebrity faces. The data is split into 192,599 and 10,000 images for training and testing. |
| Dataset Splits | Yes | Language modeling: 1) Penn Tree Bank (PTB) (Marcus et al., 1993): it consists of 42,068 training sentences, 3,370 validation sentences, and 3,761 testing sentences. 2) Switchboard (SW) (Godfrey & Holliman, 1997): it has 2,400 two-sided telephone conversations with manually transcribed speech and alignment. The data is randomly split into 2,316, 60, and 62 dialogs for training, validation, and testing. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper mentions using Transformer as the decoder but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | The detailed model configurations and hyperparameter settings for each model are presented in Appendix A. Specifically, for language modeling: following the PI tuning strategy in Section 3.1, we set Kp and Ki of the PI algorithm in (6) to 0.01 and 0.0001, respectively. In addition, βmin is set to 0 and the maximum value of β(t) is limited to 1. For disentangling: since β(t) > 1, we set βmin to 1 for the PI algorithm in (6). Following the PI tuning method above, the coefficients Kp and Ki are set to 0.01 and 0.001, respectively. For image generation: we use the same PI control algorithm and hyperparameters as for language modeling. (The training-loop sketch after the table uses these settings.) |
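The pseudocode row above describes Algorithm 1 only in prose: a PI controller that adjusts β(t) from the sampled KL-divergence each step. Below is a minimal Python sketch of such a controller. The sigmoid-shaped P-term, the anti-windup rule, and the error clamp are reconstructed from the paper's Eq. (6) as we understand it and should be read as assumptions, not a verbatim reimplementation of the authors' code.

```python
import math

class PIController:
    """Minimal sketch of the PI controller summarized in Algorithm 1.

    Reconstructed from the paper's Eq. (6): the P-term is squashed through
    a sigmoid so it stays bounded, and the integral term accumulates the
    negative error. Details beyond the quoted text are assumptions.
    """

    def __init__(self, kp=0.01, ki=0.0001, beta_min=0.0, beta_max=1.0):
        self.kp = kp
        self.ki = ki
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.i_term = 0.0  # running integral contribution

    def step(self, desired_kl, observed_kl):
        # e(t): gap between the target KL and the KL sampled at step t.
        error = desired_kl - observed_kl
        # Clamp before exp() purely for numerical safety (an assumption).
        bounded = max(min(error, 30.0), -30.0)
        # Sigmoid-shaped P-term, bounded in (0, Kp).
        p_term = self.kp / (1.0 + math.exp(bounded))
        beta = p_term + self.i_term + self.beta_min
        # Freeze the integral when beta saturates (simple anti-windup).
        if self.beta_min <= beta <= self.beta_max:
            self.i_term -= self.ki * error
        return min(max(beta, self.beta_min), self.beta_max)
```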
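To show how the experiment-setup row translates into a training loop, here is a hedged usage sketch that reuses the `PIController` above with the per-task settings quoted in the table. The names `train_step`, `model`, and `target_kl` are hypothetical, `model` is assumed to return a `(recon_loss, kl)` pair of tensors, and the disentangling `beta_max` is an assumed cap since the quoted text only fixes `beta_min = 1`.

```python
# Per-task settings quoted in the table above; image generation reuses
# the language-modeling hyperparameters, as the paper states.
lm_controller  = PIController(kp=0.01, ki=0.0001, beta_min=0.0, beta_max=1.0)
dis_controller = PIController(kp=0.01, ki=0.001,  beta_min=1.0, beta_max=100.0)  # cap assumed
img_controller = PIController(kp=0.01, ki=0.0001, beta_min=0.0, beta_max=1.0)

def train_step(model, batch, controller, target_kl, optimizer):
    """Hypothetical step: `model(batch)` is assumed to return (recon_loss, kl)."""
    recon_loss, kl = model(batch)
    # Refresh beta(t) from the KL sampled at this training step.
    beta = controller.step(desired_kl=target_kl, observed_kl=kl.item())
    loss = recon_loss + beta * kl  # beta-weighted ELBO, as in ControlVAE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), beta
```

Recomputing β(t) from the sampled KL each step, rather than following a fixed annealing schedule, is what lets the controller hold the KL term near its setpoint and avert KL vanishing in the language-modeling experiments.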