On the Limitations of Multimodal VAEs
Authors: Imant Daunhawer, Thomas M. Sutter, Kieran Chin-Cheong, Emanuele Palumbo, Julia E. Vogt
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we showcase the generative quality gap on both synthetic and real data and present the tradeoffs between different variants of multimodal VAEs. |
| Researcher Affiliation | Academia | Imant Daunhawer, Thomas M. Sutter, Kieran Chin-Cheong, Emanuele Palumbo & Julia E. Vogt Department of Computer Science ETH Zurich dimant@ethz.ch |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | All of the used datasets are either public or can be generated from publicly available resources using the code that we provide in the supplementary material. |
| Open Datasets | Yes | PolyMNIST (Sutter et al., 2021) is a simple, synthetic dataset... Finally, Caltech Birds (CUB; Wah et al., 2011; Shi et al., 2019) is used to validate the limitations on a more realistic dataset with two modalities, images and captions. |
| Dataset Splits | No | The paper mentions training models and evaluating on a 'test set', but does not provide explicit numerical training, validation, and test splits (e.g., percentages or sample counts). Standard splits are implied but never defined. |
| Hardware Specification | Yes | In total, more than 400 models were trained, requiring approximately 1.5 GPU years of compute on a single NVIDIA GeForce RTX 2080 Ti GPU. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer (Kingma and Ba, 2015)' but does not specify version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | All models were trained using the Adam optimizer (Kingma and Ba, 2015) with learning rate 5e-4 and a batch size of 256. For image modalities we estimate likelihoods using Laplace distributions and for captions we employ one-hot categorical distributions. Models were trained for 500, 1000, and 150 epochs on PolyMNIST, Translated-PolyMNIST, and CUB respectively. Similar to previous work, we use Gaussian priors and a latent space with 512 dimensions for PolyMNIST and 64 dimensions for CUB. For the β-ablations, we use β ∈ {3e-4, 3e-3, 3e-1, 1, 3, 9} and, in addition, 32 for CUB. |
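
The setup row above can be summarized in code. This is a minimal sketch, not the authors' implementation: the paper excerpt names the hyperparameters, the Laplace image likelihood, and the β-ablation grid, but not a framework or architecture, so the helper functions below (`laplace_log_prob`, `beta_elbo`) and the unit Laplace scale are illustrative assumptions.

```python
import numpy as np

# Hyperparameters quoted from the paper's experiment setup.
LEARNING_RATE = 5e-4
BATCH_SIZE = 256
LATENT_DIM_POLYMNIST = 512
LATENT_DIM_CUB = 64
BETAS = [3e-4, 3e-3, 3e-1, 1, 3, 9]  # beta-ablation grid (plus 32 for CUB)

def laplace_log_prob(x, loc, scale=1.0):
    """Element-wise Laplace log-density: -log(2b) - |x - mu| / b.

    The paper states Laplace likelihoods for image modalities; a unit
    scale is an assumption made here for illustration.
    """
    return -np.log(2.0 * scale) - np.abs(x - loc) / scale

def beta_elbo(log_lik, kl, beta):
    """beta-weighted ELBO objective: reconstruction term minus beta * KL."""
    return log_lik - beta * kl

# Toy usage: score a batch of flattened "images" under a Laplace likelihood
# and weight the KL term by one value from the ablation grid.
rng = np.random.default_rng(0)
x = rng.random((BATCH_SIZE, 28 * 28))
loc = rng.random((BATCH_SIZE, 28 * 28))
log_lik = laplace_log_prob(x, loc).sum(axis=1).mean()
objective = beta_elbo(log_lik, kl=1.0, beta=BETAS[0])
```

This only illustrates how the reported β values enter the objective; model architectures and training loops are not specified in the excerpt.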