On the Limitations of Multimodal VAEs
Authors: Imant Daunhawer, Thomas M. Sutter, Kieran Chin-Cheong, Emanuele Palumbo, Julia E. Vogt
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we showcase the generative quality gap on both synthetic and real data and present the tradeoffs between different variants of multimodal VAEs. |
| Researcher Affiliation | Academia | Imant Daunhawer, Thomas M. Sutter, Kieran Chin-Cheong, Emanuele Palumbo & Julia E. Vogt Department of Computer Science ETH Zurich dimant@ethz.ch |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | All of the used datasets are either public or can be generated from publicly available resources using the code that we provide in the supplementary material. |
| Open Datasets | Yes | PolyMNIST (Sutter et al., 2021) is a simple, synthetic dataset... Finally, Caltech Birds (CUB; Wah et al., 2011; Shi et al., 2019) is used to validate the limitations on a more realistic dataset with two modalities, images and captions. |
| Dataset Splits | No | The paper mentions training models and evaluating on a 'test set', but does not provide explicit numerical training, validation, and test splits (e.g., percentages or sample counts). Standard splits are implied but never defined. |
| Hardware Specification | Yes | In total, more than 400 models were trained, requiring approximately 1.5 GPU years of compute on a single NVIDIA GeForce RTX 2080 Ti GPU. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer (Kingma and Ba, 2015)' but does not specify version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | All models were trained using the Adam optimizer (Kingma and Ba, 2015) with learning rate 5e-4 and a batch size of 256. For image modalities we estimate likelihoods using Laplace distributions and for captions we employ one-hot categorical distributions. Models were trained for 500, 1000, and 150 epochs on PolyMNIST, Translated-PolyMNIST, and CUB respectively. Similar to previous work, we use Gaussian priors and a latent space with 512 dimensions for PolyMNIST and 64 dimensions for CUB. For the β-ablations, we use β ∈ {3e-4, 3e-3, 3e-1, 1, 3, 9} and, in addition, 32 for CUB. |
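
The setup row above can be summarized in code. This is a minimal sketch, not the authors' implementation: the paper excerpt names the hyperparameters, the Laplace image likelihood, and the β-ablation grid, but not a framework or architecture, so the helper functions below (`laplace_log_prob`, `beta_elbo`) and the unit Laplace scale are illustrative assumptions.

```python
import numpy as np

# Hyperparameters quoted from the paper's experiment setup.
LEARNING_RATE = 5e-4
BATCH_SIZE = 256
LATENT_DIM_POLYMNIST = 512
LATENT_DIM_CUB = 64
BETAS = [3e-4, 3e-3, 3e-1, 1, 3, 9]  # beta-ablation grid (plus 32 for CUB)

def laplace_log_prob(x, loc, scale=1.0):
    """Element-wise Laplace log-density: -log(2b) - |x - mu| / b.

    The paper states Laplace likelihoods for image modalities; a unit
    scale is an assumption made here for illustration.
    """
    return -np.log(2.0 * scale) - np.abs(x - loc) / scale

def beta_elbo(log_lik, kl, beta):
    """beta-weighted ELBO objective: reconstruction term minus beta * KL."""
    return log_lik - beta * kl

# Toy usage: score a batch of flattened "images" under a Laplace likelihood
# and weight the KL term by one value from the ablation grid.
rng = np.random.default_rng(0)
x = rng.random((BATCH_SIZE, 28 * 28))
loc = rng.random((BATCH_SIZE, 28 * 28))
log_lik = laplace_log_prob(x, loc).sum(axis=1).mean()
objective = beta_elbo(log_lik, kl=1.0, beta=BETAS[0])
```

This only illustrates how the reported β values enter the objective; model architectures and training loops are not specified in the excerpt.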