$\alpha$-TCVAE: On the relationship between Disentanglement and Diversity

Authors: Cristian Meo, Louis Mahon, Anirudh Goyal, Justin Dauwels

ICLR 2024

Reproducibility variables, each listed with its result and the LLM's supporting response:
Research Type: Experimental. "In this work, we introduce α-TCVAE, a variational autoencoder optimized using a novel total correlation (TC) lower bound that maximizes disentanglement and the informativeness of the latent variables. The proposed TC bound is grounded in information-theoretic constructs, generalizes the β-VAE lower bound, and reduces to a convex combination of the known variational information bottleneck (VIB) and conditional entropy bottleneck (CEB) terms. Moreover, we present quantitative analyses and correlation studies supporting the idea that smaller latent domains (i.e., disentangled representations) lead to better generative capabilities and diversity. Additionally, we perform downstream-task experiments from both the representation and RL domains to assess these questions from a broader ML perspective. Our results demonstrate that α-TCVAE consistently learns more disentangled representations than the baselines and generates more diverse observations without sacrificing visual fidelity. Notably, α-TCVAE exhibits marked improvements on MPI3D-Real, the most realistic disentangled dataset in our study, confirming its ability to represent complex datasets when maximizing the informativeness of individual variables. Finally, testing the proposed model off-the-shelf on a state-of-the-art model-based RL agent, Director, demonstrates α-TCVAE's downstream usefulness on the loconav Ant Maze task."
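As a rough illustration of the objective the abstract describes (a β-VAE-style bound whose rate term is a convex combination of VIB- and CEB-style KL penalties), the sketch below is a minimal, hypothetical PyTorch rendering. The function name `alpha_tc_loss`, the Bernoulli likelihood, and the batch-level Gaussian used as the CEB-style reference are assumptions for illustration, not the paper's actual estimators; see the official repository for the real bound.

```python
import torch
import torch.nn.functional as F

def alpha_tc_loss(x, x_recon, mu, logvar, alpha=0.25, beta=1.0):
    """Hypothetical sketch: distortion plus a convex combination of a
    VIB-style and a CEB-style rate term, as the abstract describes."""
    batch = x.size(0)

    # Distortion: reconstruction NLL (Bernoulli likelihood; an assumption here).
    distortion = F.binary_cross_entropy(x_recon, x, reduction="sum") / batch

    # VIB-style rate: KL(q(z|x) || N(0, I)), the standard beta-VAE term.
    kl_vib = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch

    # CEB-style rate: KL of q(z|x) against a batch-level Gaussian fit to the
    # aggregate posterior, used here as a stand-in conditional reference.
    mu_m = mu.mean(0)
    var_m = (logvar.exp() + mu.pow(2)).mean(0) - mu_m.pow(2) + 1e-8
    kl_ceb = 0.5 * torch.sum(
        (logvar.exp() + (mu - mu_m).pow(2)) / var_m - 1 + var_m.log() - logvar
    ) / batch

    # Convex combination of the two rate terms, scaled by beta overall.
    return distortion + beta * (alpha * kl_ceb + (1.0 - alpha) * kl_vib)
```

Note that with `alpha = 0` the rate collapses to the standard β-VAE/VIB penalty, consistent with the abstract's claim that the proposed bound generalizes the β-VAE lower bound.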
Researcher Affiliation: Collaboration. Cristian Meo (TU Delft, NL); Louis Mahon (University of Edinburgh, UK); Anirudh Goyal (Google DeepMind, UK); Justin Dauwels (TU Delft, NL).
Pseudocode: No. The paper provides mathematical derivations and explanations but does not include pseudocode or algorithm blocks.
Open Source Code: Yes. Implementation available at https://github.com/Cmeo97/Alpha-TCVAE.
Open Datasets: Yes. "We validate the considered models on the following datasets. Teapots (Moreno et al., 2016) contains 200,000 images of teapots with features: azimuth, elevation, and object colour. 3DShapes (Burgess & Kim, 2018) contains 480,000 images, with features: object shape and colour, floor colour, wall colour, and horizontal orientation. MPI3D-Real (Gondal et al., 2019) contains 103,680 images of objects at the end of a robot arm, with features: object colour, size, shape, camera height, azimuth, and robot arm altitude. Cars3D (Reed et al., 2015) contains 16,185 images with features: car type, elevation, and azimuth. CelebA (Liu et al., 2015) contains over 200,000 images of faces under a broad range of poses, facial expressions, and lighting conditions, with 40 annotated factors in total."
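For concreteness, the snippet below sketches how one of these datasets (3DShapes) is typically loaded from its standard `3dshapes.h5` release, which stores `images` and `labels` arrays; the file path and batch-sampling logic are illustrative, not taken from the paper.

```python
import h5py
import numpy as np

# Minimal sketch of loading 3DShapes from the standard `3dshapes.h5` release.
with h5py.File("3dshapes.h5", "r") as f:
    images = f["images"]   # (480000, 64, 64, 3) uint8 observations
    labels = f["labels"]   # (480000, 6) ground-truth factor values
    # h5py fancy indexing requires sorted indices.
    idx = np.sort(np.random.choice(len(images), size=64, replace=False))
    batch = images[idx].astype(np.float32) / 255.0

print(batch.shape)  # (64, 64, 64, 3)
```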
Dataset Splits: No. "Every model is trained using a subset containing 80% of the selected dataset's images in a fully unsupervised way. The models are evaluated on the remaining images using the following downstream scores." This indicates a train/test split, but no explicit validation split is mentioned.
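A minimal sketch of the 80/20 train/test split described above, assuming a PyTorch pipeline; the placeholder dataset and fixed seed are illustrative, not specified by the paper.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder dataset standing in for any of the image datasets above.
data = TensorDataset(torch.rand(1000, 3, 64, 64))

# 80% for unsupervised training, the remaining 20% for evaluation;
# no validation split, matching what the paper reports.
n_train = int(0.8 * len(data))
train_set, test_set = random_split(
    data, [n_train, len(data) - n_train],
    generator=torch.Generator().manual_seed(0),
)
```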
Hardware Specification: Yes. "Moreover, StyleGAN takes 15x the training time (2 hrs vs. >30 hrs on a single Nvidia Titan XP)."
Software Dependencies: No. The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup: Yes. "The hyperparameters used for the different experiments are shown in Table 2." Table 2 (comparison of the hyperparameters used across datasets):

| Dataset | β | γ | α | Latent dim K | Training epochs |
|---|---|---|---|---|---|
| Teapots | 2 | 10 | 0.25 | 10 | 50 |
| 3DShapes | 3 | 10 | 0.25 | 10 | 50 |
| Cars3D | 4 | 10 | 0.25 | 10 | 50 |
| MPI3D-Real | 5 | 10 | 0.25 | 10 | 50 |
| CelebA | 5 | 10 | 0.25 | 48 | 50 |

"All encoder, decoder, and discriminator architectures are taken from Roth et al. (2023)."
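For convenience, Table 2 can be transcribed into a config map like the sketch below. The dictionary layout and the `AlphaTCVAE` constructor in the usage comment are hypothetical; the values are the hyperparameters reported in the table.

```python
# Table 2 transcribed as a config map (values from the paper; keys illustrative).
HPARAMS = {
    "Teapots":    dict(beta=2, gamma=10, alpha=0.25, latent_dim=10, epochs=50),
    "3DShapes":   dict(beta=3, gamma=10, alpha=0.25, latent_dim=10, epochs=50),
    "Cars3D":     dict(beta=4, gamma=10, alpha=0.25, latent_dim=10, epochs=50),
    "MPI3D-Real": dict(beta=5, gamma=10, alpha=0.25, latent_dim=10, epochs=50),
    "CelebA":     dict(beta=5, gamma=10, alpha=0.25, latent_dim=48, epochs=50),
}

cfg = HPARAMS["MPI3D-Real"]
# e.g., model = AlphaTCVAE(**cfg)  # hypothetical constructor, see the repo
```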