Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders

Authors: Masha Itkina, Boris Ivanovic, Ransalu Senanayake, Mykel J. Kochenderfer, Marco Pavone

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on diverse tasks, such as image generation and human behavior prediction, demonstrate the effectiveness of our proposed technique at reducing the discrete latent sample space size of a model while maintaining its learned multimodality.
Researcher Affiliation | Academia | Masha Itkina, Boris Ivanovic, Ransalu Senanayake, Mykel J. Kochenderfer, Marco Pavone, Department of Aeronautics and Astronautics, Stanford University, {mitkina, borisi, ransalu, mykel, pavone}@stanford.edu
Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | The code to reproduce our results can be found at: https://github.com/sisl/EvidentialSparsification.
Open Datasets | Yes | To validate our method, we consider CVAE architectures designed for the tasks of class-conditioned image generation and pedestrian trajectory prediction. These real-world tasks require modeling high degrees of distributional multimodality. We compare our method to the softmax distribution and the popular class-reduction technique termed sparsemax, which achieves a sparse distribution by projecting an input vector onto the probability simplex [24]. By design, both our method and sparsemax compute an implicit threshold for each input query post hoc. Thus, they do not need to be tuned for each network or dataset, and automatically adapt to individual input features. Experiments indicate that our method is able to better balance the objectives of sparsity and multimodality than sparsemax by keeping only the latent classes that receive direct evidence from the network's features and weights, as described in Section 2. We demonstrate that our method maintains distributional multimodality, unlike sparsemax, whilst yielding a significantly reduced latent sample space size over softmax. The code to reproduce our results can be found at: https://github.com/sisl/EvidentialSparsification. [...] To gain insight into our proposed approach, we run experiments on a small network trained on MNIST [25]. We then demonstrate the performance of our sparsification algorithm on the large discrete latent space within the state-of-the-art VQ-VAE [2] architecture trained on miniImageNet [26]. All image generation experiments were run on a single NVIDIA GeForce GTX 1070 GPU. [...] We evaluate Trajectron++'s performance with different probability filtering schemes on 203 randomly-sampled examples from the test set of the ETH pedestrian dataset [35], consisting of real-world human trajectories with rich interaction scenarios.
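The row above turns on sparsemax's projection onto the probability simplex, so a minimal NumPy sketch of that projection (the algorithm of Martins and Astudillo, cited as [24] above) may help; the function name and example logits are illustrative assumptions, not taken from the paper's code.

    import numpy as np

    def sparsemax(z):
        # Project logits z onto the probability simplex; unlike softmax,
        # classes whose logits fall below a data-dependent threshold tau
        # receive exactly zero probability mass.
        z = np.asarray(z, dtype=float)
        z_sorted = np.sort(z)[::-1]              # logits in descending order
        k = np.arange(1, z.size + 1)
        cumsum = np.cumsum(z_sorted)
        support = 1 + k * z_sorted > cumsum      # classes kept in the support
        k_z = k[support][-1]                     # support size
        tau = (cumsum[k_z - 1] - 1.0) / k_z      # implicit, input-dependent threshold
        return np.maximum(z - tau, 0.0)

    # The third class is pruned while the first two retain mass.
    print(sparsemax([1.0, 0.9, 0.1]))            # -> [0.55, 0.45, 0.0]

Note how the threshold tau is computed from the input itself, matching the quote's point that neither sparsemax nor the paper's evidential method needs per-network or per-dataset tuning; the paper's method differs in that it zeroes classes based on evidence from the network's features and weights rather than a pure simplex projection.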
Dataset Splits | No | The paper mentions a 'test set' for the ETH pedestrian dataset and a 'held-out subset' for miniImageNet, but it does not provide specific details on training, validation, or test splits (e.g., percentages or counts) that would be needed for full reproducibility of dataset partitioning for the main models.
Hardware Specification | Yes | All image generation experiments were run on a single NVIDIA GeForce GTX 1070 GPU. [...] Behavior prediction model training and experiments were performed on two NVIDIA GTX 1080 Ti GPUs.
Software Dependencies | No | The paper mentions using PyTorch implicitly through a GitHub link for VQ-VAE, but it does not specify version numbers for any software, libraries, or dependencies.
Experiment Setup | Yes | We choose K = 10 latent classes; with a perfectly trained network, this would yield a 5-modal distribution when conditioned on one of y ∈ {even, odd}. [...] The Gumbel-Softmax distribution is used to backpropagate gradients through the discrete latent space [7, 29]. The model is trained to maximize the standard conditional evidence lower bound (ELBO) [16]. [...] We consider a latent space of 32 × 32 discrete latent variables with K = 512 classes each. As in the original paper [2], we train a PixelCNN [31] network for the prior, but reduce its capacity to 20 layers with a hidden dimension size of 128. [...] The loss function is comprised of the classic conditional ELBO loss in a β-VAE [33] scheme and a mutual information loss term on z and y as per [34].
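As a companion to the Gumbel-Softmax detail quoted in this row, the following is a minimal sketch of drawing a relaxed categorical sample over the K = 10 latent classes; the function name, temperature value, and rng handling are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
        # Differentiable approximation of a one-hot categorical sample:
        # add i.i.d. Gumbel(0, 1) noise to the logits, then apply a
        # temperature-scaled softmax. As temperature -> 0, the sample
        # approaches a discrete one-hot vector.
        rng = np.random.default_rng() if rng is None else rng
        u = rng.uniform(low=1e-12, high=1.0, size=np.shape(logits))
        g = -np.log(-np.log(u))                        # Gumbel(0, 1) noise
        y = (np.asarray(logits, dtype=float) + g) / temperature
        y = y - y.max()                                # numerical stability
        e = np.exp(y)
        return e / e.sum()

    # A relaxed sample over the K = 10 MNIST latent classes quoted above.
    sample = gumbel_softmax_sample(np.zeros(10), temperature=0.5)

Because the relaxed sample is a smooth function of the logits, gradients can flow through it during training, which is what allows the conditional ELBO to be maximized over a discrete latent space.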