NVAE: A Deep Hierarchical Variational Autoencoder
Authors: Arash Vahdat, Jan Kautz
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, CelebA 64, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ as shown in Fig. 1. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256×256 pixels. |
| Researcher Affiliation | Industry | Arash Vahdat, Jan Kautz NVIDIA {avahdat, jkautz}@nvidia.com |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It provides architectural diagrams and mathematical formulations instead. |
| Open Source Code | Yes | The source code is available at https://github.com/NVlabs/NVAE. |
| Open Datasets | Yes | We examine NVAE on the dynamically binarized MNIST [72], CIFAR-10 [73], ImageNet 32×32 [74], CelebA 64×64 [75, 76], CelebA HQ 256×256 [28], and FFHQ 256×256 [77] datasets. |
| Dataset Splits | No | The paper reports results on standard benchmark datasets, implying conventional splits, but it does not explicitly state the train/validation/test partitions (e.g., percentages or exact sample counts for each partition) needed for reproducibility. |
| Hardware Specification | Yes | On a 12-GB Titan V GPU, we can sample a batch of 36 images of the size 256×256 px in 2.03 seconds (56 ms/image). |
| Software Dependencies | No | The paper mentions using the 'NVIDIA APEX library [54]' but does not provide specific version numbers for this or any other key software dependencies (e.g., deep learning frameworks, Python version) required for reproducibility. |
| Experiment Setup | Yes | For large image datasets such as CelebA HQ and FFHQ, NVAE consists of 36 groups of latent variables starting from 8×8 dims, scaled up to 128×128 dims with two residual cells per latent variable group. The implementation details are provided in Sec. A in Appendix. We apply KL balancing mechanism only during KL warm-up (the first 25000 iterations). |
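The KL schedule quoted in the Experiment Setup row (a warm-up over the first 25,000 iterations, with per-group KL balancing applied only during warm-up) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not code from the NVlabs/NVAE repository: the linear warm-up shape, the function names, and the normalization of the balancing weights are all assumptions made for clarity.

```python
# Hypothetical sketch of NVAE-style KL warm-up with per-group balancing.
# Assumption: beta ramps linearly to 1.0 over the warm-up window, and the
# balancing weights are proportional to each group's current KL magnitude
# (normalized to sum to the number of groups) so no group collapses early.

WARMUP_ITERS = 25_000  # "the first 25000 iterations" from the paper


def kl_coefficients(step, kl_per_group):
    """Return (beta, per-group weights) for a given training step.

    kl_per_group: list of average KL values, one per latent-variable group.
    """
    beta = min(step / WARMUP_ITERS, 1.0)  # linear KL warm-up (assumed shape)
    if step < WARMUP_ITERS:
        # Balancing: weight each group by its share of the total KL,
        # scaled so the weights sum to the number of groups.
        total = sum(kl_per_group) or 1.0
        n = len(kl_per_group)
        weights = [n * kl / total for kl in kl_per_group]
    else:
        # After warm-up the paper's balancing mechanism is switched off,
        # leaving the plain (unweighted) KL terms.
        weights = [1.0] * len(kl_per_group)
    return beta, weights


def total_loss(recon_loss, kl_per_group, step):
    """Negative ELBO with warmed-up, balanced KL terms."""
    beta, weights = kl_coefficients(step, kl_per_group)
    return recon_loss + beta * sum(w * kl for w, kl in zip(weights, kl_per_group))
```

For example, at step 0 the KL contribution is zero regardless of the balancing weights, and after iteration 25,000 the loss reduces to the ordinary reconstruction-plus-KL objective with unit weights.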