Autodecoding Latent 3D Diffusion Models

Authors: Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc Van Gool, Sergey Tulyakov

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our evaluations demonstrate that our generation results outperform state-of-the-art alternatives on various benchmark datasets and metrics, including multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects." and "4 Results and Evaluations: In this section, we evaluate our method on multiple diverse datasets (see Sec. 4.1) for both unconditional (Sec. 4.2) and conditional (Sec. 4.5) settings. We also ablate the design choices in our autodecoder and diffusion in Secs. 4.3 and 4.4, respectively."
Researcher Affiliation | Collaboration | Evangelos Ntavelis (Computer Vision Lab, ETH Zurich, Zürich, Switzerland; entavelis@vision.ee.ethz.ch); Aliaksandr Siarohin (Creative Vision, Snap Inc., Santa Monica, CA, USA; asiarohin@snapchat.com); Kyle Olszewski (Creative Vision, Snap Inc., Santa Monica, CA, USA; kolszewski@snap.com); Chaoyang Wang (CI2CV Lab, Carnegie Mellon University, Pittsburgh, PA, USA; chaoyanw@cs.cmu.edu); Luc Van Gool (CVL, ETH Zurich, CH; PSI, KU Leuven, BE; INSAIT, Un. Sofia, BG; vangool@vision.ee.ethz.ch); Sergey Tulyakov (Creative Vision, Snap Inc., Santa Monica, CA, USA; stulyakov@snapchat.com)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks with explicit labels such as "Pseudocode" or "Algorithm".
Open Source Code | Yes | "Code & Visualizations: https://github.com/snap-research/3DVADER"
Open Datasets | Yes | "PhotoShape Chairs [57]", "ABO Tables [13]", "Objaverse. This dataset [14] contains 800K publicly available 3D models.", "MVImgNet. For this dataset [92]...", and "CelebV-Text. The CelebV-Text dataset [90]..."
Dataset Splits | No | The paper mentions training on datasets and evaluating results, but it does not explicitly provide train/validation/test splits with percentages, sample counts, or citations to predefined splits.
Hardware Specification | Yes | "We run our experiments on 8 NVIDIA A100 40GB GPUs per node. For some experiments, we use a single node, while for larger-scale experiments, we use up to 8 nodes in parallel." (A hedged multi-node training sketch is given below the table.)
Software Dependencies | No | "Our experiments are implemented in PyTorch [58, 59], using the PyTorch Lightning [19] framework for fast automatic differentiation and scalable GPU-accelerated parallelization. For calculating the perceptual metrics (FID and KID), we used the Torch Fidelity [56] library." The paper names PyTorch, PyTorch Lightning, and Torch Fidelity, but does not provide version numbers for these software components. (A hedged FID/KID computation sketch is given below the table.)
Experiment Setup | Yes | "We use the Adam optimizer [37] to train both the autodecoder and the diffusion model. For the first network, we use a learning rate lr = 5e-4 and beta parameters β = (0.5, 0.999). For diffusion, we set the learning rate to lr = 4.5e-4. We apply linear decay to the learning rate." and "Table 5: Architecture details for our models for each dataset. SA and CA stand for Self-Attention and Cross-Attention, respectively. z refers to our 1D embedding vector and our latent 3D volume for the autodecoder and diffusion models, respectively." (A hedged optimizer configuration sketch is given below the table.)
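
Multi-node training sketch. The hardware row reports 8 NVIDIA A100 40GB GPUs per node and up to 8 nodes, and the paper states it uses PyTorch Lightning for scalable GPU-accelerated parallelization. The snippet below is a minimal sketch of how such a launch could be configured with Lightning's Trainer; the model, data module, distribution strategy, and training length are assumptions, not details taken from the paper or its repository.

```python
# Hedged sketch: multi-GPU / multi-node training with PyTorch Lightning,
# matching the reported hardware (8x NVIDIA A100 40GB per node, up to 8 nodes).
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,        # 8 GPUs per node, as reported in the paper
    num_nodes=8,      # up to 8 nodes for the larger-scale experiments
    strategy="ddp",   # assumed distributed strategy; not stated in the excerpt
    max_epochs=100,   # placeholder training length
)
# trainer.fit(model, datamodule=data)  # `model` and `data` are hypothetical placeholders
```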
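
FID/KID computation sketch. The paper reports using the Torch Fidelity library for the perceptual metrics. The snippet below is a minimal, hedged example of computing FID and KID with torch-fidelity's `calculate_metrics`; the image directory paths are hypothetical placeholders, and the paper's exact evaluation protocol (sample counts, resolution) is not reproduced here.

```python
# Hedged sketch: computing FID and KID with the torch-fidelity library.
# The directory paths below are hypothetical placeholders.
import torch_fidelity

metrics = torch_fidelity.calculate_metrics(
    input1="renders/generated",  # folder of generated images (placeholder path)
    input2="renders/real",       # folder of reference images (placeholder path)
    cuda=True,
    fid=True,
    kid=True,
)
print(metrics["frechet_inception_distance"], metrics["kernel_inception_distance_mean"])
```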
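
Optimizer configuration sketch. The experiment-setup row quotes the optimizer hyperparameters: Adam with lr = 5e-4 and β = (0.5, 0.999) for the autodecoder, lr = 4.5e-4 for the diffusion model, with linear learning-rate decay. The sketch below wires these values into PyTorch; the placeholder modules, the diffusion model's betas, and the length of the linear decay are assumptions not stated in the excerpt.

```python
# Hedged sketch of the reported optimizer settings in PyTorch.
# The two nn.Linear modules stand in for the actual autodecoder and
# diffusion networks, which are not reproduced here.
import torch
import torch.nn as nn

autodecoder = nn.Linear(16, 16)      # placeholder for the autodecoder network
diffusion_model = nn.Linear(16, 16)  # placeholder for the diffusion network

autodecoder_opt = torch.optim.Adam(
    autodecoder.parameters(), lr=5e-4, betas=(0.5, 0.999)
)
diffusion_opt = torch.optim.Adam(
    diffusion_model.parameters(), lr=4.5e-4  # betas for diffusion are not given in the excerpt
)

# "Linear decay" read here as a learning rate that decays linearly to zero
# over a hypothetical number of steps; the schedule length is an assumption.
total_steps = 100_000
schedulers = [
    torch.optim.lr_scheduler.LambdaLR(
        opt, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
    )
    for opt in (autodecoder_opt, diffusion_opt)
]
```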