VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids

Authors: Katja Schwarz, Axel Sauer, Michael Niemeyer, Yiyi Liao, Andreas Geiger

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate that monolithic MLPs can indeed be replaced by 3D convolutions when combining sparse voxel grids with progressive growing, free space pruning and appropriate regularization. To obtain a compact representation of the scene and allow for scaling to higher voxel resolutions, our model disentangles the foreground object (modeled in 3D) from the background (modeled in 2D). (A hedged sparse-convolution sketch follows the table.)
Researcher Affiliation | Academia | Katja Schwarz¹, Axel Sauer¹, Michael Niemeyer¹, Yiyi Liao², Andreas Geiger¹; ¹University of Tübingen and Max Planck Institute for Intelligent Systems, Tübingen; ²Zhejiang University, China
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper.
Open Source Code | Yes | Code and models are available at https://github.com/autonomousvision/voxgraf.
Open Datasets | Yes | The synthetic Carla dataset [8, 37] contains 10k images and camera poses of 18 car models with randomly sampled colors. FFHQ [19] comprises 70k aligned face images. AFHQv2 Cats [5] consists of 4834 cat faces.
Dataset Splits | No | The paper mentions using "the full dataset" for FID evaluation and augmenting the datasets, but does not explicitly specify the training/validation/test splits (e.g., percentages or sample counts) used to train the model itself.
Hardware Specification | Yes | Depending on the dataset, we train our models for 3 to 7 days on 8 Tesla V100 GPUs. For all runtime comparisons, we report times on a single Tesla V100 GPU with a batch size of 1.
Software Dependencies | No | The paper mentions using custom CUDA kernels, the Minkowski Engine library, and the StyleGAN2 architecture, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We train our approach with Adam [21] using a batch size of 64 at grid resolutions R_G = 32 and 64, and a batch size of 32 at R_G = 128. We use a learning rate of 0.0025 for the generator and 0.002 for the discriminator. (An optimizer configuration sketch follows the table.)
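The paper's core claim pairs sparse 3D convolutions with free-space pruning, and it cites the Minkowski Engine library. The sketch below is a hedged illustration of what such a block might look like, not the authors' implementation: the layer sizes, module names, and the density threshold are all assumptions.

```python
# A minimal sketch of a sparse 3D convolution block with free-space pruning,
# loosely following the paper's description. Uses MinkowskiEngine, which the
# paper cites; channel counts, names, and the threshold are assumptions.
import torch
import MinkowskiEngine as ME


class SparseVoxelBlock(torch.nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # 3x3x3 sparse convolution: only occupied voxels are stored and computed.
        self.conv = ME.MinkowskiConvolution(
            in_channels, out_channels, kernel_size=3, dimension=3
        )
        self.act = ME.MinkowskiReLU()
        # Per-voxel density prediction used to decide which voxels to prune.
        self.to_density = ME.MinkowskiConvolution(
            out_channels, 1, kernel_size=1, dimension=3
        )
        self.prune = ME.MinkowskiPruning()

    def forward(self, x: ME.SparseTensor, threshold: float = 0.01) -> ME.SparseTensor:
        x = self.act(self.conv(x))
        density = self.to_density(x)
        # Drop voxels whose predicted density falls below the (assumed) threshold,
        # keeping the grid sparse as resolution grows progressively.
        keep = density.F.squeeze(-1) > threshold
        return self.prune(x, keep)


# Usage: a toy sparse grid with 4 occupied voxels (batch index + xyz coordinates).
coords = torch.IntTensor([[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 2, 1], [0, 3, 3, 3]])
feats = torch.randn(4, 8)
grid = ME.SparseTensor(features=feats, coordinates=coords)
out = SparseVoxelBlock(8, 16)(grid)
```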
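The reported optimization settings translate directly into two Adam optimizers. A minimal sketch, assuming standard PyTorch modules; `generator` and `discriminator` below are placeholder stand-ins, not the classes from the released code:

```python
# Optimizer setup matching the reported hyperparameters: Adam, lr 0.0025 for the
# generator and 0.002 for the discriminator; batch size 64 at grid resolutions
# 32 and 64, and 32 at resolution 128.
import torch

generator = torch.nn.Linear(512, 512)    # placeholder for the VoxGRAF generator
discriminator = torch.nn.Linear(512, 1)  # placeholder for the discriminator

g_optim = torch.optim.Adam(generator.parameters(), lr=0.0025)
d_optim = torch.optim.Adam(discriminator.parameters(), lr=0.002)

# Batch size depends on the grid resolution used during progressive growing.
batch_size_per_resolution = {32: 64, 64: 64, 128: 32}
```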