Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

Authors: Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew Hudson, Igor Gilitschenski, Yusuf Aytar, Sjoerd van Steenkiste, Kelsey Allen, Thomas Kipf

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct extensive experiments to answer the following questions: (i) Can Neural Assets enable accurate 3D object editing? (ii) What practical applications does our method support on real-world scenes? (iii) What is the impact of each design choice in our framework? We report common metrics to measure the quality of the edited image: PSNR, SSIM [104], LPIPS [117], and FID [42]. (A minimal PSNR sketch appears after this table.)
Researcher Affiliation | Collaboration | ¹Google DeepMind, ²Google Research, ³University of Toronto, ⁴Vector Institute, ⁵UCL
Pseudocode | No | The paper describes the proposed method using textual explanations and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'Additional details and video results are available at our project page.' and 'Project page: neural-assets.github.io'. However, the project page indicates 'Code coming soon!', so the source code is not yet publicly available.
Open Datasets | Yes | We select four datasets with object or camera motion, which span different levels of complexity. OBJect [67]... MOVi-E [36]... Objectron [1]... Waymo Open [97]... This dataset is under the Open Data Commons Attribution License (ODC-By). The full data generation pipeline is under the Apache 2.0 license. Objectron is licensed under the Computational Use of Data Agreement 1.0 (C-UDA-1.0). Waymo Open is licensed under the Waymo Dataset License Agreement for Non-Commercial Use (August 2019).
Dataset Splits | No | The paper describes training and testing procedures and metrics, but it does not explicitly specify validation splits or how validation data was used.
Hardware Specification | Yes | We train all model components jointly using the Adam optimizer [53] with a batch size of 1536 on 256 TPUv5 chips (16GB memory each).
Software Dependencies | No | We implement the entire Neural Assets framework in JAX [10] using the Flax [40] neural network library. However, specific version numbers for these libraries are not provided.
Experiment Setup | Yes | For all experiments, we resize images to 256×256. DINO self-supervised pre-trained ViT-B/8 [13] is adopted as the visual encoder Enc, and jointly fine-tuned with the generator. All our models are trained using the Adam optimizer [53] with a batch size of 1536 on 256 TPUv5 chips (16GB memory each). We use a peak learning rate of 5×10⁻⁵ for the image generator and the visual encoder, and a larger learning rate of 1×10⁻³ for the remaining layers (MLPs and linear projection layers). Both learning rates are linearly warmed up in the first 1,000 steps and then stay constant. A gradient clipping of 1.0 is applied to stabilize training. We train the model for 200k steps on OBJect and MOVi-E, which takes 24 hours, and 50k steps on Objectron and Waymo Open, which takes 6 hours. To apply classifier-free guidance (CFG) [43], we randomly drop the appearance and pose tokens (i.e., setting them to zeros) with a probability of 10%. We run the DDIM sampler [95] for 50 steps to generate images. We found the model works well with a CFG scale between 1.5 and 4, and thus choose to use 2.0 in all experiments. (Sketches of the optimizer schedule and the CFG logic appear below.)
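On the metrics row: PSNR is simple enough to state directly, while SSIM [104], LPIPS [117], and FID [42] involve windowed statistics or pre-trained networks and are best taken from reference implementations. A minimal PSNR sketch in JAX (the paper's framework of record), assuming images scaled to [0, 1]:

```python
import jax.numpy as jnp

def psnr(pred: jnp.ndarray, target: jnp.ndarray, max_val: float = 1.0):
    """Peak signal-to-noise ratio for images scaled to [0, max_val]."""
    mse = jnp.mean((pred - target) ** 2)
    return 20.0 * jnp.log10(max_val) - 10.0 * jnp.log10(mse)
```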
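The training recipe in the Experiment Setup row (Adam, linear warmup to a constant rate, gradient clipping of 1.0, and a larger learning rate for the new MLP/projection layers) maps naturally onto Optax. This is a minimal sketch, not the authors' code; the 'backbone'/'adapters' parameter labels are hypothetical stand-ins for the generator-plus-encoder versus the newly added layers:

```python
import jax.numpy as jnp
import optax

def warmup_constant(peak_lr, warmup_steps=1_000):
    # optax.linear_schedule holds end_value after transition_steps,
    # i.e. linear warmup followed by a constant learning rate.
    return optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                                 transition_steps=warmup_steps)

# Hypothetical parameter tree and labels, only to illustrate two-LR routing.
params = {
    'backbone': {'w': jnp.zeros((8, 8))},   # image generator + DINO encoder
    'adapters': {'w': jnp.zeros((8, 8))},   # MLPs / linear projection layers
}
labels = {'backbone': {'w': 'backbone'}, 'adapters': {'w': 'adapters'}}

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),  # reported gradient clipping of 1.0
    optax.multi_transform(
        {'backbone': optax.adam(learning_rate=warmup_constant(5e-5)),
         'adapters': optax.adam(learning_rate=warmup_constant(1e-3))},
        labels,
    ),
)
opt_state = optimizer.init(params)
```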
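The classifier-free guidance recipe (drop conditioning tokens with 10% probability during training, then guide DDIM sampling with scale 2.0) can be sketched as below; drop_tokens and cfg_noise_pred are hypothetical helpers, assuming conditioning tokens of shape (batch, num_tokens, dim):

```python
import jax
import jax.numpy as jnp

def drop_tokens(rng, tokens, p_drop=0.1):
    # Zero out each sample's conditioning (appearance/pose tokens) with
    # probability p_drop, so the model also learns the unconditional case.
    keep = jax.random.bernoulli(rng, 1.0 - p_drop, (tokens.shape[0], 1, 1))
    return tokens * keep.astype(tokens.dtype)

def cfg_noise_pred(eps_cond, eps_uncond, scale=2.0):
    # Standard CFG combination [43]: extrapolate from the unconditional
    # prediction toward the conditional one; scale 2.0 per the report.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```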