Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models
Authors: Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew Hudson, Igor Gilitschenski, Yusuf Aytar, Sjoerd van Steenkiste, Kelsey Allen, Thomas Kipf
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct extensive experiments to answer the following questions: (i) Can Neural Assets enable accurate 3D object editing? (ii) What practical applications does our method support on real-world scenes? (iii) What is the impact of each design choice in our framework? We report common metrics to measure the quality of the edited image: PSNR, SSIM [104], LPIPS [117], and FID [42]. |
| Researcher Affiliation | Collaboration | ¹Google DeepMind ²Google Research ³University of Toronto ⁴Vector Institute ⁵UCL |
| Pseudocode | No | The paper describes the proposed method using textual explanations and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: 'Additional details and video results are available at our project page.' and 'Project page: neural-assets.github.io'. However, the project page indicates 'Code coming soon!', so the source code is not yet publicly available. |
| Open Datasets | Yes | We select four datasets with object or camera motion, which span different levels of complexity. OBJect [67]... MOVi-E [36]... Objectron [1]... Waymo Open [97]... This dataset is under the Open Data Commons Attribution License (ODC-By). The full data generation pipeline is under the Apache 2.0 license. Objectron is licensed under the Computational Use of Data Agreement 1.0 (C-UDA-1.0). Waymo Open is licensed under the Waymo Dataset License Agreement for Non-Commercial Use (August 2019). |
| Dataset Splits | No | The paper describes training and testing procedures and metrics, but it does not explicitly provide details about specific validation dataset splits or how validation data was used. |
| Hardware Specification | Yes | We train all model components jointly using the Adam optimizer [53] with a batch size of 1536 on 256 TPUv5 chips (16GB memory each). |
| Software Dependencies | No | We implement the entire Neural Assets framework in JAX [10] using the Flax [40] neural network library. However, specific version numbers for these libraries are not provided. |
| Experiment Setup | Yes | For all experiments, we resize images to 256×256. DINO self-supervised pre-trained ViT-B/8 [13] is adopted as the visual encoder Enc, and jointly fine-tuned with the generator. All our models are trained using the Adam optimizer [53] with a batch size of 1536 on 256 TPUv5 chips (16GB memory each). We use a peak learning rate of 5×10⁻⁵ for the image generator and the visual encoder, and a larger learning rate of 1×10⁻³ for the remaining layers (MLPs and linear projection layers). Both learning rates are linearly warmed up in the first 1,000 steps and stay constant afterwards. A gradient clipping of 1.0 is applied to stabilize training. We train the model for 200k steps on OBJect and MOVi-E, which takes 24 hours, and 50k steps on Objectron and Waymo Open, which takes 6 hours. In order to apply classifier-free guidance (CFG) [43], we randomly drop the appearance and pose tokens (i.e., setting them to zeros) with a probability of 10%. We run the DDIM sampler [95] for 50 steps to generate images. We found the model works well with a CFG scale between 1.5 and 4, and thus choose to use 2.0 in all the experiments. (Illustrative sketches of this setup follow the table.) |
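
The training configuration quoted in the Experiment Setup row (two learning-rate groups, linear warmup to a constant rate, and global-norm gradient clipping) maps naturally onto Optax, the standard optimizer library for JAX/Flax. The sketch below is our own illustration of that configuration, not the authors' code, which is unreleased; the top-level parameter names and the grouping predicate are assumptions.

```python
# A minimal Optax sketch of the reported schedule: peak LRs of 5e-5 (image
# generator + visual encoder) and 1e-3 (MLPs / projection layers), a
# 1,000-step linear warmup followed by a constant rate, and global-norm
# gradient clipping at 1.0. Parameter-tree layout is hypothetical.
import jax
import optax

WARMUP_STEPS = 1_000
BACKBONE_LR = 5e-5   # image generator + visual encoder
HEAD_LR = 1e-3       # MLPs and linear projection layers

def warmup_then_constant(peak_lr):
    """Linearly ramp from 0 to peak_lr over WARMUP_STEPS, then hold constant."""
    return optax.join_schedules(
        schedules=[
            optax.linear_schedule(0.0, peak_lr, WARMUP_STEPS),
            optax.constant_schedule(peak_lr),
        ],
        boundaries=[WARMUP_STEPS],
    )

def param_labels(params):
    # Route each top-level parameter subtree to an optimizer group. Assumes a
    # params dict like {"generator": ..., "visual_encoder": ..., "mlps": ...}.
    return {
        name: jax.tree_util.tree_map(
            lambda _: "backbone" if name in ("generator", "visual_encoder") else "heads",
            subtree,
        )
        for name, subtree in params.items()
    }

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),  # "A gradient clipping of 1.0 is applied"
    optax.multi_transform(
        {
            "backbone": optax.adam(warmup_then_constant(BACKBONE_LR)),
            "heads": optax.adam(warmup_then_constant(HEAD_LR)),
        },
        param_labels,
    ),
)
```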
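
Likewise, the classifier-free guidance recipe in the same row, zeroing the appearance/pose tokens with probability 10% during training and sampling with a guidance scale of 2.0, can be sketched as follows. `denoise_fn` and the token shape are hypothetical stand-ins for the unreleased model; the guided prediction would then be fed to a standard 50-step DDIM sampler [95].

```python
# Sketch of conditioning dropout for CFG training and the guided noise
# prediction used at sampling time. Shapes and `denoise_fn` are assumptions.
import jax
import jax.numpy as jnp

DROP_PROB = 0.1   # drop appearance/pose tokens 10% of the time in training
CFG_SCALE = 2.0   # chosen scale; the paper reports 1.5-4 works well

def drop_conditioning(rng, tokens):
    """Zero out conditioning tokens per sample with probability DROP_PROB.

    tokens: (batch, num_tokens, dim) appearance/pose tokens.
    """
    keep = jax.random.bernoulli(rng, 1.0 - DROP_PROB, (tokens.shape[0],))
    return tokens * keep[:, None, None].astype(tokens.dtype)

def guided_eps(denoise_fn, x_t, t, cond_tokens):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    eps_cond = denoise_fn(x_t, t, cond_tokens)
    eps_uncond = denoise_fn(x_t, t, jnp.zeros_like(cond_tokens))
    return eps_uncond + CFG_SCALE * (eps_cond - eps_uncond)
```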