NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion
Authors: Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M. Susskind, Christian Theobalt, Lingjie Liu, Ravi Ramamoorthi
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on three challenging benchmarks. Our results indicate that the proposed NerfDiff significantly outperforms all the existing baselines, achieving high-quality generation with multi-view consistency. See supplementary materials for video results. We summarize the main contributions as follows: We develop a novel framework, NerfDiff, which jointly learns an image-conditioned NeRF and a CDM, and at test time finetunes the learned NeRF using a multi-view consistent diffusion process (§4.3, §4.4). We introduce an efficient image-conditioned NeRF representation based on camera-aligned triplanes, which is the core component enabling efficient rendering and finetuning from the CDM (§4.1). We propose a 3D-aware CDM, which integrates volume rendering into 2D diffusion models, facilitating generalization to novel views (§4.2). (A minimal triplane-lookup sketch appears below the table.) |
| Researcher Affiliation | Collaboration | 1Apple 2University of California, San Diego 3Max Planck Institute for Informatics, Germany 4University of Pennsylvania. Correspondence to: Jiatao Gu <jiatao@apple.com>, Alex Trevithick <atrevithick@ucsd.edu>, Kai-En Lin <k2lin@ucsd.edu>, Lingjie Liu <lingjie.liu@seas.upenn.edu>. |
| Pseudocode | Yes | Algorithm 1 (Finetuning with NeRF-guided distillation). Input: NeRF (MLP f_θ, triplanes W), CDM ε_φ, input I_s, γ, N, B. Initialize I_π = I_π^{θ,W} and ε_π = ε for all π ∈ Π, with ε ~ N(0, 1). For t = t_max … t_min: for each π ∈ Π, set Z_π = α_t I_π + σ_t ε_π, ε_π = ε_φ(Z_π, I_s) + γ(σ_t/α_t)(I_π − I_π^{θ,W}), and I_π = (Z_π − σ_t ε_π)/α_t; then for n = 1 … N, sample B rays r from views π ∈ Π and update θ, W with the gradient ∇_{θ,W} Σ_{π,r} ‖I_π^{θ,W}(r) − I_π(r)‖²_2 over the sampled rays. Return θ, W. (A code sketch of this loop appears below the table.) |
| Open Source Code | No | The paper provides a supplementary website link (https://jiataogu.me/nerfdiff) for "video results", but it does not explicitly state that the source code for their method is available at this link or elsewhere. The text only refers to third-party code as "publicly available source code" when discussing baselines. |
| Open Datasets | Yes | We evaluate NerfDiff on three benchmarks: SRN-ShapeNet (Sitzmann et al., 2019a), Amazon-Berkeley Objects (ABO, Collins et al., 2022), and Clevr3D (Stelzner et al., 2021), for testing novel view synthesis under single-category, category-agnostic, and multi-object settings, respectively. SRN-ShapeNet includes two categories: Cars and Chairs. Dataset details are given in Appendix A. We use the data hosted by pixelNeRF (Yu et al., 2021), which can be downloaded from GitHub (https://github.com/sxyu/pixel-nerf). We also consider the ABO dataset (Collins et al., 2022) from https://amazon-berkeley-objects.s3.amazonaws.com/index.html under the title ABO 3D Renderings. We consider the Clevr3D dataset provided in (Stelzner et al., 2021) for multi-object/scene-level learning, which can be downloaded from GitHub (https://github.com/stelzner/obsurf). |
| Dataset Splits | Yes | The chairs dataset consists of 6591 scenes, and the cars dataset has 3514 scenes, both with a predefined train/val/test split. The dataset thus consists of 6743 training scenes, 396 validation scenes, and 794 testing scenes. We define a custom split in which there are 70000 training scenes and 1000 held-out testing scenes. |
| Hardware Specification | Yes | We train all models with a batch size of 32 images for 500K iterations on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions optimizers such as AdamW and Adam, but does not provide specific version numbers for these or for other key software dependencies such as the programming language, deep learning framework, or operating system. |
| Experiment Setup | Yes | For all datasets, we learn NerfDiff based on the U-Net architecture adopted from ADM (Dhariwal & Nichol, 2021) with two sets of configurations (-B: base, 400M parameters; -L: large, 1B parameters). More specifically, we set the model dimension d = 192 with 2 residual blocks per resolution for the base architecture and d = 256 with 3 residual blocks per resolution for the large architecture. All other hyperparameters follow the default ADM settings. We set λ_IC = λ_DM = 1, which means that we add the losses of the two modules without re-weighting. All models are trained using AdamW (Loshchilov & Hutter, 2017) with a learning rate of 2e-5 and an EMA decay rate of 0.9999. We train all models with a batch size of 32 images for 500K iterations on 8 A100 GPUs. Training takes 3-4 days to finish for the base models. For the multi-view diffusion process, we run 64 DDIM (Song et al., 2020) steps with the CDM for each view. At every diffusion step, we update the NeRF parameters for N = 64 steps with a batch size of B = 4096 rays. We use the Adam optimizer (Kingma & Ba, 2015) and set the learning rates to 1e-4 for the NeRF MLPs and 5e-2 for the triplane features, respectively. (A training-configuration sketch appears below the table.) |
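To make the camera-aligned triplane representation cited in the Research Type row more concrete, below is a minimal, generic triplane feature lookup in PyTorch. Only the projection onto three axis-aligned planes and the bilinear interpolation are sketched; the paper's variant additionally aligns the planes with the input camera frustum and predicts the feature maps from the input image, which is not reproduced here. The function name and tensor layout are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def sample_triplane_features(planes, points):
    """Generic triplane lookup: project 3D points onto the XY/XZ/YZ planes,
    bilinearly sample each feature map, and sum the three features.

    planes: (3, C, H, W) feature maps for the three planes (assumed layout).
    points: (N, 3) coordinates normalized to [-1, 1].
    Returns: (N, C) per-point features.
    """
    # 2D coordinates of each point on the three planes.
    coords = torch.stack([points[:, [0, 1]],   # XY plane
                          points[:, [0, 2]],   # XZ plane
                          points[:, [1, 2]]])  # YZ plane -> (3, N, 2)
    grid = coords.unsqueeze(1)                 # (3, 1, N, 2) for grid_sample
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=True)
    # (3, C, 1, N) -> (3, C, N) -> sum over planes -> (N, C)
    return feats.squeeze(2).sum(dim=0).t()
```

The sampled per-point feature would then be decoded by the NeRF MLP f_θ into density and color.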
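The Pseudocode row's Algorithm 1 can be sketched as the following PyTorch-style loop. The interfaces `nerf.render`, `nerf.render_rays`, `nerf.mlp_parameters`, `nerf.triplane_parameters`, `cdm.schedule`, and `cdm.eps` are assumed placeholders rather than the paper's API; ray sampling is simplified to drawing all B rays of a gradient step from a single view; the guidance scale γ is left as a free hyperparameter, while N = 64, B = 4096, and the learning rates 1e-4 / 5e-2 follow the reported setup.

```python
import torch

def ngd_finetune(nerf, cdm, input_view, poses, timesteps,
                 gamma, n_updates=64, rays_per_step=4096):
    """NeRF-guided distillation: denoise virtual views with the CDM, then
    distill the denoised images back into the NeRF parameters (theta, W)."""
    with torch.no_grad():
        images = [nerf.render(p) for p in poses]        # I_pi <- I_pi^{theta,W}, shape (H, W, 3)
    noises = [torch.randn_like(im) for im in images]    # eps_pi ~ N(0, 1)

    opt = torch.optim.Adam([
        {"params": nerf.mlp_parameters(), "lr": 1e-4},       # NeRF MLP f_theta
        {"params": nerf.triplane_parameters(), "lr": 5e-2},  # triplane features W
    ])

    for t in timesteps:                        # t_max ... t_min (64 DDIM steps in the paper)
        alpha_t, sigma_t = cdm.schedule(t)     # assumed accessor for alpha_t, sigma_t
        with torch.no_grad():
            for i, p in enumerate(poses):
                z = alpha_t * images[i] + sigma_t * noises[i]      # re-noise current estimate
                guidance = gamma * (sigma_t / alpha_t) * (images[i] - nerf.render(p))
                noises[i] = cdm.eps(z, t, input_view) + guidance   # NeRF-guided noise prediction
                images[i] = (z - sigma_t * noises[i]) / alpha_t    # denoised target I_pi

        for _ in range(n_updates):             # N gradient steps per diffusion step
            i = torch.randint(len(poses), (1,)).item()
            h, w, _ = images[i].shape
            idx = torch.randint(h * w, (rays_per_step,))
            pred = nerf.render_rays(poses[i], idx)                 # I_pi^{theta,W}(r)
            target = images[i].reshape(-1, 3)[idx]                 # I_pi(r)
            loss = (pred - target).pow(2).sum(-1).mean()
            opt.zero_grad(); loss.backward(); opt.step()

    return nerf
```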
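Finally, the training-stage hyperparameters quoted in the Experiment Setup row can be collected into a short configuration sketch. The loss callables `nerf.reconstruction_loss` and `cdm.diffusion_loss` are hypothetical placeholders for the image-conditioned NeRF loss and the CDM loss; only the optimizer choice, learning rate, loss weights, and EMA decay are taken from the reported numbers.

```python
import torch

LAMBDA_IC = 1.0    # weight of the image-conditioned NeRF loss
LAMBDA_DM = 1.0    # weight of the conditional diffusion (CDM) loss
EMA_DECAY = 0.9999

def build_optimizer(nerf, cdm, lr=2e-5):
    # AdamW over both modules, as reported (learning rate 2e-5).
    return torch.optim.AdamW(list(nerf.parameters()) + list(cdm.parameters()), lr=lr)

def training_step(nerf, cdm, optimizer, ema_params, batch):
    """One joint step: weighted sum of the two losses, followed by an EMA update."""
    loss = LAMBDA_IC * nerf.reconstruction_loss(batch) + LAMBDA_DM * cdm.diffusion_loss(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Exponential moving average of all weights, used for evaluation.
    with torch.no_grad():
        params = list(nerf.parameters()) + list(cdm.parameters())
        for p_ema, p in zip(ema_params, params):
            p_ema.mul_(EMA_DECAY).add_(p, alpha=1.0 - EMA_DECAY)
    return loss.item()
```

Here `ema_params` would be a detached copy of the model parameters updated after every optimizer step; the reported schedule is 500K steps at a batch size of 32 images on 8 A100 GPUs.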