DreamFusion: Text-to-3D using 2D Diffusion

Authors: Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the ability of DreamFusion to generate coherent 3D scenes from a variety of text prompts. We compare to existing zero-shot text-to-3D generative models, identify the key components of our model that enable accurate 3D geometry, and explore the qualitative capabilities of DreamFusion such as the compositional generation shown in Figure 4. 3D reconstruction tasks are typically evaluated using reference-based metrics, which compare recovered geometry to some ground truth. The view-synthesis literature often uses PSNR to compare rendered views with a held-out photograph. These reference-based metrics are difficult to apply to zero-shot text-to-3D generation, as there is no true 3D scene corresponding to our text prompts. Following Jain et al. (2022), we evaluate the CLIP R-Precision (Park et al., 2021a), an automated metric for the consistency of rendered images with respect to the input caption. The R-Precision is the accuracy with which CLIP (Radford et al., 2021) retrieves the correct caption among a set of distractors given a rendering of the scene. (A minimal sketch of this metric appears after the table.)
Researcher Affiliation | Collaboration | 1 Google Research, 2 UC Berkeley; {pooleb, barron, bmild}@google.com, ajayj@berkeley.edu
Pseudocode | Yes | Figure 8: Pseudocode for Score Distillation Sampling with an application-specific generator that defines a differentiable mapping from parameters to images. (A runnable sketch of a single SDS step is included after the table.)
Open Source Code | No | While the Imagen diffusion model is not publicly available, other conditional diffusion models may produce similar results with the DreamFusion algorithm. To aid reproducibility, we have included a schematic overview of the algorithm in Figure 3, pseudocode for Score Distillation Sampling in Figure 8, hyperparameters in Appendix A.2, and additional evaluation setup details in Appendix A.3.
Open Datasets | Yes | We use the 153 prompts from the object-centric COCO validation subset of Dream Fields. We also measure CLIP R-Precision on textureless renders to evaluate geometry, since we found existing metrics do not capture the quality of the geometry, often yielding high values when texture is painted on flat geometry.
Dataset Splits | No | The paper states: "For each text prompt, we train a randomly initialized NeRF from scratch." It does not describe a traditional training/validation/test split for the DreamFusion model itself. The "validation subset" mentioned for COCO is used for evaluating the final generated models, acting as a test set, not a validation set during training.
Hardware Specification | Yes | Our 3D scenes are optimized on a TPUv4 machine with 4 chips.
Software Dependencies | No | The paper mentions software like "Imagen model", "mip-NeRF 360", "T5-XXL text embeddings", and "Distributed Shampoo optimizer", but it does not provide specific version numbers for any of these components or other general software dependencies (e.g., Python, PyTorch).
Experiment Setup | Yes | For each text prompt, we train a randomly initialized NeRF from scratch. Each iteration of DreamFusion optimization performs the following: (1) randomly sample a camera and light, (2) render an image of the NeRF from that camera and shade with the light, (3) compute gradients of the SDS loss with respect to the NeRF parameters, (4) update the NeRF parameters using an optimizer. ... We optimize for 15,000 iterations, which takes around 1.5 hours. ... Our model is built upon mip-NeRF 360 (Barron et al., 2022)... The underlying sinusoidal positional encoding function uses frequencies 2^0, 2^1, ..., 2^(L−1), where we set L = 8. ... Our NeRF MLP consists of 5 ResNet blocks (He et al., 2016) with 128 hidden units, Swish/SiLU activation (Hendrycks & Gimpel, 2016), and layer normalization (Ba et al., 2016) between blocks. ... We use Distributed Shampoo (Anil et al., 2020) with β1 = 0.9, β2 = 0.9, exponent override = 2, block size = 128, graft type = SQRT_N, ε = 10^−6, and a linear warmup of the learning rate over 3000 steps from 10^−9 to 10^−4, followed by cosine decay down to 10^−6. (A sketch of this learning-rate schedule appears after the table.)
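
The CLIP R-Precision metric referenced in the Research Type row reduces to a retrieval accuracy. Below is a minimal sketch, assuming image and caption embeddings have already been produced by a CLIP encoder; the random arrays here are placeholders, not real embeddings.

```python
# Sketch of CLIP R-Precision: for each rendered image, retrieve the most
# similar caption among all candidate captions and count how often the
# render's own caption is ranked first.
import numpy as np

rng = np.random.default_rng(0)
n = 153                                   # e.g. one render per prompt
image_emb = rng.normal(size=(n, 512))     # placeholder CLIP image embeddings
text_emb = rng.normal(size=(n, 512))      # placeholder CLIP text embeddings

# Cosine similarity between every render and every caption.
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
sim = image_emb @ text_emb.T              # (n_images, n_captions)

# R-Precision: fraction of renders whose top-1 retrieved caption is the true one.
r_precision = np.mean(sim.argmax(axis=1) == np.arange(n))
print(f"CLIP R-Precision: {r_precision:.3f}")
```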
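Figure 8 of the paper gives pseudocode for Score Distillation Sampling; the sketch below illustrates one SDS step in PyTorch under simplifying assumptions. The render and eps_hat functions are hypothetical stand-ins for the NeRF renderer and the (unavailable) pretrained text-conditioned denoiser, and the noise schedule and weighting w(t) are toy choices for illustration only.

```python
import torch

torch.manual_seed(0)

# Hypothetical differentiable "generator": parameters -> image.
# In DreamFusion this is NeRF rendering from a randomly sampled camera.
params = torch.randn(3, 64, 64, requires_grad=True)

def render(p):
    return torch.tanh(p)  # placeholder for volumetric rendering + shading

# Hypothetical frozen denoiser eps_hat(x_t, t); a stand-in for the pretrained
# diffusion model (a real one also conditions on the text prompt y).
def eps_hat(x_t, t):
    return 0.9 * x_t

optimizer = torch.optim.SGD([params], lr=1e-2)

# One SDS iteration: render, add noise, query the denoiser, and push
# w(t) * (eps_hat - eps) back through the renderer into the parameters.
x = render(params)                         # rendered image
t = torch.rand(())                         # random diffusion timestep in [0, 1)
alpha, sigma = torch.cos(t), torch.sin(t)  # toy noise schedule for illustration
eps = torch.randn_like(x)
x_t = alpha * x + sigma * eps              # noised render
with torch.no_grad():                      # no gradients through the denoiser
    grad = eps_hat(x_t, t) - eps           # SDS gradient with w(t) = 1
optimizer.zero_grad()
x.backward(gradient=grad)                  # vector-Jacobian product through render()
optimizer.step()                           # parameter update
```

The key design choice SDS makes is that the denoiser is never differentiated through: the gradient of the diffusion loss is approximated by w(t)(eps_hat − eps) and injected directly into the renderer's backward pass.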
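The warmup-plus-cosine learning-rate schedule quoted in the Experiment Setup row is fully specified, so it can be sketched directly; how it is wired into Distributed Shampoo is not shown here.

```python
# Linear warmup from 1e-9 to 1e-4 over the first 3000 steps, then cosine
# decay down to 1e-6 over the remaining steps of the 15,000-iteration run.
import math

def learning_rate(step, warmup=3000, total=15000,
                  lr_init=1e-9, lr_peak=1e-4, lr_final=1e-6):
    if step < warmup:
        frac = step / warmup
        return lr_init + frac * (lr_peak - lr_init)
    frac = (step - warmup) / (total - warmup)
    return lr_final + 0.5 * (lr_peak - lr_final) * (1 + math.cos(math.pi * frac))

print(learning_rate(0), learning_rate(3000), learning_rate(15000))
# ~1e-9 at step 0, 1e-4 at the end of warmup, 1e-6 at the final step
```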