Score Distillation via Reparametrized DDIM

Authors: Artem Lukoianov, Haitz Sáez de Ocáriz Borde, Kristjan Greenewald, Vitor Guizilini, Timur Bagautdinov, Vincent Sitzmann, Justin M. Solomon

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Experimentally, our method achieves better or similar 3D generation quality compared to other state-of-the-art Score Distillation methods, all without training additional neural networks or multi-view supervision, while providing useful insights into the relationship between 2D and 3D asset generation with diffusion models. Figure 2: Examples of 3D objects generated with our method. (Section 6.1, 3D generation) We demonstrate the high-fidelity 3D shapes generated with our algorithm in fig. 2 and provide more examples of 360° views in appendix H.2. A more detailed qualitative and quantitative comparison of our method with ISM [29] is provided in appendix B. Additionally, we report the diversity of the generated shapes in appendix C. Qualitative comparisons: Figure 7 compares 3D generation quality with the results reported in past work, using a similar protocol to [11, 12]. Quantitative comparison: We follow [5, 16, 25] to quantitatively evaluate generation quality. Table 1 provides CLIP scores [38] to measure prompt-generation alignment, computed with torchmetrics [39] and the ViT-B/32 model [40]. We also report ImageReward (IR) [41] to imitate possible human preference. We include CLIP Image Quality Assessment (IQA) [42] to measure quality (Good photo vs. Bad photo), sharpness (Sharp photo vs. Blurry photo), and realism (Real photo vs. Abstract photo). For each method, we test 43 prompts with 50 views. For multi-stage baselines, we run only the first stage for a fair comparison. We report the percentage of generations that run out of memory or generate an empty volume as diverged (Div. in the table), as well as mean run time and VRAM usage. For VRAM, we average the maximum GPU memory usage across runs. As many baselines are not open-source, we use their implementations in threestudio [43]. SDI outperforms SDS and matches or outperforms the quality of state-of-the-art methods, offering a simple fix to SDS without additional supervision or multi-stage training. (Section 6.2, Ablations) Proposed improvements: Figure 8 ablates the changes we implement on top of SDS. A rough sketch of the CLIP-score computation used in this protocol appears after the table.
Researcher Affiliation Collaboration Artem Lukoianov (1), Haitz Sáez de Ocáriz Borde (2), Kristjan Greenewald (3), Vitor Campagnolo Guizilini (4), Timur Bagautdinov (5), Vincent Sitzmann (1), Justin Solomon (1). Affiliations: (1) Massachusetts Institute of Technology; (2) University of Oxford; (3) MIT-IBM Watson AI Lab, IBM Research; (4) Toyota Research Institute; (5) Meta Reality Labs Research.
Pseudocode Yes Figure 12: Comparison of the original SDS algorithm and our proposed changes.
Algorithm 1, DreamFusion (SDS). Input: ψ ∈ R^N, parametrized 3D shape; C, set of cameras around the 3D shape; y, text prompt; g: R^N × C → R^{n×n}, differentiable renderer; ε_θ^(t): R^{n×n} → R^{n×n}, trained diffusion model. Output: 3D shape ψ of y. procedure DREAMFUSION(y): for i in range(n_iters): t ← Uniform(0, 1); c ← Uniform(C); ε ← Normal(0, I); x_t ← √α(t) · g(ψ, c) + √(1 − α(t)) · ε; ∇_ψ L_SDS = σ(t) [ε_θ^(t)(x_t, y) − ε] ∂g/∂ψ; backpropagate ∇_ψ L_SDS; SGD update on ψ.
Algorithm 2, Ours (SDI). Input and Output as in Algorithm 1. procedure OURS(y): for i in range(n_iters): t ← 1 − i/n_iters; c ← Uniform(C); ε ← κ_{t+τ}^y(g(ψ, c)); x_t ← √α(t) · g(ψ, c) + √(1 − α(t)) · ε; ∇_ψ L_SDS = σ(t) [ε_θ^(t)(x_t, y) − ε] ∂g/∂ψ; backpropagate ∇_ψ L_SDS; SGD update on ψ.
A minimal PyTorch-style sketch of the SDI step appears after the table.
Open Source Code Yes Additionally, we provide the code of our algorithm in the supplementary material and plan to make it public upon acceptance.
Open Datasets Yes We use Stable Diffusion 2.1 [1] as the diffusion model.
Dataset Splits No The paper uses a pre-trained model (Stable Diffusion 2.1) and evaluates on a set of prompts and rendered views; it does not define train/validation/test splits, since no dataset is partitioned for training and evaluation in its own experimental setup.
Hardware Specification Yes We use NVIDIA A6000 GPUs and run each generation for 10k steps with a learning rate of 10^-2, which takes approximately 2 wall-clock hours per shape generation.
Software Dependencies Yes Our implementation uses threestudio [43] on top of SDS [5]. We use Stable Diffusion 2.1 [1] as the diffusion model.
Experiment Setup Yes For the volumetric representation we use Instant NGP [8]. Instead of randomly sampling the time t as in SDS, we maintain a global parameter t that linearly decays from 1 to 0.2 (lower time values do not have a significant contribution). Next, for each step we render a 512×512 random view and infer κ_{t+τ} by running DDIM inversion for int(10t) steps, i.e., we use 10 steps of DDIM inversion for t = 1 and linearly decrease it for smaller t. We use NVIDIA A6000 GPUs and run each generation for 10k steps with a learning rate of 10^-2. A short sketch of this timestep schedule appears after the table.
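
The Pseudocode row reconstructs Algorithms 1 and 2 above. As a minimal sketch, the core SDI change, replacing the random noise ε ~ N(0, I) with noise inferred by DDIM inversion, could look roughly like the step below. The helpers render, ddim_invert, eps_model, alpha_bar, and sigma are assumed placeholders standing in for the differentiable renderer g, the inversion operator κ_{t+τ}^y, the frozen denoiser ε_θ^(t), and the noise schedule; this is not the authors' implementation.

    import torch

    def sdi_step(psi, optimizer, cam, prompt, t):
        # One SDI update (a sketch of Algorithm 2); render, ddim_invert, eps_model,
        # alpha_bar, and sigma are hypothetical placeholders, not the authors' API.
        x0 = render(psi, cam)                               # differentiable render g(psi, c)
        with torch.no_grad():
            # SDI: infer noise by DDIM inversion instead of sampling eps ~ N(0, I)
            eps = ddim_invert(x0, prompt, t, n_steps=int(10 * t))
            a = alpha_bar(t)                                # noise-schedule coefficient at time t
            x_t = a ** 0.5 * x0 + (1.0 - a) ** 0.5 * eps    # noised render at time t
            noise_pred = eps_model(x_t, t, prompt)          # frozen diffusion model prediction
            grad = sigma(t) * (noise_pred - eps)            # SDS/SDI gradient direction
        # Surrogate loss whose gradient w.r.t. x0 equals grad, so backprop carries it
        # through the renderer into the shape parameters psi.
        loss = (grad * x0).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()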
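
The Experiment Setup row describes the annealed timestep and the int(10t) rule for the number of DDIM-inversion steps. A small illustrative sketch of that schedule, assuming the reported constants (10k optimization steps, t decaying from 1.0 to 0.2), could be:

    N_ITERS = 10_000  # reported number of optimization steps per shape

    def timestep(i: int, n_iters: int = N_ITERS, t_max: float = 1.0, t_min: float = 0.2) -> float:
        # Linearly anneal t from t_max down to t_min over the optimization.
        return t_max - (t_max - t_min) * i / max(n_iters - 1, 1)

    def ddim_inversion_steps(t: float) -> int:
        # int(10 * t): 10 inversion steps at t = 1, fewer as t anneals toward 0.2.
        return max(int(10 * t), 1)

At the first iteration this yields t = 1.0 and 10 inversion steps; by the end t approaches 0.2 and only one or two inversion steps are used.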
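
The quantitative protocol in the Research Type row reports CLIP scores computed with torchmetrics and the ViT-B/32 model over 43 prompts with 50 views each. A rough sketch of the per-shape CLIP-score computation, not the authors' evaluation script, might look like this (the view tensors are assumed to come from the renderer):

    import torch
    from torchmetrics.multimodal.clip_score import CLIPScore

    # ViT-B/32 backbone, as cited for the CLIP scores in Table 1.
    clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch32")

    @torch.no_grad()
    def clip_score_for_shape(views: torch.Tensor, prompt: str) -> float:
        # views: uint8 tensor of shape (N, 3, H, W) holding N rendered views of one shape.
        return clip_metric(views, [prompt] * views.shape[0]).item()

Averaging this over the 43 prompts and 50 views per shape would mirror the reported protocol; ImageReward and CLIP-IQA would be computed analogously with their respective packages.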