Score Distillation via Reparametrized DDIM
Authors: Artem Lukoianov, Haitz Sáez de Ocáriz Borde, Kristjan Greenewald, Vitor Guizilini, Timur Bagautdinov, Vincent Sitzmann, Justin M. Solomon
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, our method achieves better or similar 3D generation quality compared to other state-of-the-art Score Distillation methods, all without training additional neural networks or multi-view supervision, while providing useful insights into the relationship between 2D and 3D asset generation with diffusion models. (Figure 2: Examples of 3D objects generated with our method.) 6 Experiments, 6.1 3D Generation: We demonstrate the high-fidelity 3D shapes generated with our algorithm in fig. 2 and provide more examples of 360° views in appendix H.2. A more detailed qualitative and quantitative comparison of our method with ISM [29] is provided in appendix B. Additionally, we report the diversity of the generated shapes in appendix C. Qualitative comparisons: Figure 7 compares 3D generation quality with the results reported in past work, using a similar protocol to [11, 12]. Quantitative comparison: We follow [5, 16, 25] to quantitatively evaluate generation quality. Table 1 provides CLIP scores [38] to measure prompt-generation alignment, computed with torchmetrics [39] and the ViT-B/32 model [40]. We also report ImageReward (IR) [41] to imitate possible human preference. We include CLIP Image Quality Assessment (IQA) [42] to measure quality ("Good photo" vs. "Bad photo"), sharpness ("Sharp photo" vs. "Blurry photo"), and realism ("Real photo" vs. "Abstract photo"). For each method, we test 43 prompts with 50 views. For multi-stage baselines, we run only the first stage for fair comparison. We report the percentage of generations that run out of memory or generate an empty volume as diverged (Div. in the table), as well as mean run time and VRAM usage. For VRAM, we average the maximum GPU memory usage across runs. As many baselines are not open-source, we use their implementations in threestudio [43]. SDI outperforms SDS and matches or outperforms the quality of state-of-the-art methods, offering a simple fix to SDS without additional supervision or multi-stage training. 6.2 Ablations, Proposed improvements: Figure 8 ablates the changes we implement on top of SDS. (A torchmetrics sketch of this CLIP-score evaluation is given after the table.) |
| Researcher Affiliation | Collaboration | Artem Lukoianov¹, Haitz Sáez de Ocáriz Borde², Kristjan Greenewald³, Vitor Campagnolo Guizilini⁴, Timur Bagautdinov⁵, Vincent Sitzmann¹, Justin Solomon¹. ¹Massachusetts Institute of Technology; ²University of Oxford; ³MIT-IBM Watson AI Lab, IBM Research; ⁴Toyota Research Institute; ⁵Meta Reality Labs Research |
| Pseudocode | Yes | Figure 12: Comparison of the original SDS algorithm and our proposed changes. Algorithm 1, DreamFusion (SDS). Input: ψ ∈ R^N parametrized 3D shape; C set of cameras around the 3D shape; y text prompt; g: R^N × C → R^{n×n} differentiable renderer; ε_θ^(t): R^{n×n} → R^{n×n} trained diffusion model. Output: 3D shape ψ corresponding to y. procedure DREAMFUSION(y): for i in range(n_iters): t ← Uniform(0, 1); c ← Uniform(C); ε ← Normal(0, I); x_t ← √α(t)·g(ψ, c) + √(1 − α(t))·ε; ∇_ψ L_SDS = σ(t)·[ε_θ^(t)(x_t, y) − ε]·∂g/∂ψ; backpropagate ∇_ψ L_SDS; SGD update on ψ. Algorithm 2, Ours (SDI). Input and Output: as in Algorithm 1. procedure OURS(y): for i in range(n_iters): t ← 1 − i/n_iters; c ← Uniform(C); ε ← κ_{t+τ\|y}(g(ψ, c)); x_t ← √α(t)·g(ψ, c) + √(1 − α(t))·ε; ∇_ψ L_SDS = σ(t)·[ε_θ^(t)(x_t, y) − ε]·∂g/∂ψ; backpropagate ∇_ψ L_SDS; SGD update on ψ. (A hedged PyTorch sketch of these update loops is given after the table.) |
| Open Source Code | Yes | Additionally, we provide the code of our algorithm in the supplementary material and plan to make it public upon acceptance. Code is included in the supplementary material. |
| Open Datasets | Yes | We use Stable Diffusion 2.1 [1] as the diffusion model. |
| Dataset Splits | No | The paper uses a pre-trained model (Stable Diffusion 2.1) and evaluates on a fixed set of text prompts and rendered views; it does not specify train/validation/test dataset splits, since no dataset is partitioned for training and evaluation in its experimental setup. |
| Hardware Specification | Yes | We use NVIDIA A6000 GPUs and run each generation for 10k steps with a learning rate of 10⁻², which takes approximately 2 wall-clock hours per shape generation. |
| Software Dependencies | Yes | Our implementation uses threestudio [43] on top of SDS [5]. We use Stable Diffusion 2.1 [1] as the diffusion model. |
| Experiment Setup | Yes | For volumetric representation we use Instant NGP [8]. Instead of randomly sampling the time t as in SDS, we maintain a global parameter t that linearly decays from 1 to 0.2 (lower time values do not have a significant contribution). Next, for each step we render a 512×512 random view and infer κ_{t+τ} by running DDIM inversion for int(10t) steps, i.e., we use 10 steps of DDIM inversion for t = 1 and linearly decrease it for smaller t. We use NVIDIA A6000 GPUs and run each generation for 10k steps with a learning rate of 10⁻². (A sketch of this annealing schedule is given after the table.) |
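
The quantitative-evaluation quote above mentions CLIP scores computed with torchmetrics and the ViT-B/32 backbone. Below is a minimal, hedged sketch of how such a score can be computed over a batch of rendered views; the prompt, the random image tensor, and the view count are placeholders, not the authors' actual evaluation pipeline.

```python
# Minimal sketch of a CLIP-score evaluation with torchmetrics and ViT-B/32.
# The prompt and the random "rendered views" are placeholders; this illustrates
# the metric only, not the paper's evaluation script.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch32")

prompt = "a DSLR photo of a hamburger"  # placeholder prompt
views = torch.randint(0, 256, (50, 3, 512, 512), dtype=torch.uint8)  # 50 rendered views

metric.update(views, [prompt] * len(views))  # accumulate image-text alignment over all views
print(f"CLIP score: {metric.compute().item():.2f}")
```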
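
To make the pseudocode row concrete, here is a hedged PyTorch sketch of a single SDS/SDI optimization step. The helpers `render`, `eps_model`, `ddim_invert`, `alpha`, and `sigma` are hypothetical stand-ins for the renderer g, the diffusion model ε_θ^(t), the DDIM-inversion operator κ, and the noise schedule; this is a sketch of the update in Algorithms 1 and 2 under those assumptions, not the authors' threestudio implementation.

```python
# Hedged sketch of one score-distillation step (Algorithm 1: SDS, Algorithm 2: SDI).
# render, eps_model, ddim_invert, alpha, sigma are hypothetical placeholders.
import math
import random
import torch

def distillation_step(psi, cameras, y_emb, render, eps_model, ddim_invert,
                      alpha, sigma, i, n_iters, use_sdi=True, tau=0.05):
    if use_sdi:
        t = 1.0 - i / n_iters                  # SDI: linearly annealed time
    else:
        t = random.random()                    # SDS: time sampled uniformly in (0, 1)
    c = random.choice(cameras)                 # random camera around the shape
    x0 = render(psi, c)                        # differentiable rendering g(psi, c)

    if use_sdi:
        eps = ddim_invert(x0, y_emb, t + tau)  # SDI: noise recovered by DDIM inversion
    else:
        eps = torch.randn_like(x0)             # SDS: fresh Gaussian noise

    # Forward noising; x0 is detached so gradients reach psi only via x0.backward below.
    x_t = math.sqrt(alpha(t)) * x0.detach() + math.sqrt(1.0 - alpha(t)) * eps
    with torch.no_grad():                      # stop-gradient through the diffusion U-Net
        residual = eps_model(x_t, t, y_emb) - eps
    grad = sigma(t) * residual                 # sigma(t) * (eps_theta(x_t, y) - eps)
    x0.backward(gradient=grad)                 # chain rule through dg/dpsi onto psi
    # an outer SGD/Adam step on psi then consumes psi.grad
```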
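
The experiment-setup row describes a global time parameter that decays linearly from 1 to 0.2 over 10k steps, with int(10·t) DDIM-inversion steps per iteration. The snippet below simply restates that schedule; the constant and function names are illustrative, not the authors' configuration keys.

```python
# Sketch of the timestep and DDIM-inversion-step schedule described above:
# t decays linearly from 1.0 to 0.2 over 10k optimization steps, and the number
# of DDIM inversion steps is int(10 * t). Names are illustrative.
N_ITERS = 10_000
T_MAX, T_MIN = 1.0, 0.2

def timestep(i: int) -> float:
    """Linearly annealed diffusion time at optimization step i."""
    return T_MAX - (T_MAX - T_MIN) * i / N_ITERS

def ddim_inversion_steps(t: float) -> int:
    """10 inversion steps at t = 1, decreasing linearly with t, at least 1."""
    return max(1, int(10 * t))

for i in (0, 2500, 5000, 9999):
    t = timestep(i)
    print(f"step {i:>5}: t = {t:.2f}, inversion steps = {ddim_inversion_steps(t)}")
```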