Text-to-3D with Classifier Score Distillation
Authors: Xin Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Song-Hai Zhang, Xiaojuan Qi
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of CSD across a variety of text-to-3D tasks including shape generation, texture synthesis, and shape editing, achieving results superior to those of state-of-the-art methods. |
| Researcher Affiliation | Collaboration | Xin Yu¹, Yuan-Chen Guo²,³, Yangguang Li³, Ding Liang³, Song-Hai Zhang², Xiaojuan Qi¹ (¹The University of Hong Kong, ²Tsinghua University, ³VAST) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our project page is https://xinyu-andy.github.io/Classifier-Score-Distillation |
| Open Datasets | Yes | We generate 3D objects using 81 diverse text prompts from the website of DreamFusion. We select 20 diverse meshes from Objaverse (Deitke et al., 2022) for the texture generation task. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, and testing sets. |
| Hardware Specification | Yes | 1 hour on a single A800 GPU as opposed to 8 hours required by ProlificDreamer. |
| Software Dependencies | No | The paper mentions software components like DeepFloyd IF, Stable Diffusion 2.1, Stable Diffusion 1.5, ControlNet, Instant-NGP, and CLIP, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | For the NeRF generation, we utilize the DeepFloyd IF stage-I model (Stability AI, 2023), and for the mesh refinement, we use the Stable Diffusion 2.1 model (Rombach et al., 2022)... For the generation of NeRF in the first stage, we utilize sparsity loss and orientation loss (Verbin et al., 2022) to constrain the geometry. The weight of the orientation loss linearly increases from 1 to 100 over the first 5,000 steps, while the sparsity loss linearly decreases from 10 to 1 during the same period. After 5,000 iterations, we replace the RGB outputs with normal images with a probability of 0.5 for surface refinement. The weight of the negative classifier score is gradually annealed from 0 to 1 as the shape progressively takes form. To mitigate the Janus problem, we employ Perp-Neg (Armandpour et al., 2023) with a loss weight set to 3 and additionally constrain the camera view to focus only on the front and back views in the first 1,000 iterations. For prompts without clear directional objects, we omit Perp-Neg and use a larger batch size of 4. This extends training time by approximately 40 minutes. For the second stage of mesh refinement, we use a normal consistency loss with a weight of 10,000 and replace the RGB outputs with normal images with a probability of 0.5 for surface refinement. The weight of the negative classifier score is annealed from 1 to 0.5. For both stages, we use the negative prompts "oversaturated color, ugly, tiling, low quality, noisy". For texture generation, we optimize over 30,000 iterations. To achieve better alignment of geometry and texture, we use ControlNets as guidance. Specifically, we employ canny-control and depth-control, where canny-control applies Canny edge detection to the rendered normal image and uses the edge map as a condition. For both conditions, we start with a weight of 0.5 for the first 1,000 steps, then reduce it to 0.2. |
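The experiment-setup row above describes several scalar weight schedules: the orientation and sparsity loss weights in stage 1, the negative classifier score weight in both stages, and the ControlNet condition weights during texture generation. Below is a minimal Python sketch of those schedules; the function names, the `total_steps` default, and the linear shape of the negative-classifier ramp (the paper says only that it is "gradually annealed" between the stated endpoints) are assumptions for illustration, not the authors' implementation.

```python
def linear_anneal(step: int, start_step: int, end_step: int,
                  start_value: float, end_value: float) -> float:
    """Linearly interpolate a weight between start_step and end_step."""
    if step <= start_step:
        return start_value
    if step >= end_step:
        return end_value
    t = (step - start_step) / (end_step - start_step)
    return start_value + t * (end_value - start_value)


def stage1_weights(step: int, total_steps: int = 10_000) -> dict:
    """Stage-1 (NeRF) loss-weight schedules from the setup above.

    total_steps is an assumption; the paper only gives the 5,000-step
    window for the orientation/sparsity schedules.
    """
    return {
        "orientation": linear_anneal(step, 0, 5_000, 1.0, 100.0),  # 1 -> 100
        "sparsity": linear_anneal(step, 0, 5_000, 10.0, 1.0),      # 10 -> 1
        # "annealed from 0 to 1 as the shape progressively takes form";
        # a linear ramp over the full run is assumed here.
        "neg_classifier": linear_anneal(step, 0, total_steps, 0.0, 1.0),
    }


def controlnet_weight(step: int) -> float:
    """Texture stage: canny/depth condition weight,
    0.5 for the first 1,000 steps, then 0.2."""
    return 0.5 if step < 1_000 else 0.2
```

A step function is used for the ControlNet weights rather than a ramp because the setup states a flat 0.5 for the first 1,000 steps followed by 0.2; the stage-2 negative classifier weight (1 to 0.5) would follow the same `linear_anneal` pattern with its own endpoints.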