IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

Authors: Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive set of experiments, both qualitative and quantitative, that demonstrate the effectiveness of our method over baseline approaches. Our empirical findings show that the proposed method significantly improves the baseline models in terms of texture detail, geometry, and fidelity between text prompts and the resulting 3D objects.
Researcher Affiliation | Collaboration | Yiwen Chen*1,2, Chi Zhang*3, Xiaofeng Yang2, Zhongang Cai4, Gang Yu3, Lei Yang4, Guosheng Lin1,2 (1 S-Lab, Nanyang Technological University; 2 School of Computer Science and Engineering, Nanyang Technological University; 3 Tencent PCG, China; 4 SenseTime Research)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology.
Open Datasets | No | The paper describes generating its own dataset (D') by rendering the coarse 3D model and employing image-to-image pipelines (an illustrative sketch of this step appears after the table), but it does not state that this generated dataset is publicly available or provide access information for it.
Dataset Splits | No | The paper mentions training steps and convergence criteria but does not specify exact training, validation, and test dataset splits by percentage or sample counts.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions using ControlNet and Stable Diffusion models, but it does not specify software dependencies with version numbers (e.g., Python version, PyTorch version, specific library versions).
Experiment Setup | Yes | The learning rates for the NeRF and the discriminator are set to 1e-3 and 2e-3, respectively (a minimal sketch of this setup follows the table). For the baseline method, complex prompts such as avatars typically require approximately 20k steps to achieve complete convergence, while simpler prompts such as flowers converge around 15k steps. To ensure a fair comparison, the baseline method is trained for 25k steps (1.5 to 2.5 GPU hours) to guarantee full convergence of the 3D model. The proposed method resumes training from 5k to 15k steps of the baseline, varying with the complexity of the prompt.
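
The Open Datasets row notes that D' is built by refining rendered views of the coarse 3D model with image-to-image pipelines. The following is a minimal sketch of that step, not the authors' released code: it refines a single rendered view with a ControlNet-guided Stable Diffusion image-to-image pipeline from the diffusers library. The model IDs, the Canny-edge control signal, the file names, the prompt, and the strength value are illustrative assumptions.

```python
# Hypothetical sketch of generating one sample of D' from a coarse NeRF render.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

# Load a ControlNet (assumed Canny-conditioned) and a Stable Diffusion img2img pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# One view rendered from the coarse 3D model (file name is a placeholder).
coarse_view = Image.open("coarse_render_view_000.png").convert("RGB")

# Build the control image from the render's edges (assumed conditioning choice).
edges = cv2.Canny(np.array(coarse_view), 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Refine the view while keeping it anchored to the coarse geometry.
refined = pipe(
    prompt="a DSLR photo of a rose",   # example text prompt
    image=coarse_view,
    control_image=control_image,
    strength=0.6,                      # assumed img2img strength
    num_inference_steps=30,
).images[0]
refined.save("dataset_dprime_view_000.png")  # one synthesized sample of D'
```

Repeating this over many sampled camera poses would yield the explicitly synthesized view set D' that the paper uses to supervise the later refinement stage.

The Experiment Setup row quotes learning rates of 1e-3 for the NeRF and 2e-3 for the discriminator, a 25k-step baseline, and resumption from 5k-15k steps. Below is a minimal sketch of that optimization setup under stated assumptions: the optimizer type (Adam), the network definitions, and the specific resume step are placeholders, not details given in the paper.

```python
# Hypothetical training configuration reflecting only the hyperparameters quoted above.
import torch

# Placeholder networks standing in for the NeRF and the GAN discriminator.
nerf = torch.nn.Sequential(
    torch.nn.Linear(63, 256), torch.nn.ReLU(), torch.nn.Linear(256, 4)
)
discriminator = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 64 * 64, 256), torch.nn.LeakyReLU(0.2), torch.nn.Linear(256, 1)
)

# Learning rates from the paper; the choice of Adam is an assumption.
opt_nerf = torch.optim.Adam(nerf.parameters(), lr=1e-3)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=2e-3)

baseline_steps = 25_000   # baseline trained for 25k steps to guarantee convergence
resume_step = 10_000      # IT3D resumes from 5k-15k steps depending on prompt complexity
```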
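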