Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

Authors: Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, Seungryong Kim

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally demonstrate how our method notably enhances the 3D consistency of generated scenes compared to previous baselines, achieving state-of-the-art performance in geometric robustness and fidelity. The project page is available at https://ku-cvlab.github.io/3DFuse/. We extensively demonstrate the effectiveness of our framework with qualitative analyses and ablation studies. Moreover, we introduce an innovative metric for quantitatively assessing the 3D consistency of the generated scenes.
Researcher Affiliation | Collaboration | Junyoung Seo (1), Wooseok Jang (1), Min-Seop Kwak (1), Hyeonsu Kim (1), Jaehoon Ko (1), Junho Kim (2), Jin-Hwa Kim (2,3), Jiyoung Lee (2), Seungryong Kim (1); (1) Korea University, (2) NAVER AI Lab, (3) AI Institute of Seoul National University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The project page is available at https://ku-cvlab.github.io/3DFuse/
Open Datasets | Yes | We use the Co3D (Reizenstein et al., 2021) dataset to train our sparse depth injector. The dataset is comprised of 50 categories and 5,625 annotated point cloud videos. (A hedged sketch of projecting such point clouds into sparse depth maps appears after the table.)
Dataset Splits | No | The paper describes the training data for the sparse depth injector and the evaluation setup for the main model (user study), but does not provide explicit train/validation/test splits or percentages for the datasets used to train or evaluate the primary model.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using a "Stable Diffusion model based on LDM" and specifies a "Stable Diffusion v1.5 checkpoint". It also references libraries such as PyTorch3D (Ravi et al., 2020) and models such as ControlNet (Zhang & Agrawala, 2023), MiDaS (Ranftl et al., 2020), and Karlo (Donghoon Lee et al., 2022) based on unCLIP (Ramesh et al., 2022). While specific versions are given for some models/checkpoints, a comprehensive list of all required software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA) is not provided, so the software environment cannot be fully reproduced. (A hedged checkpoint-loading sketch appears after the table.)
Experiment Setup | Yes | We use scaling factors of 0.3 and 1.0 on the features passing through the LoRA layers and the sparse depth injector, respectively. For ease of training, we start from the weights (Zhang & Agrawala, 2023) trained on text-to-image pairs with MiDaS depth and fine-tune the model using the sparse depth maps synthesized from the Co3D dataset for 2 additional epochs. (A hedged sketch of the feature scaling appears after the table.)
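
For context on the sparse depth maps mentioned in the Open Datasets and Experiment Setup rows, the following is a minimal sketch of projecting a Co3D-style point cloud into a camera view with PyTorch3D's point rasterizer. It is not the authors' released code: the camera setup, image size, splat radius, and the function name render_sparse_depth are assumptions made for illustration.

```python
# Minimal sketch: render a sparse depth map from a Co3D-style point cloud with
# PyTorch3D. Camera parameters, image size, and point radius are assumptions
# for illustration; the paper does not specify these values.
import torch
from pytorch3d.structures import Pointclouds
from pytorch3d.renderer import (
    PerspectiveCameras,
    PointsRasterizationSettings,
    PointsRasterizer,
)

def render_sparse_depth(points_xyz: torch.Tensor,
                        R: torch.Tensor,
                        T: torch.Tensor,
                        image_size: int = 256) -> torch.Tensor:
    """Project an (N, 3) point cloud into a camera with rotation R (3, 3) and
    translation T (3,), returning an (image_size, image_size) depth map that is
    0 wherever no point projects."""
    cameras = PerspectiveCameras(R=R[None], T=T[None], device=points_xyz.device)
    raster_settings = PointsRasterizationSettings(
        image_size=image_size,
        radius=0.01,          # assumed splat radius; controls how sparse the depth map is
        points_per_pixel=1,   # keep only the nearest point per pixel
    )
    rasterizer = PointsRasterizer(cameras=cameras, raster_settings=raster_settings)
    fragments = rasterizer(Pointclouds(points=[points_xyz]))
    depth = fragments.zbuf[0, ..., 0]   # depth of the nearest point, -1 for empty pixels
    return depth.clamp(min=0.0)         # empty pixels become 0
```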
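Since the paper pins only the Stable Diffusion v1.5 checkpoint, the rest of the stack has to be fixed by whoever reproduces the work. Below is a minimal sketch of loading that checkpoint; the use of the Hugging Face diffusers library and the "runwayml/stable-diffusion-v1-5" model identifier are assumptions, as the paper does not say how the checkpoint is obtained or loaded.

```python
# Minimal sketch, assuming the Hugging Face diffusers library and the public
# "runwayml/stable-diffusion-v1-5" weights; the paper names the v1.5 checkpoint
# but does not specify the loading code or library versions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a photo of a corgi wearing a top hat").images[0]
image.save("sample.png")
```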
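The scaling factors quoted in the Experiment Setup row describe how strongly the LoRA-branch features and the sparse-depth-injector features contribute relative to the frozen base model. The sketch below illustrates that idea only; the module structure, LoRA rank, and the way the injector features are added are assumptions and not the authors' implementation.

```python
# Illustrative sketch of the scaling factors reported in the paper (0.3 for the
# LoRA branch, 1.0 for the sparse depth injector). Module structure and LoRA
# rank are assumptions; this is not the authors' implementation.
import torch
import torch.nn as nn

LORA_SCALE = 0.3            # scale on features passing through the LoRA layers (from the paper)
DEPTH_INJECTOR_SCALE = 1.0  # scale on features from the sparse depth injector (from the paper)

class ScaledLoRALinear(nn.Module):
    """A frozen base linear layer plus a low-rank residual branch whose output is scaled."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = LORA_SCALE):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # only the low-rank branch is trained
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)     # start as an identity-preserving residual
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

def add_depth_injector_features(unet_features: torch.Tensor,
                                injector_features: torch.Tensor,
                                scale: float = DEPTH_INJECTOR_SCALE) -> torch.Tensor:
    """Add ControlNet-style residual features from the sparse depth injector
    into the denoising U-Net features, with the scale reported in the paper."""
    return unet_features + scale * injector_features
```

With these values, the LoRA branch acts as a mild perturbation of the frozen weights, while the depth injector's features are passed through at full strength.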