Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer
Authors: Shuang Wu, Youtian Lin, Yifei Zeng, Feihu Zhang, Jingxi Xu, Philip Torr, Xun Cao, Yao Yao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the superiority of our large-scale pre-trained Direct3D over previous image-to-3D approaches, achieving significantly better generation quality and generalization ability, thus establishing a new state-of-the-art for 3D content creation. Project page: https://www.neural4d.com/research/direct3d. ... Section 4 ("Experiments") |
| Researcher Affiliation | Collaboration | ¹DreamTech, ²Nanjing University, ³University of Oxford |
| Pseudocode | No | The paper describes its model architectures and processes but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: While a portion of the data used in this research comprises private assets with significant commercial value, releasing this information would also violate the research contract signed by the authors. |
| Open Datasets | Yes | Our Direct3D is trained on a filtered subset of the Objaverse [7] dataset which consists of 160K high-quality 3D assets. |
| Dataset Splits | No | The paper does not provide specific percentages or counts for training, validation, and test splits. It mentions evaluation on a subset of the Google Scanned Objects (GSO) dataset for testing and refers to 'validation' in ablation studies but without concrete split details. |
| Hardware Specification | No | The paper mentions training on "GPU" but does not specify any particular GPU model (e.g., NVIDIA A100, V100) or other hardware details like CPU type or memory. |
| Software Dependencies | No | The paper refers to various models and optimizers used (e.g., AdamW, DINO-v2, CLIP, DiT-XL/2, DDIM) but does not provide specific version numbers for these software components or general programming environments (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | Our D3D-VAE takes as input 81,920 points with normals uniformly sampled from the 3D model, along with a learnable latent token of resolution r = 32 and channel dimension d_e = 768. The encoder network consists of 1 cross-attention layer and 8 self-attention layers, with each attention layer comprising 12 heads of dimension 64. The channel dimension of the latent representation is d_z = 16. The decoder network comprises 1 self-attention layer and 5 ResNet [11] blocks to upsample the latent representation into triplane feature maps with a resolution of 256 × 256 and a channel dimension of 32. The geometric mapping network consists of 5 linear layers with hidden dimension 64. During training, we sample 20,480 uniform points and 20,480 near-surface points for supervision. The KL regularization weight is set to λ_KL = 1e-6. We use the AdamW [33] optimizer with a learning rate of 1e-4 and a batch size of 16 per GPU. ... Our diffusion model adopts the network configuration of DiT-XL/2 [40], which consists of 28 layers of DiT blocks. Each attention layer includes 16 heads with a dimension of 72. We train the diffusion model with 1000 denoising steps using a linear variance scheduler ranging from 1e-4 to 2e-2. We employ the AdamW optimizer with a batch size of 32 per GPU and train for 800K steps. During inference, we apply 50 steps of DDIM [52] with the guidance scale set to 7.5. |
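The diffusion hyperparameters quoted in the setup row above (1000 training steps, a linear variance schedule from 1e-4 to 2e-2, 50 DDIM inference steps, guidance scale 7.5) can be sketched as plain NumPy code. This is not the authors' implementation; all function and variable names are illustrative assumptions, and the evenly spaced DDIM timestep subsequence is one common convention:

```python
import numpy as np

TRAIN_STEPS = 1000                 # denoising steps used during training
BETA_START, BETA_END = 1e-4, 2e-2  # linear variance schedule range
DDIM_STEPS = 50                    # sampling steps at inference
GUIDANCE_SCALE = 7.5               # classifier-free guidance weight

# Linear variance (beta) schedule and the cumulative products of (1 - beta),
# which govern how much signal survives at each noise level.
betas = np.linspace(BETA_START, BETA_END, TRAIN_STEPS)
alphas_cumprod = np.cumprod(1.0 - betas)

# DDIM typically samples an evenly spaced subsequence of training timesteps;
# here 1000 steps thinned to 50.
ddim_timesteps = np.arange(0, TRAIN_STEPS, TRAIN_STEPS // DDIM_STEPS)

def cfg_combine(eps_uncond, eps_cond, scale=GUIDANCE_SCALE):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the image-conditioned one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

A scale of 1.0 recovers the purely conditional prediction; 7.5, as quoted, pushes samples further toward the image condition at some cost in diversity.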