Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

Authors: Shuang Wu, Youtian Lin, Yifei Zeng, Feihu Zhang, Jingxi Xu, Philip Torr, Xun Cao, Yao Yao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate the superiority of our large-scale pre-trained Direct3D over previous image-to-3D approaches, achieving significantly better generation quality and generalization ability, thus establishing a new state-of-the-art for 3D content creation. Project page: https://www.neural4d.com/research/direct3d." ... "4 Experiments"
Researcher Affiliation | Collaboration | "¹DreamTech ²Nanjing University ³University of Oxford"
Pseudocode | No | The paper describes its model architectures and processes but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | "Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: While a portion of the data used in this research comprises private assets with significant commercial value, releasing this information would also violate the research contract signed by the authors."
Open Datasets | Yes | "Our Direct3D is trained on a filtered subset of the Objaverse [7] dataset which consists of 160K high-quality 3D assets."
Dataset Splits | No | The paper does not provide specific percentages or counts for training, validation, and test splits. It mentions evaluating on a subset of the Google Scanned Objects (GSO) dataset for testing and refers to "validation" in the ablation studies, but gives no concrete split details.
Hardware Specification | No | The paper mentions training on "GPU" but does not specify a particular GPU model (e.g., NVIDIA A100 or V100) or other hardware details such as CPU type or memory.
Software Dependencies | No | The paper refers to various models and optimizers (e.g., AdamW, DINOv2, CLIP, DiT-XL/2, DDIM) but does not provide version numbers for these components or for the programming environment (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup | Yes | "Our D3D-VAE takes as input 81,920 point clouds with normals uniformly sampled from the 3D model, along with a learnable latent token of resolution r = 32 and channel dimension d_e = 768. The encoder network consists of 1 cross-attention layer and 8 self-attention layers, with each attention layer comprising 12 heads of dimension 64. The channel dimension of the latent representation is d_z = 16. The decoder network comprises 1 self-attention layer and 5 ResNet [11] blocks to upsample the latent representation into triplane feature maps with a resolution of 256 × 256 and a channel dimension of 32. The geometric mapping network consists of 5 linear layers with a hidden dimension of 64. During training, we sample 20,480 uniform points and 20,480 near-surface points for supervision. The KL regularization weight is set to λ_KL = 1e-6. We use the AdamW [33] optimizer with a learning rate of 1e-4 and a batch size of 16 per GPU." ... "Our diffusion model adopts the network configuration of DiT-XL/2 [40], which consists of 28 layers of DiT blocks. Each attention layer includes 16 heads with a dimension of 72. We train the diffusion model with 1000 denoising steps using a linear variance scheduler ranging from 1e-4 to 2e-2. We employ the AdamW optimizer with a batch size of 32 per GPU and train for 800K steps. During inference, we apply 50 steps of DDIM [52] with the guidance scale set to 7.5."
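
The setup row above enumerates enough hyperparameters to pin down both training configurations, so they can be cross-checked mechanically. Below is a minimal Python sketch that collects the reported values into config dictionaries and reproduces the stated linear variance schedule. Every number comes from the quoted text; the names (D3D_VAE_CONFIG, DIFFUSION_CONFIG, linear_beta_schedule) are ours, not the authors', since the paper releases no code.

import numpy as np

# D3D-VAE hyperparameters as reported in the paper (dictionary keys are ours).
D3D_VAE_CONFIG = {
    "num_input_points": 81_920,            # point clouds with normals sampled from the 3D model
    "latent_token_resolution": 32,         # r = 32
    "encoder_dim": 768,                    # d_e = 768
    "encoder_cross_attn_layers": 1,
    "encoder_self_attn_layers": 8,
    "attn_heads": 12,
    "attn_head_dim": 64,
    "latent_channels": 16,                 # d_z = 16
    "decoder_self_attn_layers": 1,
    "decoder_resnet_blocks": 5,
    "triplane_resolution": 256,            # 256 x 256 triplane feature maps
    "triplane_channels": 32,
    "geo_mapping_layers": 5,
    "geo_mapping_hidden_dim": 64,
    "uniform_supervision_points": 20_480,
    "near_surface_supervision_points": 20_480,
    "kl_weight": 1e-6,                     # lambda_KL
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "batch_size_per_gpu": 16,
}

# Diffusion transformer (DiT-XL/2 configuration) hyperparameters as reported.
DIFFUSION_CONFIG = {
    "backbone": "DiT-XL/2",
    "dit_blocks": 28,
    "attn_heads": 16,
    "attn_head_dim": 72,
    "train_timesteps": 1000,
    "beta_start": 1e-4,                    # linear variance schedule endpoints
    "beta_end": 2e-2,
    "optimizer": "AdamW",
    "batch_size_per_gpu": 32,
    "train_steps": 800_000,
    "inference_sampler": "DDIM",
    "inference_steps": 50,
    "guidance_scale": 7.5,
}

def linear_beta_schedule(n_steps: int, beta_start: float, beta_end: float) -> np.ndarray:
    """Linear variance schedule beta_1..beta_T, as described for training."""
    return np.linspace(beta_start, beta_end, n_steps)

betas = linear_beta_schedule(
    DIFFUSION_CONFIG["train_timesteps"],
    DIFFUSION_CONFIG["beta_start"],
    DIFFUSION_CONFIG["beta_end"],
)
alphas_cumprod = np.cumprod(1.0 - betas)   # cumulative signal level, consumed by a DDIM sampler

A reproduction attempt could diff its own training arguments against these dictionaries before launching a run; note that hardware, software versions, and dataset splits would still have to be guessed, as the rows above record.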