Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention
Authors: Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Xun Cao, Philip H.S. Torr, Yao Yao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our approach successfully achieves high-quality and efficient gigascale 3D generation, a milestone previously unattainable by explicit 3D latent diffusion methods. Compared to prior native 3D diffusion techniques, our model consistently generates highly detailed 3D shapes while significantly reducing computational costs. Notably, Direct3D-S2 requires only 8 GPUs to train on public datasets [8, 9, 20] at a resolution of 1024³, in stark contrast to prior state-of-the-art methods, which typically require 32 or more GPUs even for training at 256³ resolution. |
| Researcher Affiliation | Collaboration | 1Nanjing University 2Dream Tech 3Fudan University 4University of Oxford |
| Pseudocode | Yes | A. Algorithm of Spatial Sparse Attention. In our Triton-based implementation of the spatial blockwise selection attention kernel, two key challenges arise within sparse 3D voxel structures: 1) the number of tokens varies across different blocks, and 2) tokens within the same block may not be contiguous in HBM. To address these, we first sort the input tokens based on their block indices, then compute the starting index C of each block as kernel input. In the inner loop, C dynamically governs the loading of corresponding block tokens. The complete procedure of forward pass is formalized in Algorithm 1. Algorithm 1: Spatial Blockwise Selection Attention Forward Pass |
| Open Source Code | No | Answer: [No] Justification: We will release the code after the completion of the review process. |
| Open Datasets | Yes | Our Direct3D-S2 is trained on publicly available 3D datasets including Objaverse [9], Objaverse XL [8], and ShapeNet [5]. |
| Dataset Splits | No | Our Direct3D-S2 is trained on publicly available 3D datasets including Objaverse [9], Objaverse XL [8], and ShapeNet [5]. Due to the prevalence of low-quality meshes in these collections, we curated approximately 452k 3D assets through rigorous filtering for training. Following prior approach [48] in geometry processing, we first convert the original non-watertight meshes into watertight ones, then compute ground-truth SDF volumes that serve as both input to and supervision for our SS-VAE. For training our image-conditioned DiT, we render 45 RGB images per mesh at 1024 × 1024 resolution with random camera parameters. [...] For the ablation studies, we conduct qualitative comparisons on this benchmark, and perform quantitative evaluations on a subset from the Objaverse dataset that does not overlap with the training set. |
| Hardware Specification | Yes | We first conduct multi-resolution training using SDF volumes at three resolutions of {256³, 384³, 512³} over a period of one day on 8 A100 GPUs, with a batch size of 4 per GPU. Subsequently, we fine-tune the SS-VAE for one additional day at 1024³ resolution with a learning rate of 1e−5 with a batch size of 1 per GPU. [...] For the DiT, we implement a progressive training strategy that gradually increases the resolution from 256³ to 1024³ to accelerate convergence. Tab. 3 presents the average latent token count, learning rate, batch size, and training duration settings at different resolutions. We employ the AdamW optimizer and trained the model for a total of 7 days on 8 A100 GPUs. |
| Software Dependencies | Yes | We implemented a custom Triton [36] GPU kernel for SSA, achieving a 3.9× speedup in the forward pass and a 9.6× speedup in the backward pass compared to FlashAttention-2 at 1024³ resolution. [...] We compare the forward and backward execution times of our SSA with those of FlashAttention-2 [7] across various token counts, using the implementation from Xformers [14] for FlashAttention-2. |
| Experiment Setup | Yes | The downsampling factor f for the encoder is set to 8, and the channel dimension of the latent representation z is configured to 16. The weights for the various losses are set as: λ_in = 1.0, λ_ext = 1e−1, λ_sharp = 1.0, and λ_KL = 1e−3. We employ the AdamW [24] optimizer with an initial learning rate of 1e−4. [...] Our SS-DiT comprises 24 layers of DiT blocks with a hidden dimension of 1024. We employ Grouped-Query Attention (GQA) [4] with a group number set to 2, where each group contains 16 attention heads. The hidden dimension of each head is configured as 32. For the spatial sparse attention (SSA) mechanism, we configure the resolution of the compression blocks to m_cmp = 4, the resolution of the selection blocks to m_slc = 8, and the size of the sparse 3D windows m_win = 8. We utilize DINO-v2 Large [28] to extract features from conditional images, with input images having a resolution of 518 × 518. |
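
The Pseudocode row quotes a preprocessing step: tokens are sorted by block index, and the starting index C of each block is passed to the kernel. The following is a minimal pure-Python sketch of that step only, not the authors' Triton kernel. The function name `sort_tokens_by_block`, the integer voxel coordinate format, and the choice to return an extra end offset are assumptions made for illustration.

```python
from itertools import accumulate


def sort_tokens_by_block(coords, block_size):
    """Group sparse voxel tokens by the spatial block that contains them.

    coords: list of (x, y, z) integer voxel coordinates (assumed format).
    block_size: edge length of a cubic block, in voxels (illustrative).

    Returns (order, blocks, starts):
      order  - token indices, sorted so each block's tokens are contiguous
      blocks - block ids in sorted order
      starts - prefix offsets; block k owns order[starts[k]:starts[k + 1]]
    """
    def block_id(coord):
        return tuple(v // block_size for v in coord)

    # Stable sort keeps the original token order inside each block.
    order = sorted(range(len(coords)), key=lambda i: block_id(coords[i]))

    # Count tokens per block; dict insertion order follows the sorted order.
    counts = {}
    for i in order:
        b = block_id(coords[i])
        counts[b] = counts.get(b, 0) + 1

    blocks = list(counts)
    # Exclusive prefix sum of counts gives each block's starting index.
    starts = [0] + list(accumulate(counts.values()))
    return order, blocks, starts


if __name__ == "__main__":
    voxels = [(9, 0, 0), (0, 0, 0), (8, 1, 0), (1, 1, 1), (0, 9, 0)]
    order, blocks, starts = sort_tokens_by_block(voxels, block_size=8)
    for k, b in enumerate(blocks):
        print(b, [voxels[i] for i in order[starts[k]:starts[k + 1]]])
```

Because `starts` carries one more entry than `blocks`, a variable token count per block is handled with a single slice, which mirrors how a start-index array lets an inner loop load only the tokens of the selected block.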