Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SparseDiT: Token Sparsification for Efficient Diffusion Transformer
Authors: Shuning Chang, Pichao Wang, Jiasheng Tang, Fan Wang, Yi Yang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate SparseDiT's effectiveness, achieving a 55% reduction in FLOPs and a 175% improvement in inference speed on DiT-XL with similar FID score on 512×512 ImageNet, a 56% reduction in FLOPs across video generation datasets, and a 69% improvement in inference speed on PixArt-α on text-to-image generation task with a 0.24 FID score decrease. SparseDiT provides a scalable solution for high-quality diffusion-based generation compatible with sampling optimization techniques. |
| Researcher Affiliation | Collaboration | Shuning Chang (1, 2, 3), Pichao Wang (2), Jiasheng Tang (2, 3), Fan Wang (2, 3), Yi Yang (1); (1) Zhejiang University, (2) Damo Academy, Alibaba Group, (3) Hupan Lab. EMAIL |
| Pseudocode | No | The paper describes the SparseDiT architecture and strategies using text and mathematical equations (e.g., Eqs. 1-5) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/changsn/SparseDiT. |
| Open Datasets | Yes | Our experiments demonstrate SparseDiT's effectiveness, achieving a 55% reduction in FLOPs and a 175% improvement in inference speed on DiT-XL with similar FID score on 512×512 ImageNet [11] images, a 56% reduction in FLOPs across video generation datasets, including FaceForensics [45], SkyTimelapse [60], UCF101 [54], and Taichi-HD [50]. Additionally, on the more challenging text-to-image generation task, we achieve a 69% improvement in inference speed on PixArt-α with a 0.24 FID score reduction. |
| Dataset Splits | Yes | We conduct our experiments on ImageNet-1k [11] at resolutions of 256×256 and 512×512, following the protocol established in DiT. For DiT-XL, the model consists of 2, 24, and 2 transformers in the bottom, middle, and top segments, respectively. ... Following prior works, we sample 50,000 images to compute the Fréchet Inception Distance (FID) [17] using the ADM TensorFlow evaluation suite [12], along with the Inception Score (IS) [46], sFID [38], and Precision-Recall metrics [23]. |
| Hardware Specification | Yes | Throughput is evaluated with a batch size of 128 on an Nvidia A100 GPU. |
| Software Dependencies | No | The paper mentions using the ADM TensorFlow evaluation suite, but it does not specify version numbers for any key software components or libraries used in their implementation. |
| Experiment Setup | Yes | All training settings and hyperparameters follow their respective papers. Fine-tuning requires approximately 6% of the time needed for training from scratch, e.g., 400K iterations for DiT-XL fine-tuning. ... Classifier-free guidance [19] (CFG) is set to 1.5 for evaluation and 4.0 for visualization. Throughput is evaluated with a batch size of 128 on an Nvidia A100 GPU. |
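
The percentages quoted in the table can be converted into ratios relative to the baseline. The sketch below is illustrative only. It assumes that "reduction in FLOPs" is measured against the baseline's FLOPs and that "improvement in inference speed" means a relative throughput gain. The function names are hypothetical and do not come from the paper or its code.

```python
def remaining_flops_fraction(reduction_pct: float) -> float:
    """Fraction of baseline FLOPs left after a percentage reduction."""
    return 1.0 - reduction_pct / 100.0


def throughput_multiple(improvement_pct: float) -> float:
    """Throughput as a multiple of the baseline.

    Assumption: an "X% improvement in inference speed" means
    new_throughput = baseline_throughput * (1 + X / 100).
    """
    return 1.0 + improvement_pct / 100.0


if __name__ == "__main__":
    # Figures quoted in the table for DiT-XL on 512x512 ImageNet.
    print(f"FLOPs remaining: {remaining_flops_fraction(55):.2f}x baseline")
    print(f"Throughput: {throughput_multiple(175):.2f}x baseline")
    # Figure quoted for PixArt-alpha on text-to-image generation.
    print(f"Throughput: {throughput_multiple(69):.2f}x baseline")
```

Under these assumptions, a 55% FLOPs reduction leaves 0.45 of the baseline compute, and a 175% speed improvement corresponds to 2.75 times the baseline throughput.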