Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Scaling Diffusion Transformers Efficiently via $\mu$P
Authors: Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan LI
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we rigorously prove that µP of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-α, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing µP methodologies. Leveraging this result, we systematically demonstrate that DiT-µP enjoys robust HP transferability. Notably, DiT-XL-2-µP with transferred learning rate achieves 2.9× faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of µP on text-to-image generation by scaling PixArt-α from 0.04B to 0.61B and MMDiT from 0.18B to 18B. |
| Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Beijing Key Laboratory of Research on Large Models and Intelligent Governance; 3 Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; 4 ByteDance Seed; 5 RIKEN AIP; 6 Dept. of Comp. Sci. & Tech., Tsinghua University |
| Pseudocode | Yes | We summarize the methodology for verifying base HP transferability across widths, batch sizes, and training steps in Algorithm 1, 2, and 3 in Appendix D, respectively. ... Once base HP transferability is validated for diffusion Transformers, we can directly apply the µTransfer algorithm [73] (see Algorithm 4 in Appendix D) to zero-shot transfer base HPs from a proxy task to a target task. |
| Open Source Code | Yes | In addition, our code is available at https://github.com/ML-GSAI/Scaling-Diffusion-Transformers-muP. ... We also open-source our code at https://github.com/ML-GSAI/Scaling-Diffusion-Transformers-muP for DiT and PixArt-α experiments. |
| Open Datasets | Yes | Dataset. We train DiT and DiT-µP on the ImageNet training set [13]... Dataset. We use the SAM/SA-1B dataset [33]... Both FID and CLIP Score are computed on the aesthetic MJHQ-30K [39] and real MS-COCO-30K [45] datasets. |
| Dataset Splits | No | The paper mentions using the ImageNet training set, SAM/SA-1B dataset for training, and then evaluates on benchmark datasets like FID-50K, MJHQ-30K, and MS-COCO-30K. While these evaluation datasets have standard splits, the paper does not explicitly state the training/test/validation splits for the datasets it *trains* on, nor does it provide details for the internal MMDiT dataset. |
| Hardware Specification | Yes | It takes 104 (13 × 8) A100-80GB hours to train the DiT-µP with a width of 288, a batch size of 256, and a train iteration of 200K steps. ... It takes around 224 (32 × 7) A100-80GB days to reproduce the pretraining of DiT-XL-2-µP. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer [47] and other external codebases (e.g., DiT, Guided Diffusion), but it does not provide specific version numbers for the software components or libraries used to implement and run their experiments. |
| Experiment Setup | Yes | Training. We train DiT and DiT-µP using the AdamW [47]. Following the original DiT setup [52], we do not apply any learning rate schedule or weight decay, and constant learning rates are used in all experiments. The original DiT-XL-2 is trained with a learning rate $10^{-4}$ and a batch size of 256. ... We sweep the base learning rate over the set $\{2^{-13}, 2^{-12}, 2^{-11}, 2^{-10}, 2^{-9}\}$ across various widths, batch sizes, and training steps. |
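
The numeric values quoted in the Hardware Specification and Experiment Setup rows can be spelled out in a few lines of Python. This is only an illustrative sketch, not code from the paper's repository. It assumes the swept base learning rates are the negative powers of two from 2^-13 to 2^-9, and that the parenthesized pairs in the hardware row are the two factors of the quoted GPU-time totals. The names `base_lr_grid` and `gpu_time` are hypothetical.

```python
# Assumption: the sweep set in the Experiment Setup row is 2^-13 ... 2^-9
# (negative exponents, as is usual for learning rates).
BASE_LR_EXPONENTS = list(range(-13, -8))  # [-13, -12, -11, -10, -9]
base_lr_grid = [2.0 ** e for e in BASE_LR_EXPONENTS]


def gpu_time(factor_a: int, factor_b: int) -> int:
    """Total GPU time as the product of the two factors quoted in the table.

    Which factor is the GPU count and which is the wall-clock duration
    is not stated in the table, so the arguments are left unlabeled.
    """
    return factor_a * factor_b


if __name__ == "__main__":
    for exponent, lr in zip(BASE_LR_EXPONENTS, base_lr_grid):
        print(f"2^{exponent} = {lr:.3e}")
    # Width-288 DiT-µP run: quoted as 104 A100-80GB hours.
    print("A100-80GB hours:", gpu_time(13, 8))
    # DiT-XL-2-µP pretraining: quoted as around 224 A100-80GB days.
    print("A100-80GB days:", gpu_time(32, 7))
```

Listing the grid explicitly shows that the sweep covers five candidate values, each twice the previous one.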