Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

Authors: Yu Zhang, Jialei Zhou, Xinchen Li, Qi Zhang, Zhongwei Wan, Duoqian Miao, Changwei Wang, Longbing Cao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate the effectiveness of our proposed DiT-ST in mitigating the complete-text comprehension defect.
Researcher Affiliation | Academia | 1 Tongji University, 2 The Ohio State University, 3 Macquarie University
Pseudocode | No | The paper describes methods through textual descriptions and diagrams (e.g., Figure 3), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Datasets and models are available. (from Abstract) We release both source code and necessary data to facilitate faithful reproduction of our results. The code repository includes training and inference scripts and environment configuration files. In addition, we provide access to all datasets used in our experiments on huggingface. (from NeurIPS Paper Checklist, Question 5)
Open Datasets | Yes | Datasets and models are available. (from Abstract) We adopt 100K samples from the SAM-LLaVA dataset [51], which contains semantically rich, long captions suitable for our split-text method. (from Section 4.1) We construct COCO-5K, following settings from CogView3 [64] and -DiT [65]. (from Section 4.1) For training, we observe that existing datasets used by baseline models MM-DiT [26] are suboptimal for complex caption-conditioned generation. Specifically, datasets like CC12M [62] and ImageNet [63]. (from Appendix A)
Dataset Splits | Yes | For evaluation, we construct COCO-5K, following settings from CogView3 [64] and -DiT [65]. It comprises 5,000 image-text pairs sampled across multiple caption-length intervals to assess performance under varying textual complexity. Additionally, we report gFID scores based on COCO30K to assess distribution-level image fidelity. (from Appendix A) To reduce metric variance and ensure fair model comparison, we construct a fixed 30K image-text subset from COCO for consistent evaluation across architectures and training scales. (from Appendix C) Table 4: CLIPScore performance comparisons on various caption length in Selected COCO-5K.
Hardware Specification | Yes | We specify in the main paper that all experiments were conducted on NVIDIA A100 GPUs. Each training run took approximately 12–24 GPU hours. (from NeurIPS Paper Checklist, Question 8)
Software Dependencies | No | The paper mentions using specific text encoders like CLIP-L/14, CLIP-G/14, and T5 XXL, and leveraging Qwen Plus API and Hugging Face libraries (Transformers and Diffusers), but it does not specify explicit version numbers for these software components or programming languages used.
Experiment Setup | Yes | The training objective comprises two components while employing a mixing coefficient λ: $L = L_{\text{CFM}} + \lambda L_{\text{attn}}$ (10). The other is $L_{\text{attn}}$, which constrains the staged cross-attention by aligning injected semantic primitives with split text while regularizing attention behavior for stable semantic injection. It is composed of three components, weighted with empirically determined ratios of α = 0.6, β = 0.25, and η = 0.15: $L_{\text{attn}} = \alpha L_{\text{inject}} + \beta L_{\text{conv}} + \eta L_{\text{mutex}}$ (11). (from Section 3.4) We set the moving average window size w to 3, allowing each b_t to consider the current and two preceding steps for better capturing the SNR trend. To ensure numerical stability when computing normalized differences, we set θ = 10^{-8}. For identifying the convergence point, we use a convergence threshold τ = 10^{-4}. During inference, we select 40 denoising steps from the full diffusion range (0–1000) via uniform random sampling. (from Appendix B)
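The hyperparameters quoted in the Experiment Setup row can be made concrete with a minimal Python sketch. It is not the authors' implementation. The weights α, β, η, the window size w, the constants θ and τ, and the 40-step count come from the quoted text. Everything else is an assumption made for illustration: the function names, the value of λ (not stated in the excerpt), the exact form of the "normalized difference" (here a relative change between consecutive smoothed values), the inclusive integer range 0 to 1000, and sampling without replacement.

```python
import random

# Weights quoted from Section 3.4 of the paper (Eq. 11).
ALPHA, BETA, ETA = 0.6, 0.25, 0.15


def attention_loss(l_inject, l_conv, l_mutex):
    """Eq. 11: weighted sum of the three attention loss terms."""
    return ALPHA * l_inject + BETA * l_conv + ETA * l_mutex


def total_loss(l_cfm, l_attn, lam):
    """Eq. 10: L = L_CFM + lambda * L_attn.

    The excerpt does not state the value of lambda; callers must supply it.
    """
    return l_cfm + lam * l_attn


def moving_average(values, w=3):
    """Trailing moving average over the current and up to w-1 preceding steps."""
    smoothed = []
    for t in range(len(values)):
        window = values[max(0, t - w + 1): t + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def first_converged_index(values, w=3, theta=1e-8, tau=1e-4):
    """Return the first index whose smoothed value has stopped changing.

    ASSUMPTION: the normalized difference is taken to be
    |b_t - b_{t-1}| / (|b_{t-1}| + theta); the excerpt does not give the formula.
    Returns None if no step falls below the threshold tau.
    """
    smoothed = moving_average(values, w)
    for t in range(1, len(smoothed)):
        rel_change = abs(smoothed[t] - smoothed[t - 1]) / (abs(smoothed[t - 1]) + theta)
        if rel_change < tau:
            return t
    return None


def sample_timesteps(num_steps=40, max_t=1000, seed=None):
    """Draw num_steps distinct timesteps uniformly at random from 0..max_t.

    ASSUMPTION: integer timesteps, inclusive range, no replacement,
    returned in descending order for use in a denoising loop.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(max_t + 1), num_steps), reverse=True)
```

For example, `sample_timesteps(seed=0)` returns 40 distinct integers in descending order, and `first_converged_index` returns `None` for a sequence whose smoothed values keep changing by more than τ in relative terms.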