Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Authors: Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results show that our approach outperforms baseline models by 13% and 19% improvements on the T2I-CompBench and WISE benchmark, and even surpasses the previous state-of-the-art model FLUX.1. Qualitative analysis reveals that our method empowers the model to generate more human-aligned results... In this section, we first provide the main results of T2I-R1 in T2I-CompBench [28], WISE [61] and GenAI-Bench [45] in Section 3.1. Then we present the results of different reward function combinations in Section 3.2 and the ablation study of the effectiveness of two levels of CoT in Section 3.3.
Researcher Affiliation	Academia	1 CUHK MMLab; 2 CUHK IMIXR; 3 Shanghai AI Laboratory; 4 CPII under InnoHK
Pseudocode	No	The paper describes the methodology using prose and mathematical equations, without including any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	All the training code and data are available at https://github.com/CaraJ7/T2I-R1.
Open Datasets	Yes	Our training dataset comprises text prompts sourced from the training set of T2I-CompBench [28] and [24], totaling 6,786 prompts with no images... Benchmark. We test on T2I-CompBench [28], WISE [61], GenAI-Bench [45], and TIIFBench [85] to validate the effectiveness of our method.
Dataset Splits	Yes	Benchmark. We test on T2I-CompBench [28], WISE [61], GenAI-Bench [45], and TIIFBench [85] to validate the effectiveness of our method... We follow the official evaluation setting of all the benchmarks.
Hardware Specification	Yes	All of our experiments are conducted on 8 H800.
Software Dependencies	No	For the reward model, we choose HPS [90] as the human preference model, Grounding DINO [49] as the object detector, and GIT [82] as the VQA model. For the ORM, we finetune LLaVA-OneVision-7B in the same manner as [24]. The paper lists the names of models and tools used but does not provide specific version numbers for software or libraries, such as Python or PyTorch versions.
Experiment Setup	Yes	We use a learning rate of 1e-6 and a beta of 0.01. Training hyperparameters: Learning Rate 1e-6; Beta (β) 0.01; Group Size (G) 8; Classifier-Free Guidance Scale 5; Max Gradient Norm 1.0; Batch Size 8; Training Steps 1,600; Gradient Accumulation Steps 2; Image Resolution (h × w) 384 × 384.
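
For readers who want to carry the reported settings into their own scripts, the hyperparameters in the Experiment Setup row can be written down as a small configuration object. This is only a sketch: the class name `TrainingConfig` and all field names are hypothetical and are not taken from the paper or its repository. Only the values come from the row above. Whether the batch size is per device or global is not stated there, so the sketch leaves it uninterpreted.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingConfig:
    """Values from the Experiment Setup row; all field names are hypothetical."""

    learning_rate: float = 1e-6
    beta: float = 0.01                # reported as "Beta"; its role is not defined in the row
    group_size: int = 8               # reported as "Group Size G"
    cfg_scale: float = 5.0            # classifier-free guidance scale
    max_grad_norm: float = 1.0
    batch_size: int = 8               # per-device vs. global is not stated in the row
    training_steps: int = 1600
    grad_accum_steps: int = 2
    image_height: int = 384
    image_width: int = 384


# Instantiate with the reported defaults and dump them for logging.
config = TrainingConfig()
for name, value in asdict(config).items():
    print(f"{name} = {value}")
```

Freezing the dataclass keeps the reported values from being changed by accident during a run; a variant for an ablation can be created with `dataclasses.replace(config, group_size=4)`.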