Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Authors: Dongzhi JIANG, Ziyu Guo, Renrui Zhang, ZHUOFAN ZONG, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that our approach outperforms baseline models by 13% and 19% improvements on the T2I-CompBench and WISE benchmark, and even surpasses the previous state-of-the-art model FLUX.1. Qualitative analysis reveals that our method empowers the model to generate more human-aligned results... In this section, we first provide the main results of T2I-R1 in T2I-CompBench [28], WISE [61] and GenAI-Bench [45] in Section 3.1. Then we present the results of different reward function combinations in Section 3.2 and the ablation study of the effectiveness of two levels of CoT in Section 3.3. |
| Researcher Affiliation | Academia | ¹CUHK MMLab, ²CUHK IMIXR, ³Shanghai AI Laboratory, ⁴CPII under InnoHK |
| Pseudocode | No | The paper describes the methodology using prose and mathematical equations, without including any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All the training code and data are available at https://github.com/CaraJ7/T2I-R1. |
| Open Datasets | Yes | Our training dataset comprises text prompts sourced from the training set of T2I-CompBench [28] and [24], totaling 6,786 prompts with no images... Benchmark. We test on T2I-CompBench [28], WISE [61], GenAI-Bench [45], and TIIFBench [85] to validate the effectiveness of our method. |
| Dataset Splits | Yes | Benchmark. We test on T2I-CompBench [28], WISE [61], GenAI-Bench [45], and TIIFBench [85] to validate the effectiveness of our method... We follow the official evaluation setting of all the benchmarks. |
| Hardware Specification | Yes | All of our experiments are conducted on 8 H800. |
| Software Dependencies | No | For the reward model, we choose HPS [90] as the human preference model, Grounding DINO [49] as the object detector, and GIT [82] as the VQA model. For the ORM, we finetune LLaVA-OneVision-7B in the same manner as [24]. The paper lists the names of models and tools used but does not provide specific version numbers for software or libraries, such as Python or PyTorch versions. |
| Experiment Setup | Yes | We use a learning rate of 1e-6 and a beta of 0.01. Training hyperparameters: Learning Rate: 1e-6; Beta β: 0.01; Group Size G: 8; Classifier-Free Guidance Scale: 5; Max Gradient Norm: 1.0; Batch Size: 8; Training Steps: 1,600; Gradient Accumulation Steps: 2; Image Resolution h × w: 384 × 384 |
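
For readers who want the Experiment Setup values in machine-readable form, the following is a minimal sketch that collects them in a Python dataclass. The class name, field names, and the two derived quantities are illustrative assumptions and are not taken from the paper or its repository. In particular, the table does not say whether the batch size of 8 is per device or global, or whether it counts prompts or images, so the derived numbers hold only under the assumptions stated in the comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as quoted in the Experiment Setup row.

    Field names are hypothetical; only the values come from the table.
    """
    learning_rate: float = 1e-6
    beta: float = 0.01            # "Beta" in the table
    group_size: int = 8           # "Group Size G"
    cfg_scale: float = 5.0        # classifier-free guidance scale
    max_grad_norm: float = 1.0
    batch_size: int = 8
    training_steps: int = 1600
    grad_accum_steps: int = 2
    image_height: int = 384
    image_width: int = 384

    def prompts_per_update(self) -> int:
        # ASSUMPTION: batch_size counts prompts in one forward/backward
        # pass, so one optimizer update sees batch_size * grad_accum_steps.
        return self.batch_size * self.grad_accum_steps

    def images_per_update(self) -> int:
        # ASSUMPTION: every prompt gets group_size sampled images.
        return self.prompts_per_update() * self.group_size


config = TrainingConfig()
print(config)
```

Freezing the dataclass keeps the recorded values from being changed by accident, which suits a record of reported settings better than a mutable dictionary.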