Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step
Authors: Zheyuan Liu, Munan Ning, Qihui Zhang, Shuo Yang, Zhongrui Wang, Yiwei Yang, Xianzhe Xu, Yibing Song, Weihua Chen, Fan Wang, Li Yuan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, validating the effectiveness of this entangled generation paradigm. |
| Researcher Affiliation | Collaboration | 1Shenzhen Graduate School, Peking University 2Hupan Lab 3DAMO Academy, Alibaba Group 4Shanghai Jiao Tong University |
| Pseudocode | No | The paper describes the methodology using prose, mathematical formulas, and diagrams, but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | The complete code, datasets, and experimental setup required to reproduce the main results will be included in the supplementary material. |
| Open Datasets | Yes | To support structured layout supervision for training, we construct an automatically annotated 3D layout dataset based on EliGen [48] and LooseControl [4]. Building upon the original annotations of global prompts, per-entity descriptions, and 2D bounding boxes, we incorporate monocular depth estimation and segmentation models to recover entity-level depth and generate 3D bounding boxes via geometric fitting. To validate CoT-Diff, we present a new benchmark, dubbed 3DSceneBench, consisting of diverse and complex compositions of spatial relationship. On 3DSceneBench and two existing T2I benchmarks [34, 17], CoT-Diff outperforms state-of-the-art diffusion baselines... |
| Dataset Splits | No | The paper describes the composition of its datasets and how they were used for evaluation (e.g., 100 examples per setting, 200 complex prompts for user study), but does not explicitly provide training/test/validation splits for model training or evaluation in the conventional sense. |
| Hardware Specification | Yes | All models are implemented in PyTorch and executed on NVIDIA A100 GPUs. |
| Software Dependencies | No | Our method is implemented using Gemini 2.5 Pro as the default multimodal language model and FLUX.1-schnell [5] as the diffusion model. All models are implemented in PyTorch... We employed FLUX.1 dev as the pre-trained DiT. For each LoRA training, we utilize 8 A100 GPUs (80GB)... We employ the Prodigy optimizer [26] with safeguard warmup and bias correction enabled, setting the weight decay to 0.01 following OminiControl. |
| Experiment Setup | Yes | We perform all inferences with 20 denoising steps and use 5 different random seeds for fairness. The LoRA scale is set to 1. ... We utilize 8 A100 GPUs (80GB), a batch size of 1 per GPU. We employ the Prodigy optimizer [26] with safeguard warmup and bias correction enabled, setting the weight decay to 0.01 following OminiControl. For semantic LoRA LoRA_SL, we train the model 5000 iterations base on EliGen LoRA. For depth LoRA LoRA_DL, we train the model 30000 iterations. |
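
The hyperparameters quoted in the Experiment Setup row can be gathered into one place for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class names, field names, and the `effective_batch_size` and `samples_seen` helpers are hypothetical, and the derived numbers assume plain data-parallel training with no gradient accumulation, where one iteration is one optimizer step across all GPUs. The paper excerpt does not state either assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRATrainingConfig:
    """Reported training settings for one LoRA (field names are hypothetical)."""
    name: str
    iterations: int                    # 5000 (semantic) or 30000 (depth), as reported
    num_gpus: int = 8                  # "8 A100 GPUs (80GB)"
    batch_size_per_gpu: int = 1        # "a batch size of 1 per GPU"
    weight_decay: float = 0.01         # "setting the weight decay to 0.01"
    safeguard_warmup: bool = True      # Prodigy option reported as enabled
    use_bias_correction: bool = True   # Prodigy option reported as enabled

    @property
    def effective_batch_size(self) -> int:
        # Assumption: data parallelism without gradient accumulation.
        return self.num_gpus * self.batch_size_per_gpu

    @property
    def samples_seen(self) -> int:
        # Assumption: one iteration is one optimizer step over the global batch.
        return self.iterations * self.effective_batch_size


@dataclass(frozen=True)
class InferenceConfig:
    """Reported inference settings (field names are hypothetical)."""
    denoising_steps: int = 20          # "20 denoising steps"
    num_seeds: int = 5                 # "5 different random seeds"
    lora_scale: float = 1.0            # "The LoRA scale is set to 1"


SEMANTIC_LORA = LoRATrainingConfig(name="semantic", iterations=5000)
DEPTH_LORA = LoRATrainingConfig(name="depth", iterations=30000)
INFERENCE = InferenceConfig()

if __name__ == "__main__":
    for cfg in (SEMANTIC_LORA, DEPTH_LORA):
        print(cfg.name, cfg.effective_batch_size, cfg.samples_seen)
```

Values not given in the row, such as the learning rate, LoRA rank, and image resolution, are deliberately left out of the sketch and would need to come from the paper or its supplementary material.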