Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing
Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Hao Tian, Shilin Yan, Weihao Yu, Xingyu Zeng, Jifeng Dai, Xihui Liu, Hongsheng Li
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate GoT framework on text-to-image generation, interactive image generation, and image editing. Experiments show quantitative improvements and qualitative benefits of our reasoning-guided approach, with ablation studies validating our design choices. |
| Researcher Affiliation | Collaboration | Rongyao Fang¹, Chengqi Duan³, Kun Wang⁴, Linjiang Huang¹, Hao Li¹˒⁵, Hao Tian⁴, Shilin Yan, Weihao Yu¹, Xingyu Zeng⁴˒⁶, Jifeng Dai⁵, Xihui Liu³, Hongsheng Li¹˒²; ¹CUHK MMLab, ²CPII under InnoHK, ³HKU, ⁴SenseTime, ⁵Shanghai AI Lab, ⁶SUAT |
| Pseudocode | No | The paper describes its methodology through architectural diagrams and textual explanations, but it does not contain explicit pseudocode or algorithm blocks. For instance, Figure 3 illustrates the GoT Framework with Semantic-Spatial Guidance, detailing components and data flow rather than algorithmic steps. |
| Open Source Code | No | We will release our datasets and models to facilitate future research. |
| Open Datasets | No | We define the formulation of semantic and spatial reasoning chains for visual generation and editing, and constructed the first large-scale GoT datasets, encompassing 8.4M image generation, 920K image editing samples. Creating this dataset, with its semantic-spatial annotations derived from complex MLLM-driven annotation pipelines, consumed over 3000 NVIDIA A100 GPU days. ... We will release our datasets and models to facilitate future research. |
| Dataset Splits | No | Our training process implements a two-phase approach: pretraining using LAHR-GoT, JourneyDB-GoT, and OmniEdit-GoT datasets (60,000 steps), followed by finetuning with FLUX-GoT, OmniEdit-GoT, and SEED-Edit-MultiTurn-GoT (10,000 steps). The paper mentions using standard benchmarks for evaluation (e.g., GenEval, Emu-Edit), but does not explicitly provide specific training/validation/test splits (e.g., percentages or counts) for its *constructed* GoT datasets. |
| Hardware Specification | Yes | Creating this dataset, with its semantic-spatial annotations derived from complex MLLM-driven annotation pipelines, consumed over 3000 NVIDIA A100 GPU days. |
| Software Dependencies | No | We employ low-rank adaptation (LoRA) [16] to efficiently update the Qwen2.5-VL decoder's parameters while fully optimizing the SDXL-based diffusion module. ... For both stages, we use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 1 \times 10^{-6}$. While specific models and optimizers are mentioned, explicit version numbers for these software components (e.g., Qwen2.5-VL version, SDXL version, Adam optimizer version) are not provided. |
| Experiment Setup | Yes | Our training process implements a two-phase approach: pretraining using LAHR-GoT, JourneyDB-GoT, and OmniEdit-GoT datasets (60,000 steps), followed by finetuning with FLUX-GoT, OmniEdit-GoT, and SEED-Edit-MultiTurn-GoT (10,000 steps). ... We adopt a cosine learning rate scheduler with 500 warmup steps and a maximum learning rate of $1 \times 10^{-4}$. For both stages, we use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 1 \times 10^{-6}$. We also apply a weight decay of 0.05 during training. The number of batch size is set to 128. ... On T2I task, GoT framework adopts $\alpha_t = 7.5$ and $\alpha_s = 4.0$ ... In the editing task, GoT framework adopts $\alpha_t = 4.0$, $\alpha_s = 3.0$, and $\alpha_r = 1.5$. |
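
The Experiment Setup row quotes a cosine learning rate schedule with 500 warmup steps and a maximum learning rate of 1e-4 over 60,000 pretraining steps. A minimal sketch of such a schedule follows. The quoted text does not state the warmup shape or the final learning rate, so linear warmup and decay to zero are assumptions here, and the function name `cosine_lr` is hypothetical rather than taken from the paper's code.

```python
import math

MAX_LR = 1e-4            # maximum learning rate quoted in the table
WARMUP_STEPS = 500       # warmup steps quoted in the table
PRETRAIN_STEPS = 60_000  # pretraining length quoted in the table


def cosine_lr(step, total_steps=PRETRAIN_STEPS, warmup_steps=WARMUP_STEPS,
              max_lr=MAX_LR, min_lr=0.0):
    """Return the learning rate at a 0-indexed training step.

    Assumptions (not stated in the quoted text): warmup is linear,
    and the cosine phase decays to min_lr = 0.
    """
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the cosine phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 499, 500, 30_250, 60_000):
        print(s, cosine_lr(s))
```

For the 10,000-step finetuning phase, the same function could be called with `total_steps=10_000`, assuming the schedule restarts per phase, which the quoted text also leaves unspecified.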