Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Flow-GRPO: Training Flow Matching Models via Online RL

Authors: Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di ZHANG, Wanli Ouyang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This section empirically evaluates Flow-GRPO's ability to improve flow matching models on three tasks. (1) Composition Image Generation: This task requires precise object arrangement and attribute control. We report the results on GenEval. (2) Visual Text Rendering: a rule-based task that evaluates the accurate rendering of the text specified in the prompt. (3) Human Preference Alignment: This task aims to align T2I models with human preferences.
Researcher Affiliation	Collaboration	Jie Liu (1,3,5), Gongye Liu (2,3,*), Jiajun Liang (3), Yangguang Li (1), Jiaheng Liu (4), Xintao Wang (3), Pengfei Wan (3), Di Zhang (3), Wanli Ouyang (1,5). Affiliations: (1) MMLab, CUHK; (2) Tsinghua University; (3) Kling Team, Kuaishou Technology; (4) Nanjing University; (5) Shanghai AI Laboratory.
Pseudocode	No	The paper contains mathematical formulations and equations (e.g., Eq. 1-29) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code	Yes	Code: https://github.com/yifan123/flow_grpo
Open Datasets	Yes	We evaluate Flow-GRPO on T2I tasks with various reward types. (1) Verifiable rewards, using the GenEval [17] benchmark and visual text rendering task. (2) Model-based rewards, such as the human preference PickScore [19] reward. ... Image Quality Evaluation Metric. ... All metrics are computed on DrawBench [1], a comprehensive benchmark with diverse prompts for T2I models.
Dataset Splits	Yes	Training prompts are generated using official GenEval scripts, which apply templates and random combinations to construct the prompt dataset. The test set is strictly deduplicated... Based on the base model's initial accuracy across the six tasks, we set the prompt ratio as Position : Counting : Attribute Binding : Colors : Two Objects : Single Object = 7 : 5 : 3 : 1 : 1 : 0. ... We use GPT4o to produce 20K training prompts and 1K test prompts.
Hardware Specification	Yes	We train our model using 24 NVIDIA A800 GPUs.
Software Dependencies	No	The paper lists base models and reward models with links in Appendix B.2 but does not specify software dependencies like programming languages (e.g., Python), frameworks (e.g., PyTorch), or CUDA versions with their specific version numbers.
Experiment Setup	Yes	We use a sampling timestep T = 10 and an evaluation timestep T = 40. Other settings include a group size G = 24, a noise level a = 0.7 and an image resolution of 512. The KL ratio β is set to 0.04 for GenEval and Text Rendering, and 0.01 for PickScore. We use LoRA with α = 64 and r = 32.
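
For readers who want the reported settings in machine-readable form, the following is a minimal sketch that collects the values from the Experiment Setup and Dataset Splits rows. All class, field, and function names are hypothetical and are not taken from the paper or its repository. Reading the GenEval prompt ratio as per-category sampling probabilities is also an assumption; the quoted text gives only the ratio.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    sampling_steps: int = 10   # T during training-time sampling
    eval_steps: int = 40       # T during evaluation
    group_size: int = 24       # G
    noise_level: float = 0.7   # a
    resolution: int = 512      # image resolution
    lora_alpha: int = 64       # LoRA alpha
    lora_rank: int = 32        # LoRA r


# KL ratio (beta) per task, as quoted in the table; key names are hypothetical.
KL_BETA = {"geneval": 0.04, "text_rendering": 0.04, "pickscore": 0.01}

# GenEval prompt ratio quoted in the Dataset Splits row.
GENEVAL_PROMPT_RATIO = {
    "position": 7,
    "counting": 5,
    "attribute_binding": 3,
    "colors": 1,
    "two_objects": 1,
    "single_object": 0,
}


def ratio_to_probabilities(ratio):
    """Normalize integer ratio weights into exact fractions that sum to 1.

    Assumption: the ratio is interpreted as a sampling distribution over
    prompt categories; the source does not say how it is applied.
    """
    total = sum(ratio.values())
    if total <= 0:
        raise ValueError("ratio weights must sum to a positive number")
    return {name: Fraction(weight, total) for name, weight in ratio.items()}


if __name__ == "__main__":
    setup = ReportedSetup()
    print(setup)
    for name, p in ratio_to_probabilities(GENEVAL_PROMPT_RATIO).items():
        print(f"{name}: {p} ({float(p):.3f})")
```

Exact fractions are used so the normalized shares can be checked without floating-point rounding; under this reading, Position prompts would make up 7/17 of the GenEval training prompts and Single Object prompts none.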