Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
Authors: Yicheng Xiao, Lin Song, Yukang Chen, Yingmin Luo, Yuxin Chen, Yukang Gan, Wei Huang, Xiu Li, Xiaojuan Qi, Ying Shan
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that MindOmni outperforms existing models, achieving impressive performance on both understanding and generation benchmarks, meanwhile showcasing advanced fine-grained reasoning generation capabilities, especially with mathematical reasoning instruction. All codes will be made public at https://github.com/TencentARC/MindOmni. We also conduct extensive experiments to validate the effectiveness of MindOmni on both understanding and generation benchmarks. |
| Researcher Affiliation | Collaboration | ¹Tsinghua Shenzhen International Graduate School, Tsinghua University; ²ARC Lab, Tencent PCG; ³The Chinese University of Hong Kong; ⁴The University of Hong Kong |
| Pseudocode | No | The paper describes methods in prose and diagrams (Figures 2 and 3), but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All codes will be made public at https://github.com/TencentARC/MindOmni. |
| Open Datasets | Yes | We utilize the open-sourced image-caption pairs [36, 39] and in-house data as training data. Additionally, we incorporate the X2I dataset [52], which includes tasks for computer vision [54], in-context learning, multi-modal instructions, and subject-driven. |
| Dataset Splits | Yes | We evaluate our method during multimodal understanding and various generation benchmarks. Specifically, our model is evaluated on MMMU [63], MMBench [62] and RealWorldQA for image understanding. As for basic image generation tasks, we evaluate our model on GenEval [12], which involves various metrics such as counting, colors, and position. We also evaluate our text-to-image generation capability on DPG-Bench [16] following previous methods [5, 34]. For reasoning generation evaluation on WISE [27] Benchmark, we perform our MindOmni into thinking mode and generate the corresponding image with $1024 \times 1024$. |
| Hardware Specification | No | The paper describes training details such as batch size and image resolution, but does not specify the particular hardware (e.g., GPU models, CPU types) used for the experiments in the provided text. |
| Software Dependencies | No | The paper mentions building MindOmni based on Qwen2.5-VL and OmniGen, and leveraging Sana and Flux for data generation, but does not provide specific version numbers for software dependencies or libraries used in their implementation. |
| Experiment Setup | Yes | During training, we use a constant learning rate scheduler with an initial rate of $1 \times 10^{-4}$ and a weight decay of 0.05. The model is trained with a batch size of 1024 using images at a resolution of $256 \times 256$. In the second stage, we employ Qwen2.5-VL [1] to generate CoT instruction data. At this stage, we progressively increase the training resolution to $512 \times 512$, while reducing the learning rate to $5 \times 10^{-5}$. In the final stage, we utilize Qwen3 [44] to generate a dataset of logical reasoning texts, which serve as training data for reinforcement learning. We adopt a cosine scheduler following previous works [52, 53, 55] during the reinforcement learning phase. |
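
The staged hyperparameters quoted in the Experiment Setup row can be written down as a small configuration sketch. This is an illustration only, not the authors' training code. The names `StageConfig`, `STAGE_1`, `STAGE_2`, `constant_lr`, and `cosine_lr` are hypothetical. The quoted text gives no learning rate, step count, or minimum rate for the reinforcement learning phase, so the cosine function takes those as arguments rather than assuming values. It also does not say whether the stage-one batch size and weight decay carry over to stage two, so those fields are left unset there.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (hypothetical container)."""
    learning_rate: float
    resolution: int                      # square images: resolution x resolution
    batch_size: Optional[int] = None     # None = not stated in the quoted text
    weight_decay: Optional[float] = None


# Values taken from the quoted Experiment Setup text.
STAGE_1 = StageConfig(learning_rate=1e-4, resolution=256,
                      batch_size=1024, weight_decay=0.05)
# Stage 2: resolution is raised progressively up to 512; only the final value is stored.
STAGE_2 = StageConfig(learning_rate=5e-5, resolution=512)


def constant_lr(base_lr: float, step: int) -> float:
    """Constant schedule: the rate does not depend on the step."""
    return base_lr


def cosine_lr(base_lr: float, step: int, total_steps: int,
              min_lr: float = 0.0) -> float:
    """Standard cosine decay from base_lr to min_lr over total_steps.

    The exact cosine variant used in the paper is not given in the
    quoted text; this is the common textbook form.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, `cosine_lr(base_lr, step, total_steps)` returns `base_lr` at step 0 and decays smoothly to `min_lr` at `total_steps`, whereas `constant_lr` returns the same value at every step.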