Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments confirm that incorporating long CoT reasoning significantly enhances the accuracy of reward signals. Extensive experiments demonstrate that incorporating long CoT reasoning significantly improves the accuracy and reliability of reward signals.
Researcher Affiliation | Collaboration | Yibin Wang (1, 2, 4), Zhimin Li (4), Yuhang Zang (3), Chunyu Wang (4), Qinglin Lu (4), Cheng Jin (1, 2), Jiaqi Wang (2, 3). Affiliations: (1) College of Computer Science and Artificial Intelligence, Fudan University; (2) Shanghai Innovation Institute; (3) Shanghai AI Lab; (4) Hunyuan, Tencent.
Pseudocode | No | The paper describes the methodology in prose and through diagrams (e.g., Figure 2) but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | We have released all the code in the supplementary and prepared a detailed file "readme.md" for reproducibility.
Open Datasets | Yes | For Image Generation, we utilize HPD (25.6K) [Christodoulou and Kuhlmann-Jørgensen, 2024], OIP (7.4K), EvalMuse (3K) [Han et al., 2024], all preprocessed by [Wang et al., 2025d], as well as OpenAI-4o_t2i_human_preference (6.7K) collected by Rapidata. For Video Generation, we employ VideoDPO (10K) [Liu et al., 2024] and Text2Video-Human Preferences (5.7K), also collected by Rapidata. For Image Understanding, we sample 30K data from LLaVA-Critic-113K [Xiong et al., 2024]. For Video Understanding, we adopt ShareGPTVideo-DPO (17K) [Zhang et al., 2024b].
Dataset Splits | Yes | Evaluations. We evaluate image and video understanding reward assessment on VLRewardBench [Li et al., 2024b] and ShareGPTVideo [Zhang et al., 2024b], using 5K test samples, respectively. For generation evaluation, we adopt GenAI-Bench [Jiang et al., 2024], which covers both image and video reward benchmarks. Additionally, we utilize VideoGen-RewardBench [Liu et al., 2025a] to further assess video generation. In the cold-start stage, we distill image generation CoT reward reasoning samples from GPT-4o, constructing our ImageGen-CoT-Reward-5K. The input data are randomly sampled from the image generation datasets, with the remaining data reserved for the subsequent training stages.
Hardware Specification | Yes | For both the cold-start and rejection sampling stages, training is performed with a batch size of 1, 16 gradient accumulation steps, a learning rate of 2.5 × 10^-6, and a warm-up ratio of 0.3, using 8 NVIDIA H100 (80GB) GPUs. For GRPO, training is conducted with a batch size of 1, a single gradient accumulation step, a learning rate of 1 × 10^-6, and a KL penalty coefficient of β = 0.04. The number of generated responses N is set to 8, using 64 NVIDIA H20 (97GB) GPUs.
Software Dependencies | No | The paper refers to using specific models like GPT-4o and Qwen2.5-VL but does not explicitly list software dependencies such as libraries or frameworks with their version numbers.
Experiment Setup | Yes | For both the cold-start and rejection sampling stages, training is performed with a batch size of 1, 16 gradient accumulation steps, a learning rate of 2.5 × 10^-6, and a warm-up ratio of 0.3, using 8 NVIDIA H100 (80GB) GPUs. For GRPO, training is conducted with a batch size of 1, a single gradient accumulation step, a learning rate of 1 × 10^-6, and a KL penalty coefficient of β = 0.04. The number of generated responses N is set to 8, using 64 NVIDIA H20 (97GB) GPUs.
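
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration, not the authors' code: the class and constant names are hypothetical, the learning rates are read as 2.5 × 10^-6 and 1 × 10^-6, and the effective batch size assumes that "batch size of 1" means per device, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage, as quoted in the Experiment Setup row."""
    name: str
    batch_size: int        # ASSUMPTION: per-device batch size; the source does not say
    grad_accum_steps: int
    learning_rate: float
    num_gpus: int
    gpu_model: str

    def effective_batch_size(self) -> int:
        # Valid only under the per-device assumption noted above.
        return self.batch_size * self.grad_accum_steps * self.num_gpus


# The cold-start and rejection sampling stages share one reported setup.
SFT_STAGES = StageConfig(
    name="cold_start_and_rejection_sampling",
    batch_size=1,
    grad_accum_steps=16,
    learning_rate=2.5e-6,
    num_gpus=8,
    gpu_model="NVIDIA H100 (80GB)",
)
SFT_WARMUP_RATIO = 0.3  # reported for the cold-start and rejection sampling stages

# GRPO stage as reported.
GRPO_STAGE = StageConfig(
    name="grpo",
    batch_size=1,
    grad_accum_steps=1,
    learning_rate=1e-6,
    num_gpus=64,
    gpu_model="NVIDIA H20 (97GB)",
)
GRPO_KL_BETA = 0.04      # KL penalty coefficient β
GRPO_NUM_RESPONSES = 8   # number of generated responses N

if __name__ == "__main__":
    for cfg in (SFT_STAGES, GRPO_STAGE):
        print(cfg.name, cfg.effective_batch_size())
```

If the reported batch size is global rather than per device, the effective batch size would instead be the batch size times the number of accumulation steps, so the figures from `effective_batch_size()` should be treated as an upper-bound reading of the quoted setup.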