Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Authors: Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We assess the effectiveness of self-consistency sampling (SCS) when combined with several outcome-reward reinforcement-learning algorithms namely RLOO [19], REINFORCE++ series [14], and GRPO [39]. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. Our main contributions are as follows: We empirically show that outcome-reward training encourages unfaithful reasoning in multimodal multiple-choice tasks, where models often arrive at correct answers through incorrect or inconsistent reasoning processes.
Researcher Affiliation Collaboration 1Xi'an Jiaotong University, 2University of Science and Technology of China, 3Shanghai Artificial Intelligence Laboratory, 4SenseTime Research
Pseudocode Yes Algorithm 1 Outcome Reward-based RL Training with Self-Consistency Sampling (SCS)
Require: Dataset D = {x_i}, i = 1..N, pretrained model parameters θ, truncation ratio k (0 < k < 1), number of resamples m, consistency weight c, learning rate α, resample number N.
1: Initialize optimizer O with θ
2: for each minibatch {x} in D do
3:   Sample initial answer & reasoning trajectory (a, τ) ∼ π_θ(· | x);
4:   r ← r_acc;
5:   A ← ∅;
6:   for t = 1 to m do
7:     τ_< ← Truncate(τ, k);
8:     Add noise to the image in x: x' ← x + N(0, σ_t²);
9:     Sample new answer a_t after τ_<, x';
10:    A ← A ∪ {a_t};
11:  end for
12:  r_con ← (N − |A|);
13:  r ← r_for + r_acc + r_con;
14:  Compute baseline of reward b;
15:  Compute policy gradient g ← ∇_θ log π_θ(a_0 | x)(r − b);
16:  θ ← θ + αO(g);
17: end for
18: return Updated parameters θ
Open Source Code Yes Our code is available at https://github.com/GenuineWWD/SCS.
Open Datasets Yes As shown in Table 1, we initially aggregate our training dataset from published datasets including M3CoT [7], ScienceQA [35] and Geometry3K [22]. For model evaluation, we adopt 6 mainstream multimodal benchmarks. They are (1) MathVision [43] (2) MathVerse [58] (3) We-Math [31] (4) MMMU [56] (5) M3CoT [7] and (6) ScienceQA [35], covering various fields of challenging problems such as Mathematics, Science, Medicine and so on, thoroughly evaluating MLLMs' perception and reasoning abilities.
Dataset Splits No As shown in Table 1, we initially aggregate our training dataset from published datasets including M3CoT [7], ScienceQA [35] and Geometry3K [22]. Then we apply a data filter and only multiple choice questions with multi-modal inputs are kept. For model evaluation, we adopt 6 mainstream multimodal benchmarks. They are (1) MathVision [43] (2) MathVerse [58] (3) We-Math [31] (4) MMMU [56] (5) M3CoT [7] and (6) ScienceQA [35]. The paper mentions
Hardware Specification Yes Each experiment was conducted with 8 A800 GPUs and took approximately 24 hours to train.
Software Dependencies No The training pipeline is implemented based on the open-source framework OpenRLHF, while the evaluation is conducted using established open-source libraries, including Transformers.
Experiment Setup Yes A Training Hyperparameters. In this section, Tables 6, 7, 8, and 9 show the hyperparameters used when training models with different RL algorithms (GRPO [33], REINFORCE++-baseline [14], REINFORCE++ [14], and RLOO [19]). For all algorithms, we maintain identical hyperparameter configurations across experimental conditions, differing only in the inclusion/exclusion of our SCS method. For each experiment, we save a checkpoint every 10 steps and select the one with the highest average score.
Table 6: Hyperparameter settings for RLOO experiments.
Setting | RLOO-Baseline | RLOO-SCS
Pretrained Model | Qwen2.5-VL-7B-Instruct | Qwen2.5-VL-7B-Instruct
RL Algorithm | RLOO | RLOO
Train Batchsize | 128 | 128
Rollout Batchsize | 128 | 128
Temperature | 1 | 1
Num Samples per Prompt | 16 | 16
Prompt Max Length | 1024 | 1024
Generate Max Length | 3000 | 3000
Bf16 | True | True
Actor Learning Rate | 1e-6 | 1e-6
Initial KL Coef | 0 | 0
Num Episodes | 1 | 1
Max Epochs | 1 | 1
Apply SCS | False | True
Response Truncation Ratio | / | 0.8
Resampled Trajectories Num | / | 4
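
As an illustration of the Pseudocode row (Algorithm 1, steps 6 to 13), the following is a minimal Python sketch of the consistency reward, not the authors' implementation. It assumes that A collects the distinct resampled answers and that the consistency reward is N minus the size of A, with N taken equal to the number of resamples m. The helper names `truncate`, `consistency_reward`, and `total_reward`, and the `resample_answer` callable (a stand-in for the policy model, including the image-noise step) are hypothetical. The consistency weight c listed among the algorithm's inputs does not appear in the extracted steps and is left out here.

```python
from typing import Callable, List, Optional, Sequence


def truncate(trajectory: Sequence[str], k: float) -> List[str]:
    """Keep the first k fraction of the reasoning tokens (0 < k < 1)."""
    if not 0.0 < k < 1.0:
        raise ValueError("truncation ratio k must satisfy 0 < k < 1")
    keep = max(1, int(len(trajectory) * k))
    return list(trajectory[:keep])


def consistency_reward(
    trajectory: Sequence[str],
    k: float,
    m: int,
    resample_answer: Callable[[List[str]], str],
    n: Optional[int] = None,
) -> int:
    """Resample m answers from the truncated trajectory and score agreement.

    Fewer distinct answers means a higher reward: n - |A|.
    Assumption: n defaults to m when not given.
    """
    prefix = truncate(trajectory, k)
    # A is a set, so repeated answers are counted once.
    answers = {resample_answer(prefix) for _ in range(m)}
    n = m if n is None else n
    return n - len(answers)


def total_reward(r_format: float, r_acc: float, r_con: float) -> float:
    """Step 13 of the algorithm: sum of format, accuracy, and consistency terms."""
    return r_format + r_acc + r_con


# Toy usage with hypothetical stand-ins for the policy model.
steps = [f"step{i}" for i in range(10)]

# A fully consistent "model": always answers "B" after the prefix.
stable = consistency_reward(steps, k=0.8, m=4, resample_answer=lambda p: "B")

# An inconsistent "model": cycles through four different options.
options = iter(["A", "B", "C", "D"])
unstable = consistency_reward(steps, k=0.8, m=4, resample_answer=lambda p: next(options))
```

With the Table 6 settings (truncation ratio 0.8, 4 resampled trajectories), a trajectory whose continuations all agree gets the largest consistency term in this sketch, while one whose continuations all disagree gets zero.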