Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards

Authors: Honghao Chen, Xingzhou Lou, Xiaokun Feng, Kaiqi Huang, Xinlong Wang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments across multiple benchmarks demonstrate that our approach significantly enhances reasoning capabilities, outperforming strong baselines with consistent improvements. Our empirical analysis and ablations indicate that incorporating step-level evaluation is more reasonable and effective than relying solely on answer-level evaluation. In this section, we first introduce the setting and implementation details in Section 4.1. Then we present a comprehensive comparison with state-of-the-art VLMs across multiple reasoning benchmarks in Section 4.2. In Section 4.3, we conduct extensive ablation studies and analyses to validate the effectiveness of our approach and explore several interesting properties.
Researcher Affiliation Collaboration Honghao Chen (1,2,3), Xingzhou Lou (1,2), Xiaokun Feng (1,2), Kaiqi Huang (1,2), Xinlong Wang (3). (1) Institute of Automation, Chinese Academy of Sciences; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) Beijing Academy of Artificial Intelligence
Pseudocode Yes We structure the entire thought into multiple consecutive cognitive steps, ensuring that these steps logically flow in a natural and coherent manner towards the final answer. Each step consists of the following three components: Name. Thought. Reflection. ... We use special tokens to establish this reasoning format... The reasoning structure is illustrated in Fig. 7. Figure 7: Illustration of structured reasoning. We take reasoning step (special tokens marked in blue) as the basic unit to conduct structured and fine-grained reasoning. Figure 8: Prompt used for generating step-by-step reasoning data. Figure 9: Prompt used for generating process-level annotation.
Open Source Code Yes Our dataset, PRM, and code at https://github.com/baaivision/CoS.
Open Datasets Yes Our dataset, PRM, and code at https://github.com/baaivision/CoS. We use GPT-4o [16] to construct ShareGPT-Step-300K, a dataset of 300K structured step-wise reasoning samples following the designed template. Our dataset encompasses a diverse range of tasks, utilizing 17 datasets that demand various reasoning skills, such as scientific reasoning [18, 31], mathematical reasoning [51, 64] and world knowledge [41]. We plan to release this dataset to advance research in the community on fine-grained step-level reasoning.
Dataset Splits Yes To evaluate the accuracy of the PRM, we reserved 10K process-annotated reasoning data for assessment. Among these, 5K questions were from the PRM training dataset, while the other 5K were unseen during PRM’s training, allowing us to evaluate PRM’s ability to assess in-domain data and generalization capability to unseen questions.
Hardware Specification Yes All SFT and iterative DPO experiments are conducted on 8 NVIDIA-A800 GPUs by default. For PRM, it is trained on 16 NVIDIA-A800 GPUs.
Software Dependencies No The paper does not provide specific software versions for libraries, frameworks, or programming languages used in the experiments.
Experiment Setup Yes Implementation details. To verify the effectiveness and generalizability of our approach, we employ LLaVA-NeXt [28] and InternVL-2.5-MPO [55] as our base VLMs, respectively. For each base VLM, we first conduct one epoch of supervised fine-tuning on ShareGPT-Step-300K. Then, based on the SFT model, we apply three rounds of iterative DPO to obtain the final model. For each round, we compile approximately 20K preference pairs as training data for DPO. The training recipes can be found in Appendix C. Table 7 (hyperparameter settings and training recipes) lists Learning Rate, Epoch, Warm-up Ratio, Weight Decay, Batch Size, Drop-path Rate, Update Parts, and Data Size for SFT, PRM Training, and DPO.
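
The training schedule quoted in the Experiment Setup row (one epoch of SFT, then three rounds of iterative DPO) can be laid out as a short sketch. This is an illustration only, not code from the paper or its repository: the names `Stage` and `build_schedule` are hypothetical, and the sample counts are the approximate figures quoted above (300K SFT samples, about 20K preference pairs per DPO round). The DPO epoch count is not stated in this summary, so it is left unset here; the actual values are in Table 7 of the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    """One training stage in the schedule (hypothetical helper type)."""
    name: str
    num_samples: int               # approximate figure quoted in the summary
    epochs: Optional[int] = None   # None means "not stated in this summary"


def build_schedule(sft_samples: int = 300_000,
                   dpo_rounds: int = 3,
                   pairs_per_round: int = 20_000) -> List[Stage]:
    """Return the stages in order: one SFT epoch, then iterative DPO rounds."""
    # Stage 1: one epoch of supervised fine-tuning on the step-wise dataset.
    stages = [Stage("sft", sft_samples, epochs=1)]
    # Stages 2..N: each DPO round trains on a freshly compiled set of
    # preference pairs, starting from the previous stage's model.
    for round_idx in range(1, dpo_rounds + 1):
        stages.append(Stage(f"dpo_round_{round_idx}", pairs_per_round))
    return stages


if __name__ == "__main__":
    for stage in build_schedule():
        epochs = stage.epochs if stage.epochs is not None else "see Table 7"
        print(f"{stage.name}: ~{stage.num_samples} samples, epochs: {epochs}")
```

Keeping the schedule as plain data makes it easy to check the round count and data sizes against the paper before wiring in a real trainer.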