Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Incentivizing LLMs to Self-Verify Their Answers

Authors: Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, Bo An

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling.
Researcher Affiliation | Collaboration | Fuxiang Zhang (1,2), Jiacheng Xu (1,2), Chaojie Wang (2), Ce Cui (2), Yang Liu (2), Bo An (1,2). (1) Nanyang Technological University, Singapore; (2) Skywork AI. Corresponding author. Email: EMAIL
Pseudocode | Yes | Algorithm 1: GRPO with Self-Verification. 1: Input: Dataset D, initial policy π_θ, online buffer size T_b, group size G, total training steps T. 2: Initialize: Policy-aligned buffer B. 3: for t = 1, ..., T do
Open Source Code | Yes | Our code is available at https://github.com/mansicer/self-verification.
Open Datasets | Yes | For the evaluation benchmarks, we adopt the MATH500 dataset [10], AIME 2024 and 2025 problems, AMC 2023 problems, and OlympiadBench [52], which are commonly used benchmarks for evaluating math reasoning models [50].
Dataset Splits | Yes | Considering previous popular implementations of GRPO on math reasoning [28, 29], we use the level 3-5 data from the math training dataset [4] to train the Qwen2.5-Math-7B model and a combined math reasoning dataset sorted by DeepScaleR [29] to train the DeepSeek-R1-Distill-Qwen-1.5B model. For the evaluation benchmarks, we adopt the MATH500 dataset [10], AIME 2024 and 2025 problems, AMC 2023 problems, and OlympiadBench [52], which are commonly used benchmarks for evaluating math reasoning models [50].
Hardware Specification | Yes | For the training resources of our models, we use two nodes of 8 NVIDIA GPUs with 80GB memory each.
Software Dependencies | No | We use the popular verl framework [51] as the code base for RL training, where the maximal context length for Qwen2.5-Math-7B is 4k and for DeepSeek-R1-Distill-Qwen-1.5B is 16k. ... We leverage the vLLM engine [23] for efficient inference during training... We implement an efficient inference-time generation framework on top of Beeching et al. [53] but substitute the inference engine to SGLang [24]... (No specific version numbers for these software components are provided in the text).
Experiment Setup | Yes | In this section, we provide comprehensive details about our post-training process with self-verification. We use the popular verl framework, which is the open-source version of Sheng et al. [51], as our RL training framework. For the training resources of our models, we use two nodes of 8 NVIDIA GPUs with 80GB memory each. The detailed training configurations and hyperparameters are listed in Table 6. ... We adopt the GRPO algorithm for both models with the same batch size: a training batch size of 128 and a PPO mini-batch size of 64. Other training-related configurations are similar to the original implementation in verl. The learning rate is set to 1e-6 for both models. For the KL loss coefficient, we use 0.001 to maintain a balance between exploration and policy improvement. The entropy coefficient is set differently: 0.001 for Qwen-7B and 0.005 for R1-1.5B. Notably, we follow He et al. [55] to enable adaptive entropy for R1-1.5B with a target entropy of 0.2, allowing the entropy coefficient to adjust between 0 and 0.005 per step, as we find that a fixed entropy coefficient for the DeepSeek-R1-Distill-1.5B model results in an extremely unstable training process. ... The training runs for 1500 steps for Qwen-7B and 2000 steps for R1-1.5B, with rejection sampling enabled to filter out invalid generations.
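
The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch, which may help when comparing the two training runs side by side. The numeric values below come from the quoted text; the class name, field names, and the `update_entropy_coef` helper are hypothetical and are not verl options or the authors' code. In particular, the quoted text only states a target entropy of 0.2 and a coefficient range of 0 to 0.005, so the fixed-step update rule and its `step_size` are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row.

    Field names are hypothetical; they do not mirror verl's config keys.
    """
    name: str
    entropy_coef: float            # fixed value, or upper bound when adaptive
    total_steps: int
    train_batch_size: int = 128    # shared by both models
    ppo_mini_batch_size: int = 64  # shared by both models
    learning_rate: float = 1e-6    # shared by both models
    kl_loss_coef: float = 0.001    # shared by both models
    adaptive_entropy: bool = False
    target_entropy: Optional[float] = None


QWEN_7B = TrainConfig(
    name="Qwen2.5-Math-7B",
    entropy_coef=0.001,
    total_steps=1500,
)

R1_1_5B = TrainConfig(
    name="DeepSeek-R1-Distill-Qwen-1.5B",
    entropy_coef=0.005,
    total_steps=2000,
    adaptive_entropy=True,
    target_entropy=0.2,
)


def update_entropy_coef(
    coef: float,
    entropy: float,
    target: float = 0.2,
    step_size: float = 1e-4,  # assumed; not given in the quoted text
    lo: float = 0.0,
    hi: float = 0.005,
) -> float:
    """Assumed adaptive rule: nudge the coefficient up when the measured
    policy entropy is below the target, down otherwise, then clip to
    the [lo, hi] range stated in the quoted text."""
    if entropy < target:
        coef += step_size
    else:
        coef -= step_size
    return min(hi, max(lo, coef))
```

Under these assumptions, `update_entropy_coef(R1_1_5B.entropy_coef, entropy=0.1)` stays at the upper bound of 0.005, while a measured entropy above 0.2 lowers the coefficient by one step. The exact schedule used in the paper follows He et al. [55] and should be taken from that reference or from Table 6 of the paper.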