Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Authors: Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train Qwen2.5-Math-7B, Qwen3-4B and Llama-3.1-8B-Instruct on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples without reinforcing correct responses can be highly effective: it consistently improves performance over the base model across the entire Pass@k spectrum (k up to 256), often matching or surpassing PPO and GRPO. ... Evaluation setup. We evaluate on three widely used math reasoning benchmarks, including the test sets of MATH, AIME 2025 and AMC23. ... We adopt a full spectrum of Pass@k as our main evaluation metric, using k ∈ {1, 2, 4, 8, 16, 32, 64, 128, 256} for Qwen2.5-Math-7B and k ∈ {1, 2, 4, 8, 16, 32, 64} for Qwen3-4B. |
| Researcher Affiliation | Academia | (1) Computer Science Department, University of Virginia; (2) Princeton Language and Intelligence (PLI), Princeton University |
| Pseudocode | No | The paper includes mathematical equations and derivations of gradients but does not feature any explicitly labeled pseudocode or algorithm blocks describing a procedure step-by-step. |
| Open Source Code | Yes | Our code is available at https://github.com/TianHongZXY/RLVR-Decomposed. |
| Open Datasets | Yes | For the training set, we use MATH [17], which contains 7,500 problems. We evaluate on three widely used math reasoning benchmarks, including the test sets of MATH, AIME 2025 and AMC23. |
| Dataset Splits | Yes | For the training set, we use MATH [17], which contains 7,500 problems. We evaluate on three widely used math reasoning benchmarks, including the test sets of MATH, AIME 2025 and AMC23. |
| Hardware Specification | Yes | Our experiments are conducted over a single node with 8 NVIDIA H200 GPUs. |
| Software Dependencies | No | We train the models using the verl framework [44]. The paper does not specify a version number for the 'verl' framework or any other software dependency. |
| Experiment Setup | Yes | The prompt batch size is 1,024, with 8 rollouts generated per prompt. The sampling temperature during training is set to 1.0, and the maximum context length is set to 4,096, 4,096 and 32,768 tokens for Qwen2.5-Math-7B, Llama-3.1-8B-Instruct and Qwen3-4B, respectively. We update the model with a mini-batch size of 256 and a learning rate of 1e-6. More hyperparameter settings can be found in Appendix D.1. ... Both PPO and GRPO use a KL penalty coefficient of 1e-3. For PSR and NSR, we do not apply KL penalty, which we find to result in better performance. The learning rate of the critic model in PPO is 1e-5. The clip ratio is set to 0.2. We also apply entropy bonus to all the above objectives, with a coefficient of 1e-4. |
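
The table cites Pass@k over k ∈ {1, 2, 4, ..., 256} as the main evaluation metric but does not say how it is estimated. As a rough illustration only, the sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. That the paper uses this exact estimator is an assumption, and the function name `pass_at_k` and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem (assumed estimator).

    n: number of sampled responses, c: number of correct ones,
    k: sampling budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # must contain at least one correct response.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct response,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 256 samples for one problem, 3 of them correct,
# evaluated on the k grid quoted in the table for Qwen2.5-Math-7B.
K_GRID = [1, 2, 4, 8, 16, 32, 64, 128, 256]
curve = {k: pass_at_k(256, 3, k) for k in K_GRID}
```

A benchmark-level score would average the per-problem values for each k; exact integer binomials keep the ratio numerically stable at n = 256.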