Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Authors: Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We train Qwen2.5-Math-7B, Qwen3-4B and Llama-3.1-8B-Instruct on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples without reinforcing correct responses can be highly effective: it consistently improves performance over the base model across the entire Pass@k spectrum (k up to 256), often matching or surpassing PPO and GRPO. ... Evaluation setup. We evaluate on three widely used math reasoning benchmarks, including the test sets of MATH, AIME 2025 and AMC23. ... We adopt a full spectrum of Pass@k as our main evaluation metric, using k ∈ {1, 2, 4, 8, 16, 32, 64, 128, 256} for Qwen2.5-Math-7B and k ∈ {1, 2, 4, 8, 16, 32, 64} for Qwen3-4B.
Researcher Affiliation	Academia	(1) Computer Science Department, University of Virginia; (2) Princeton Language and Intelligence (PLI), Princeton University
Pseudocode	No	The paper includes mathematical equations and derivations of gradients but does not feature any explicitly labeled pseudocode or algorithm blocks describing a procedure step-by-step.
Open Source Code	Yes	Our code is available at https://github.com/TianHongZXY/RLVR-Decomposed.
Open Datasets	Yes	For the training set, we use MATH [17], which contains 7,500 problems. We evaluate on three widely used math reasoning benchmarks, including the test sets of MATH, AIME 2025 and AMC23.
Dataset Splits	Yes	For the training set, we use MATH [17], which contains 7,500 problems. We evaluate on three widely used math reasoning benchmarks, including the test sets of MATH, AIME 2025 and AMC23.
Hardware Specification	Yes	Our experiments are conducted over a single node with 8 NVIDIA H200 GPUs.
Software Dependencies	No	We train the models using the verl framework [44]. The paper does not specify a version number for the 'verl' framework or any other software dependency.
Experiment Setup	Yes	The prompt batch size is 1,024, with 8 rollouts generated per prompt. The sampling temperature during training is set to 1.0, and the maximum context length is set to 4,096, 4,096 and 32,768 tokens for Qwen2.5-Math-7B, Llama-3.1-8B-Instruct and Qwen3-4B, respectively. We update the model with a mini-batch size of 256 and a learning rate of 1e-6. More hyperparameter settings can be found in Appendix D.1. ... Both PPO and GRPO use a KL penalty coefficient of 1e-3. For PSR and NSR, we do not apply KL penalty, which we find to result in better performance. The learning rate of the critic model in PPO is 1e-5. The clip ratio is set to 0.2. We also apply entropy bonus to all the above objectives, with a coefficient of 1e-4.
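
The Research Type row names Pass@k as the main evaluation metric. As a rough illustration of how such a score can be computed from sampled responses, the sketch below uses the commonly used unbiased estimator: with n samples per problem, c of them correct, Pass@k = 1 - C(n - c, k) / C(n, k). This is an assumption about the computation, not something the index entry confirms; the function name pass_at_k and the example counts are hypothetical, and the paper's own evaluation code may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem.

    n: number of sampled responses (assumed; must be >= k)
    c: number of those responses judged correct
    k: sampling budget being evaluated
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct response.
    return 1.0 - comb(n - c, k) / comb(n, k)


# The k values quoted above for Qwen2.5-Math-7B.
K_VALUES = [1, 2, 4, 8, 16, 32, 64, 128, 256]

if __name__ == "__main__":
    # Hypothetical problem: 256 samples, 4 of them correct.
    for k in K_VALUES:
        print(f"Pass@{k}: {pass_at_k(256, 4, k):.4f}")
```

A benchmark-level score would then be the mean of pass_at_k over all problems. Drawing n at least as large as the largest k (256 here) lets a single set of samples serve every value of k.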