Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Reward Reasoning Models

Authors: Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains.
Researcher Affiliation | Collaboration | Jiaxin Guo (1,2), Zewen Chi (1), Li Dong (1), Qingxiu Dong (1,3), Xun Wu (1), Shaohan Huang (1), Furu Wei (1). Affiliations: (1) Microsoft Research, (2) Tsinghua University, (3) Peking University.
Pseudocode | No | The paper describes methods like the ELO rating system and knockout tournament, and outlines the training framework, but does not present them in explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The pretrained models are available at https://huggingface.co/Reward-Reasoning.
Open Datasets | Yes | Training Data: Training RRMs requires diverse pairwise preference data spanning capabilities and aligns with human preference. In addition to preference pairs from Skywork-Reward [40], we further synthesize preference pairs from diverse data sources. We randomly sample 80K queries from the Tülu 3 prompt dataset [34]... Furthermore, we synthesize preference pairs using verifiable question-answer pairs from WebInstruct-verified [44], Skywork-OR1 [24], Big-Math-RL [2], and DAPO-Math [74].
Dataset Splits | No | The paper describes the composition of its training data, such as 'The final training dataset comprises approximately 420K preference pairs: 80K each from Skywork-Reward, Tülu-80K, our GPT-4o-labeled preference pairs, and the other synthetic data,' but does not explicitly specify how this combined data was split into training, validation, and test sets. It refers to 'widely-used benchmarks for reward modeling' for evaluation, which implies using their predefined splits, but it gives no splits for its own synthesized data.
Hardware Specification | Yes | The RRM training framework is implemented using the verl library [55], and we train both RRM-7B and RRM-32B models on AMD Instinct MI300X Accelerators.
Software Dependencies | No | We use DeepSeek-R1 distilled models [22] as base models, applying group relative policy optimization (GRPO) [72] for training, implemented with the verl library [55]. (The text names libraries but no versions.)
Experiment Setup | Yes | More implementation details and hyperparameters can be found in Section 4.1 and Appendix A.2. ... Table 7: Hyperparameters used for training RRMs. Batch size: 128; mini-batch size: 64; KL loss coefficient: 10^-3; sampling temperature: 0.6; maximum prompt length: 4096; maximum response length: 8192; GRPO group size: 16; learning rate (RRM-32B): 5 * 10^-7; learning rate (RRM-7B): 10^-6.
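
For readers who want the Table 7 values in machine-readable form, the sketch below collects them in a small Python dataclass. The numbers are copied from the quoted table. The class name, field names, and helper functions are hypothetical and are not taken from the paper or from the verl library. The two derived quantities rest on stated assumptions (in particular, that the batch size counts prompts rather than sampled responses), which the paper excerpt above does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RRMTrainingConfig:
    """Hyperparameters quoted from Table 7; field names are hypothetical."""
    batch_size: int = 128
    mini_batch_size: int = 64
    kl_loss_coef: float = 1e-3
    sampling_temperature: float = 0.6
    max_prompt_length: int = 4096
    max_response_length: int = 8192
    grpo_group_size: int = 16
    learning_rate: float = 1e-6  # value reported for RRM-7B


# The two model sizes differ only in learning rate, per the quoted table.
RRM_7B = RRMTrainingConfig()
RRM_32B = RRMTrainingConfig(learning_rate=5e-7)


def max_sequence_length(cfg: RRMTrainingConfig) -> int:
    """Upper bound on prompt plus response length, in tokens."""
    return cfg.max_prompt_length + cfg.max_response_length


def rollouts_per_batch(cfg: RRMTrainingConfig) -> int:
    """Sampled responses per batch.

    ASSUMPTION: batch_size counts prompts and each prompt receives
    grpo_group_size sampled responses. The excerpt does not state this.
    """
    return cfg.batch_size * cfg.grpo_group_size


if __name__ == "__main__":
    print(max_sequence_length(RRM_7B))
    print(rollouts_per_batch(RRM_7B))
```

Under these assumptions, a sequence is capped at 4096 + 8192 = 12288 tokens and a batch corresponds to 128 * 16 = 2048 sampled responses.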