Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Authors: Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate.
Researcher Affiliation | Academia | Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora; Princeton Language and Intelligence, Princeton University
Pseudocode | No | The paper describes algorithms like PPO, RLOO, and GRPO in the text, and provides theoretical derivations and equations. However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps for any method.
Open Source Code | Yes | Code for reproducing our results is available at https://github.com/princeton-pli/what-makes-good-rm.
Open Datasets | Yes | We partition the UltraFeedback [14] training set into two subsets: 80% of the samples are used for reward model training and the rest for the policy gradient step of RLHF. Output preferences in the reward modeling subset are relabeled using the ground truth reward. Initial policy. We SFT the pretrained Pythia-2.8B language model on AlpacaFarm.
Dataset Splits | Yes | We partition the UltraFeedback [14] training set into two subsets: 80% of the samples are used for reward model training and the rest for the policy gradient step of RLHF. The resulting training and test sets had 41419 and 1329 samples, respectively.
Hardware Specification | Yes | Hardware. All experiments ran on Nvidia H100 GPUs with 80GB memory. For SFT and reward model training, we used a single GPU. For policy gradient (i.e., RLOO and GRPO), we used two GPUs in runs with language models of roughly 1B parameters and four GPUs in runs with language models of roughly 3B parameters.
Software Dependencies | No | Code for reproducing our results, based on the PyTorch [59] and Hugging Face TRL [81] frameworks, can be found at https://github.com/princeton-pli/what-makes-good-rm. The paper mentions PyTorch and Hugging Face TRL but does not provide specific version numbers for these software components, which are required for a reproducible description.
Experiment Setup | Yes | Generation hyperparameters. When generating outputs from a policy, we used a temperature of 1 and a maximum output length of 512 tokens. SFT. We minimized the cross-entropy loss (as implemented by the TRL framework) for one epoch via the Adam optimizer [41] with a learning rate of 1e-6 and batch size of 32 (emulated via gradient accumulation steps). Reward models. We trained each reward model by minimizing the standard Bradley-Terry log-likelihood loss (as implemented by the TRL framework) for one epoch via the Adam optimizer with a learning rate of 5e-7 and batch size of 32 (emulated via gradient accumulation steps). Policy gradient. Our RLOO and GRPO implementations are based on the RLOOTrainer class from the TRL framework, which uses the Adam optimizer. We set the learning rate to 1e-7, batch size to 32 (emulated via gradient accumulation steps), and the num_mini_batches hyperparameter to 2. We kept the KL coefficient at its default value of 0.05. As in [2], for each prompt in a batch, we sampled two outputs (i.e., we set the RLOO k parameter to 2).
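
As a rough illustration of two quantities named in the Experiment Setup entry, the sketch below computes the Bradley-Terry negative log-likelihood for a single preference pair and the RLOO leave-one-out advantages for the k outputs sampled per prompt. It is not the paper's code or the TRL implementation: the function names are hypothetical, and plain scalar rewards are assumed in place of reward model outputs.

```python
import math


def bradley_terry_nll(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood -log(sigmoid(r_chosen - r_rejected)) for one pair."""
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin): never exponentiates a large positive value.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def rloo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out advantages: each reward minus the mean of the other k-1 rewards."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two sampled outputs per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    # Illustrative reward values only; they are not taken from the paper.
    print(bradley_terry_nll(1.5, 0.5))
    print(rloo_advantages([1.0, 3.0]))
```

With k = 2, as in the setup quoted above, each output's baseline is simply the reward of the other output, so the two advantages are equal in magnitude and opposite in sign.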