Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
How to Evaluate Reward Models for RLHF
Authors: Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios N. Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback)... To investigate which reward model metrics are most correlated to gold-standard RLHF outcomes, we launch an end-to-end RLHF experiment on a large-scale crowd-sourced human preference platform to view real reward model downstream performance as ground truth. Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE)... |
| Researcher Affiliation | Academia | Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios N. Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica — UC Berkeley |
| Pseudocode | No | The paper describes methodologies in text and provides prompt templates for benchmarks in Appendix A.3.1, but it does not contain any structured pseudocode blocks or algorithms. |
| Open Source Code | Yes | Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE), the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance, which we open-source for public use and further development at github.com/lmarena/PPE. |
| Open Datasets | Yes | Additionally, we release PPE, a crowdsourced collection of 16,038 labeled human preference pairs... as well as a dataset of 2,555 prompts... all grounded with verifiable correctness labels. PPE evaluates reward models on 12 different metrics and 12 different domains... For the correctness metrics, we selected standard, widely used, reputable, and verifiable benchmarks: MMLU Pro (Wang et al., 2024b), MATH (Hendrycks et al., 2021), GPQA (Rein et al., 2023), MBPP Plus (Austin et al., 2021), and IFEval (Zhou et al., 2023). |
| Dataset Splits | Yes | We create a training dataset by first including 7,000 prompts sampled from the original 50,000 human preference votes... We then add 500 random prompts from MMLU-Pro that are not in PPE, and another 500 prompts from MATH train set (also mutually exclusive from PPE). For each prompt, we sample 16 responses from the base model, Llama-3.1-8B-Instruct... This process yields 8,000 total prompts, each with 16 different responses, totaling 128,000 responses. ... Overall, 12,190 human votes were collected and compiled into relative rankings between these RLHF-ed LLMs. |
| Hardware Specification | Yes | Costs are calculated from RunPod's hourly GPU pricing, which puts an NVIDIA A100 80GB PCIe instance at $1.64 per hour. |
| Software Dependencies | No | Implementation: TRL DPOTrainer (von Werra et al., 2020); Optimizer: AdamW, β1 = 0.9, β2 = 0.999; Space Optimization: DeepSpeed ZeRO-2. The paper names specific software components (TRL DPOTrainer, DeepSpeed ZeRO-2) but does not provide explicit version numbers for them. |
| Experiment Setup | Yes | DPO Configuration: Base Model: Meta-Llama-3.1-8B-Instruct; τ = 0.1; Learning Rate: 2.00 × 10⁻⁶; LR Schedule: Constant; Global Batch Size: 64; Max Length: 8192; Max Prompt Length: 4096; Optimizer: AdamW, β1 = 0.9, β2 = 0.999. |
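The dataset-split counts quoted above can be sanity-checked with a short arithmetic sketch; all numbers come from the paper's own description (7,000 preference prompts, 500 MMLU-Pro prompts, 500 MATH prompts, 16 responses per prompt):

```python
# Sanity-check of the reported training-set construction.
preference_prompts = 7_000   # sampled from the 50,000 human preference votes
mmlu_pro_prompts = 500       # MMLU-Pro prompts not in PPE
math_prompts = 500           # MATH train-set prompts, mutually exclusive from PPE
responses_per_prompt = 16    # sampled from Llama-3.1-8B-Instruct

total_prompts = preference_prompts + mmlu_pro_prompts + math_prompts
total_responses = total_prompts * responses_per_prompt

print(total_prompts)    # 8000
print(total_responses)  # 128000
```

This confirms the paper's totals of 8,000 prompts and 128,000 responses are internally consistent.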
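As a rough illustration, the reported DPO hyperparameters could be collected into a plain dict whose keys mirror TRL's `DPOConfig` field names (the key names, per-device batch size, world size, and DeepSpeed config path are assumptions for the sketch, not values from the paper; the learning rate is read as 2 × 10⁻⁶ from the paper's notation):

```python
# Hypothetical sketch: reported DPO hyperparameters as a config dict.
# Keys are modeled on TRL's DPOConfig fields (check against your TRL version);
# in practice this would be passed as trl.DPOConfig(**dpo_config).
dpo_config = {
    "beta": 0.1,                       # DPO temperature, reported as tau = 0.1
    "learning_rate": 2.0e-6,           # assumed reading of the reported LR
    "lr_scheduler_type": "constant",   # constant LR schedule
    "per_device_train_batch_size": 8,  # illustrative; paper reports global batch 64
    "gradient_accumulation_steps": 1,  # illustrative
    "max_length": 8192,
    "max_prompt_length": 4096,
    "optim": "adamw_torch",            # AdamW optimizer
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "deepspeed": "ds_zero2.json",      # hypothetical path to a ZeRO-2 config
}

# Effective global batch = per-device batch * accumulation steps * world size.
world_size = 8  # illustrative GPU count
global_batch = (dpo_config["per_device_train_batch_size"]
                * dpo_config["gradient_accumulation_steps"]
                * world_size)
print(global_batch)  # 64, matching the reported global batch size
```

The split into per-device batch size and world size is one of several ways to reach the reported global batch of 64; the paper does not specify the decomposition.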