Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
How to Evaluate Reward Models for RLHF
Authors: Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios N. Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback)... To investigate which reward model metrics are most correlated to gold-standard RLHF outcomes, we launch an end-to-end RLHF experiment on a large-scale crowd-sourced human preference platform to view real reward model downstream performance as ground truth. Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE)... |
| Researcher Affiliation | Academia | Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios N. Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica — UC Berkeley |
| Pseudocode | No | The paper describes methodologies in text and provides prompt templates for benchmarks in Appendix A.3.1, but it does not contain any structured pseudocode blocks or algorithms. |
| Open Source Code | Yes | Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE), the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance, which we open-source for public use and further development at github.com/lmarena/PPE. |
| Open Datasets | Yes | Additionally, we release PPE, a crowdsourced collection of 16,038 labeled human preference pairs... as well as a dataset of 2,555 prompts... all grounded with verifiable correctness labels. PPE evaluates reward models on 12 different metrics and 12 different domains... For the correctness metrics, we selected standard, widely used, reputable, and verifiable benchmarks: MMLU Pro (Wang et al., 2024b), MATH (Hendrycks et al., 2021), GPQA (Rein et al., 2023), MBPP Plus (Austin et al., 2021), and IFEval (Zhou et al., 2023). |
| Dataset Splits | Yes | We create a training dataset by first including 7,000 prompts sampled from the original 50,000 human preference votes... We then add 500 random prompts from MMLU-Pro that are not in PPE, and another 500 prompts from MATH train set (also mutually exclusive from PPE). For each prompt, we sample 16 responses from the base model, Llama-3.1-8B-Instruct... This process yields 8,000 total prompts, each with 16 different responses, totaling 128,000 responses. ... Overall, 12,190 human votes were collected and compiled into relative rankings between these RLHF-ed LLMs. |
| Hardware Specification | Yes | Costs are calculated from RunPod's hourly GPU pricing, which puts an NVIDIA A100 80GB PCIe instance at $1.64 per hour. |
| Software Dependencies | No | Implementation: TRL DPOTrainer (von Werra et al., 2020); Optimizer: AdamW, β1 = 0.9, β2 = 0.999; Space Optimization: DeepSpeed ZeRO-2. The paper names specific software components (TRL DPOTrainer, DeepSpeed ZeRO-2) but does not provide explicit version numbers for them. |
| Experiment Setup | Yes | DPO Configuration: Base Model: Meta-Llama-3.1-8B-Instruct; τ = 0.1; Learning Rate: 2.00 × 10⁻⁶; LR Schedule: Constant; Global Batch Size: 64; Max Length: 8192; Max Prompt Length: 4096; Optimizer: AdamW, β1 = 0.9, β2 = 0.999. |
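The dataset-split counts quoted above can be sanity-checked with a short arithmetic sketch; all numbers come from the paper's own description (7,000 preference prompts, 500 MMLU-Pro prompts, 500 MATH prompts, 16 responses per prompt):

```python
# Sanity-check of the reported training-set construction.
preference_prompts = 7_000   # sampled from the 50,000 human preference votes
mmlu_pro_prompts = 500       # MMLU-Pro prompts not in PPE
math_prompts = 500           # MATH train-set prompts, mutually exclusive from PPE
responses_per_prompt = 16    # sampled from Llama-3.1-8B-Instruct

total_prompts = preference_prompts + mmlu_pro_prompts + math_prompts
total_responses = total_prompts * responses_per_prompt

print(total_prompts)    # 8000
print(total_responses)  # 128000
```

This confirms the paper's totals of 8,000 prompts and 128,000 responses are internally consistent.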
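As a rough illustration, the reported DPO hyperparameters could be collected into a plain dict whose keys mirror TRL's `DPOConfig` field names (the key names, per-device batch size, world size, and DeepSpeed config path are assumptions for the sketch, not values from the paper; the learning rate is read as 2 × 10⁻⁶ from the paper's notation):

```python
# Hypothetical sketch: reported DPO hyperparameters as a config dict.
# Keys are modeled on TRL's DPOConfig fields (check against your TRL version);
# in practice this would be passed as trl.DPOConfig(**dpo_config).
dpo_config = {
    "beta": 0.1,                       # DPO temperature, reported as tau = 0.1
    "learning_rate": 2.0e-6,           # assumed reading of the reported LR
    "lr_scheduler_type": "constant",   # constant LR schedule
    "per_device_train_batch_size": 8,  # illustrative; paper reports global batch 64
    "gradient_accumulation_steps": 1,  # illustrative
    "max_length": 8192,
    "max_prompt_length": 4096,
    "optim": "adamw_torch",            # AdamW optimizer
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "deepspeed": "ds_zero2.json",      # hypothetical path to a ZeRO-2 config
}

# Effective global batch = per-device batch * accumulation steps * world size.
world_size = 8  # illustrative GPU count
global_batch = (dpo_config["per_device_train_batch_size"]
                * dpo_config["gradient_accumulation_steps"]
                * world_size)
print(global_batch)  # 64, matching the reported global batch size
```

The split into per-device batch size and world size is one of several ways to reach the reported global batch of 64; the paper does not specify the decomposition.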