Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
RMB: Comprehensively benchmarking reward models in LLM alignment
Authors: Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address these limitations, we propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios and includes both pairwise and Best-of-N (BoN) evaluations to better reflect the effectiveness of RMs in guiding alignment optimization. We demonstrate a positive correlation between our benchmark and downstream alignment task performance. Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs, revealing their generalization defects that were not discovered by previous benchmarks, and highlighting the potential of generative RMs. |
| Researcher Affiliation | Academia | ¹ School of Computer Science, Fudan University; ² Institute of Modern Languages and Linguistics, Fudan University; ³ UNC Chapel Hill; ⁴ Pengcheng Laboratory |
| Pseudocode | No | The paper describes methods and calculations in text and mathematical formulas but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our evaluation code and datasets are available at https://github.com/Zhou-Zoey/RMB-Reward-Model-Benchmark. |
| Open Datasets | Yes | Our evaluation code and datasets are available at https://github.com/Zhou-Zoey/RMB-Reward-Model-Benchmark. |
| Dataset Splits | No | Within this diverse set of tasks, we consider two types of partial ordering relationships corresponding to the pairwise set and Best-of-N set in the benchmark. The pairwise set consists of the (chosen, rejected) pairs and requires the reward model to select the superior response from two answers. Beyond preference pair evaluation, we propose a Best-of-N (BoN) test set, as a new benchmark paradigm of RM evaluation. The BoN test set is constructed by (query, winner, list of losers) triplets, demanding that the reward model identify the single best answer from multiple responses. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or cloud instance specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions models and tools like "GPT-4-turbo-2024-04-09", "Llama-2", "Qwen-2-72B", "Llama-guard-2", "Llama3-guard", and "sentence-transformers", but it does not provide specific version numbers for software dependencies or frameworks used for their own implementation. |
| Experiment Setup | No | The paper describes the evaluation setup for reward models, including how pairwise and BoN accuracy are calculated, and refers to external benchmarks. However, it does not provide specific hyperparameters (e.g., learning rate, batch size) or training configurations for any models implemented or trained by the authors for their experiments. |
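The two evaluation paradigms quoted above (pairwise (chosen, rejected) pairs and BoN (query, winner, losers) triplets) can be sketched as accuracy computations over a scored dataset. This is a minimal illustration, not the authors' released evaluation code; `reward_fn` and the tuple layouts are assumptions based on the descriptions in the table.

```python
# Hedged sketch of pairwise and Best-of-N (BoN) accuracy as described
# in the benchmark. Names and data layouts are illustrative, not taken
# from the RMB repository.

def pairwise_accuracy(pairs, reward_fn):
    """pairs: iterable of (query, chosen, rejected) triples.
    The RM is counted correct when it scores chosen above rejected."""
    pairs = list(pairs)
    correct = sum(
        reward_fn(query, chosen) > reward_fn(query, rejected)
        for query, chosen, rejected in pairs
    )
    return correct / len(pairs)

def bon_accuracy(triplets, reward_fn):
    """triplets: iterable of (query, winner, losers) where losers is a
    list of responses. The RM is counted correct only when the winner
    outscores every loser, so a BoN item is strictly harder to pass
    than any single pairwise comparison drawn from it."""
    triplets = list(triplets)
    correct = sum(
        all(reward_fn(query, winner) > reward_fn(query, loser)
            for loser in losers)
        for query, winner, losers in triplets
    )
    return correct / len(triplets)
```

A reward model that is right on each pair in isolation can still fail a BoN item if any single loser slips above the winner, which is why the BoN set can expose defects that pairwise accuracy hides.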