Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

RMB: Comprehensively Benchmarking Reward Models in LLM Alignment

Authors: Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address these limitations, we propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios and includes both pairwise and Best-of-N (BoN) evaluations to better reflect the effectiveness of RMs in guiding alignment optimization. We demonstrate a positive correlation between our benchmark and downstream alignment task performance. Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs, revealing their generalization defects that were not discovered by previous benchmarks, and highlighting the potential of generative RMs.
Researcher Affiliation | Academia | 1 School of Computer Science, Fudan University; 2 Institute of Modern Languages and Linguistics, Fudan University; 3 UNC Chapel Hill; 4 Pengcheng Laboratory
Pseudocode | No | The paper describes methods and calculations in text and mathematical formulas but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our evaluation code and datasets are available at https://github.com/Zhou-Zoey/RMB-Reward-Model-Benchmark.
Open Datasets | Yes | Our evaluation code and datasets are available at https://github.com/Zhou-Zoey/RMB-Reward-Model-Benchmark.
Dataset Splits | No | Within this diverse set of tasks, we consider two types of partial ordering relationships corresponding to the pairwise set and Best-of-N set in the benchmark. The pairwise set consists of the (chosen, rejected) pairs and requires the reward model to select the superior response from two answers. Beyond preference pair evaluation, we propose a Best-of-N (BoN) test set, as a new benchmark paradigm of RM evaluation. The BoN test set is constructed from (query, winner, list of losers) triplets, demanding that the reward model identify the single best answer from multiple responses.
Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or cloud instance specifications) used for running its experiments.
Software Dependencies | No | The paper mentions models and tools like "GPT-4-turbo-2024-04-09", "Llama-2", "Qwen-2-72B", "Llama-guard-2", "Llama3-guard", and "sentence-transformers", but it does not provide specific version numbers for software dependencies or frameworks used for their own implementation.
Experiment Setup | No | The paper describes the evaluation setup for reward models, including how pairwise and BoN accuracy are calculated, and refers to external benchmarks. However, it does not provide specific hyperparameters (e.g., learning rate, batch size) or training configurations for any models implemented or trained by the authors for their experiments.
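The two evaluation paradigms described in the Dataset Splits row can be sketched in a few lines. This is a hedged illustration, not the benchmark's actual implementation: the `score` function below is a hypothetical toy stand-in for a reward model, and the credit rule for BoN (winner must outscore every loser) is the natural reading of the (query, winner, list of losers) triplet structure.

```python
# Sketch of pairwise and Best-of-N (BoN) accuracy over preference data.
# `score` is a toy placeholder for a reward model (here: response length),
# used only so the example runs end to end.

def score(query: str, response: str) -> float:
    # Hypothetical reward model: longer responses score higher (illustration only).
    return float(len(response))

def pairwise_accuracy(pairs):
    """pairs: list of (query, chosen, rejected). The RM is credited when it
    assigns the chosen response a higher score than the rejected one."""
    correct = sum(score(q, chosen) > score(q, rejected)
                  for q, chosen, rejected in pairs)
    return correct / len(pairs)

def bon_accuracy(triplets):
    """triplets: list of (query, winner, losers), losers being a list.
    The RM is credited only if the winner outscores every loser."""
    correct = sum(
        all(score(q, winner) > score(q, loser) for loser in losers)
        for q, winner, losers in triplets
    )
    return correct / len(triplets)

pairs = [("q1", "a detailed answer", "short"),
         ("q2", "ok", "a rambling answer")]
print(pairwise_accuracy(pairs))  # 0.5 with the toy scorer

bon = [("q1", "the most detailed answer", ["short", "ok"])]
print(bon_accuracy(bon))  # 1.0
```

Substituting a real reward model for `score` (e.g. a scalar head over a language model) leaves both accuracy computations unchanged, which is why the same harness supports both the pairwise set and the BoN set.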