Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Authors: Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that RM-BENCH strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language models effectively. We evaluate nearly 40 reward models on RM-BENCH. |
| Researcher Affiliation | Academia | ¹Fudan University, ²Tsinghua University, ³Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes methodologies and processes (e.g., RM-BENCH construction, metrics calculation) in natural language and using equations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Related code and data are available at https://github.com/THU-KEG/RM-Bench. |
| Open Datasets | Yes | Related code and data are available at https://github.com/THU-KEG/RM-Bench. |
| Dataset Splits | Yes | For each prompt $x$, we compare the chosen and rejected responses across three style levels: concise $y^{\varnothing}$, detailed $y^{L}$, and detailed with Markdown formatting $y^{L,M}$. This allows us to evaluate reward models' ability to distinguish between chosen and rejected responses independently of stylistic differences. |
| Hardware Specification | No | The paper mentions using gpt-4o for response generation but does not specify any hardware used for running the experiments or evaluating the reward models. |
| Software Dependencies | No | The paper mentions various language models and frameworks (e.g., PPO, DPO, gpt-4o, Llama-3.1-8B, Nemotron-340B-Reward) but does not provide specific version numbers for any software dependencies used in their experimental setup. |
| Experiment Setup | Yes | Specifically, we first fine-tuned LLaMA-3-8B using the Tulu-v2 dataset to create the SFT model, followed by PPO training with the Ultrafeedback dataset. For PPO, we used AdamW with a learning rate of 1e-6, a batch size of 64, and a linear warmup scheduler for 10% of the total steps. |
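
The PPO learning-rate schedule quoted in the Experiment Setup row can be illustrated with a small sketch. This is not the authors' code: the function name `warmup_lr`, the choice to hold the rate constant after warmup, and the example step count are assumptions made for illustration. Only the peak learning rate (1e-6), the batch size (64), and the 10% linear warmup come from the quoted text.

```python
# Values quoted in the Experiment Setup row.
PEAK_LR = 1e-6        # AdamW learning rate
BATCH_SIZE = 64       # PPO batch size (listed for completeness; unused below)
WARMUP_FRACTION = 0.10  # linear warmup over 10% of total steps


def warmup_lr(step: int, total_steps: int,
              peak_lr: float = PEAK_LR,
              warmup_fraction: float = WARMUP_FRACTION) -> float:
    """Return the learning rate at a 0-indexed optimizer step.

    The rate rises linearly to `peak_lr` during the warmup phase.
    ASSUMPTION: after warmup the rate stays at `peak_lr`; the quoted
    text does not say whether any decay follows.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # At least one warmup step, so the division below is always defined.
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * ((step + 1) / warmup_steps)
    return peak_lr


if __name__ == "__main__":
    total = 1000  # hypothetical number of PPO optimizer steps
    for s in (0, 49, 99, 500):
        print(s, warmup_lr(s, total))
```

With a hypothetical 1,000 total steps, the first 100 steps ramp the rate up linearly and every later step uses the full 1e-6.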