Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?

Authors: Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, XingYu, Xinyu Lu, Ben He, Xianpei Han, Debing Zhang, Le Sun

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we conduct experiments in a synthetic setting to investigate how differences in RMs, as measured by accuracy, translate into gaps in optimized policy performance. Our findings reveal that while there is a weak positive correlation between accuracy and downstream performance, policies optimized towards RMs with similar accuracy can exhibit quite different performance. Moreover, we discover that the way accuracy is measured significantly impacts its ability to predict final policy performance.
Researcher Affiliation | Collaboration | 1 Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; 2 University of Chinese Academy of Sciences; 3 Xiaohongshu Inc. EMAIL EMAIL {benhe}@ucas.ac.cn, {loujie0822}@gmail.com, {dengyang}@xiaohongshu.com
Pseudocode | No | The paper describes methods and processes in narrative text and bullet points but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks, nor does it present structured steps formatted like code or an algorithm.
Open Source Code | No | The paper does not explicitly state that the authors release their own code for the methodology described. It mentions utilizing the OpenRLHF framework (Hu et al., 2024), but this refers to a third-party tool, not their specific implementation.
Open Datasets | Yes | Data from RewardBench (Lambert et al., 2024) are used to construct the test datasets D^test_RM and D^test_RL. The RM training dataset is built by mixing the following open-sourced datasets: Nectar (Zhu et al., 2023); Capybara-7K-binarized (Argilla, 2024); Orca-pairs (Intel, 2023); UltraFeedback (Cui et al., 2023); PKU-SafeRLHF (Ji et al., 2024); MTBench-human (Zheng et al., 2023); Chatbot-arena (Zheng et al., 2023); HH-RLHF (Bai et al., 2022; Ganguli et al., 2022).
Dataset Splits | Yes | Ultimately, 112,400 preference data samples were retained, with 7,052 set aside as a validation set for reward model training. Deduplication is performed on the prompts, leaving 2,733 distinct prompts.
Hardware Specification | No | The paper provides hyperparameters for RM and PPO training in Table 8 but does not specify any hardware details, such as GPU models, CPU models, or cloud computing instances, used for the experiments.
Software Dependencies | No | All RMs are initialized from the Llama-3-instruct-8B model (AI@Meta, 2024) and finetuned by minimizing the negative log-likelihood loss with a regularization term (Hou et al., 2024). Best-of-n sampling (BoN) and PPO (Schulman et al., 2017) are adopted as the algorithm A_RL to optimize the initial policy π_0, which is also the Llama-3-instruct-8B model. For PPO optimization, the OpenRLHF framework (Hu et al., 2024) is used.
Experiment Setup | Yes | Table 8: Summary of Training Hyperparameters. RM training: Max Length 2048, Regularization Coefficient 1e-2, Batch Size 256, Warmup Ratio 0.1, Learning Rate Scheduler cosine, Learning Rate 5e-6. PPO training: Train Batch Size 64, Rollout Batch Size 8, Generate Max Length 1024, Actor Learning Rate 1e-6, Critic Learning Rate 1e-5, KL Penalty 0.
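The accuracy-versus-downstream-performance correlation discussed in the Research Type row can be illustrated with a small sketch. The accuracy and score numbers below are invented for illustration only; they are not taken from the paper.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented numbers for illustration only: test accuracies of five RMs
# and evaluation scores of policies optimized against each of them.
rm_accuracy = [0.68, 0.71, 0.72, 0.75, 0.78]
policy_score = [0.40, 0.55, 0.43, 0.52, 0.58]
r = pearson(rm_accuracy, policy_score)  # positive but imperfect correlation
```

A coefficient well below 1 on such pairs is exactly the situation the paper describes: similar-accuracy RMs can still yield quite different optimized policies.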
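The bookkeeping in the Dataset Splits row (holding out a validation set from the preference data and deduplicating prompts) can be sketched as follows. The function name and arguments are assumptions for illustration, not the paper's code.

```python
import random

def split_and_dedup(preference_data, prompts, n_val, seed=0):
    """Hold out n_val validation samples and deduplicate prompts.

    Hypothetical helper mirroring the quoted bookkeeping (112,400
    samples with 7,052 held out; 2,733 distinct prompts).
    """
    rng = random.Random(seed)
    shuffled = list(preference_data)
    rng.shuffle(shuffled)
    val, train = shuffled[:n_val], shuffled[n_val:]
    # dict.fromkeys deduplicates while preserving first-seen order
    unique_prompts = list(dict.fromkeys(prompts))
    return train, val, unique_prompts
```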
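Best-of-n sampling, named in the Software Dependencies row, can be sketched as below. The `generate` and `reward` callables are hypothetical stand-ins for a policy model and a reward model; the paper's actual setup uses Llama-3-instruct-8B with OpenRLHF, not this code.

```python
def best_of_n(prompt, generate, reward, n=8):
    """Sample n candidate responses and return the RM-preferred one.

    generate(prompt) -> str and reward(prompt, response) -> float are
    assumed interfaces, standing in for a policy and a reward model.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

With a toy reward that scores response length, a three-sample call returns the longest candidate, which is the whole mechanism: the RM only has to rank the n candidates, not shape gradient updates as in PPO.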