Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Post-hoc Reward Calibration: A Case Study on Length Bias

Authors: Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo M. Ponti, Ivan Titov

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Focusing on the prevalent length bias, we validate our proposed approaches across three experimental settings, demonstrating consistent improvements: (1) a 3.11 average performance gain across 33 reward models on the Reward Bench dataset; (2) improved agreement of RM-produced rankings with GPT-4 evaluations and human preferences on the Alpaca Eval benchmark; and (3) improved Length-Controlled win rate (Dubois et al., 2024) of the RLHF process in multiple LLM-RM combinations. According to our experiments, our method is computationally efficient and generalisable to other types of bias and RMs, offering a scalable and robust solution for mitigating biases in LLM alignment and evaluation.
Researcher Affiliation | Collaboration | 1Zeyu Huang, 2Zihan Qiu, 3Zili Wang, 1Edoardo M. Ponti, 1,4Ivan Titov — 1University of Edinburgh, 2Alibaba Group, 3INF Technology, 4University of Amsterdam
Pseudocode | Yes | The full algorithm for using LOWESS for reward calibration is described in Appendix B, Algorithm 1.
Open Source Code | No | We directly employ the off-the-shelf library statsmodels.api and pass the dataset Dc to it for our implementation.
Open Datasets | Yes | We test the calibrated RM's performance on the Reward Bench dataset (Lambert et al., 2024). Based on the Alpaca Eval benchmark (Dubois et al., 2023a), we utilise 8 open-source RMs to rank 184 LLMs. They employ the 60k instructions from the UltraFeedback (Cui et al., 2024) dataset and sample 5 responses for each instruction from the LLM. huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback-armorm and huggingface.co/datasets/princeton-nlp/gemma2-ultrafeedback-armorm
Dataset Splits | Yes | For Reward Bench, there are 2,985 test data points, so we employ all 5,970 data points to perform the regression (every test data point is a pair of responses for one prompt). Regarding Alpaca Eval, there are 184 LLMs and 805 instructions for each LLM, resulting in 151k samples used for calibration. For LLM alignment, five responses are generated from the LLM for about 60k instructions, leading to about 300k samples used for bias estimation.
Hardware Specification | Yes | All experiments on reward calibration are performed with a single CPU. Experiments for obtaining the reward model's scores on Alpaca Eval's outputs are conducted on an NVIDIA H100 GPU. Experiments using DPO for LLM alignment use 8 NVIDIA H100 GPUs and take about 2-3 hours for the DPO training.
Software Dependencies | No | We directly employ the off-the-shelf library statsmodels.api and pass the dataset Dc to it for our implementation. We employ OpenCompass (Contributors, 2023) for benchmark evaluation and report the average score across these benchmarks.
Experiment Setup | Yes | All hyper-parameters related to RC-Mean, RC-LWR and RC-LWR-Penalty are presented in Appendix C. (1) Determining the threshold d: we determine the threshold distance d in Eq. 6 as follows: ... d = AVG(||x1| − |x2||)/4 ... (2) Determining the bandwidth f for LOWESS: ... we use the default bandwidth of statsmodels.api, f = 1/3, with 3 iterations. ... For Reward Bench ... we choose a larger bandwidth, f = 0.9. ... A larger learning rate of 5e-7 is used for gemma2-based experiments, compared to 3e-7 for Llama3-based experiments, following Meng et al. (2024).
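The quoted threshold rule computes d as the average absolute difference between the lengths of the two responses in each pair, divided by 4. A minimal sketch, assuming pairs are given as (length_of_x1, length_of_x2) tuples; the helper name `length_gap_threshold` is hypothetical:

```python
import numpy as np

def length_gap_threshold(pair_lengths):
    """Hypothetical helper for the quoted rule d = AVG(||x1| - |x2||)/4:
    average the absolute length gap over all response pairs, then
    divide by 4."""
    gaps = [abs(len1 - len2) for len1, len2 in pair_lengths]
    return float(np.mean(gaps)) / 4

# Two pairs with length gaps of 40 tokens each: d = 40 / 4 = 10
d = length_gap_threshold([(100, 140), (200, 160)])
```

Pairs whose length gap falls below d would then be treated as effectively equal-length during the calibration, per the description around Eq. 6.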