The Trickle-down Impact of Reward Inconsistency on RLHF
Authors: Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng, Haitao Mi, Daniel Khashabi, Dong Yu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We observe that current RMs trained with the standard ranking objective fail miserably on CONTRAST INSTRUCTIONS compared to average humans. To show that RM consistency can be improved efficiently without using extra training budget, we propose two techniques CONVEXDA and REWARDFUSION, which enhance reward consistency through extrapolation during the RM training and inference stage, respectively. We show that RLHF models trained with a more consistent RM yield more useful responses, suggesting that reward inconsistency exhibits a trickle-down effect on the downstream RLHF process." and "With CONTRAST INSTRUCTIONS, we evaluate the consistency of RMs trained with the standard ranking objective (Eq. 1)." (a hedged sketch of this ranking objective appears after the table) |
| Researcher Affiliation | Collaboration | Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng, Haitao Mi, Daniel Khashabi, Dong Yu (Johns Hopkins University; University of Pennsylvania; Tencent AI Lab) |
| Pseudocode | Yes | Algorithm 1: Vanilla Attack (VA) |
| Open Source Code | No | No explicit statement or link for open-source code for the methodology presented in this paper. The only link is to a chatbot arena leaderboard, which is an external resource. |
| Open Datasets | Yes | We adopt four open-source human preference datasets of various NLP tasks: STACKEXCHANGE for question answering (Askell et al., 2021), WMT for machine translation (Ma et al., 2019), REALSUMM for text summarization (Bhandari et al., 2020), and TWITTER for paraphrase generation (Shen et al., 2022c). |
| Dataset Splits | Yes | Each human preference dataset is divided into training, development, and testing sets. The reward model is trained on the training set, training stops once it attains optimal performance on the development set, and the model is then evaluated on the test set. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models or specific cloud instances) are provided, only general statements about 'resource constraints' and 'computational limits'. |
| Software Dependencies | No | The paper mentions software like 'LLaMa-7B checkpoint', 'PPO algorithm', 'Low-Rank Adaptor (LoRA)', 'AdamW optimizer', and 'Adafactor optimizer', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use the AdamW optimizer and set a learning rate of 2e-5. For multitask training, we combine the training set from selected benchmarks and train using LoRA with a learning rate of 3e-5. ... The learning rate is set to 1.4e-5, and we utilize the Adafactor optimizer. The default learning rate scheduler type is set to linear. The initial KL penalty coefficient is set as 0.2, and an adaptive KL control is used, with a linear scheduler. The pretraining gradient coefficient γ is set to 0 for our experiments. (an illustrative configuration sketch follows the table) |
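
The "standard ranking objective (Eq. 1)" referenced in the Research Type row is the usual pairwise loss for reward-model training. The sketch below assumes the common Bradley-Terry-style formulation, minimizing the negative log-sigmoid of the reward margin between the preferred and dispreferred response; the function name and toy tensors are illustrative and not the authors' code.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                          reward_rejected: torch.Tensor) -> torch.Tensor:
    """Standard pairwise ranking objective for reward models:
    -log sigmoid(r(x, y_chosen) - r(x, y_rejected)), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar rewards for a batch of four preference pairs.
r_chosen = torch.tensor([1.2, 0.7, 0.3, 2.1])
r_rejected = torch.tensor([0.4, 0.9, -0.2, 1.5])
print(pairwise_ranking_loss(r_chosen, r_rejected).item())
```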
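
The hyperparameters quoted in the Experiment Setup row can be read as a small configuration. The snippet below is a hedged sketch that only transcribes those values into runnable PyTorch; the stand-in model, step count, and scheduler wiring are assumptions and do not reproduce the paper's training code (the PPO stage additionally uses the Adafactor optimizer, which is not shown here).

```python
import torch

# Values transcribed from the quoted setup; everything else is a placeholder.
RM_LEARNING_RATE = 2e-5      # reward-model training with AdamW
MULTITASK_LORA_LR = 3e-5     # multitask training with LoRA
PPO_LEARNING_RATE = 1.4e-5   # RLHF policy optimization (Adafactor in the paper)
INIT_KL_COEF = 0.2           # initial KL penalty, with adaptive KL control
PRETRAIN_GRAD_COEF = 0.0     # pretraining gradient coefficient gamma

reward_model = torch.nn.Linear(768, 1)          # stand-in for the real reward model
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=RM_LEARNING_RATE)

num_training_steps = 1000                        # assumed; not stated in the quote
scheduler = torch.optim.lr_scheduler.LinearLR(   # "linear" learning-rate schedule
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=num_training_steps
)
```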