Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
Authors: Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, Tong Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and effectively alleviates the over-optimization issue in RLHF, offering a more reliable and robust preference learning paradigm. |
| Researcher Affiliation | Academia | Rui Yang1 Ruomeng Ding2 Yong Lin3 4 Huan Zhang1 Tong Zhang1 1University of Illinois Urbana-Champaign, 2Georgia Institute of Technology, 3Princeton University, 4Princeton Language and Intelligence |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Code and open-source reward models are available at https://github.com/YangRui2015/Generalizable-Reward-Model |
| Open Datasets | Yes | For training reward models, we leverage the Unified-Feedback dataset, which stands as one of the largest collections of pairwise feedback datasets. In Section 5.1, we train all reward models on subsets of 400K and 40K samples from the Unified-Feedback dataset and evaluate them on the held-out 8K eval set. In addition, for evaluating model performance on out-of-distribution (OOD) preference data, we utilize datasets such as HHH-Alignment [44], MT-Bench Human Judgements [45], and RewardBench [46]. |
| Dataset Splits | Yes | Specifically, we reserve 1% of the training set for validation (e.g., 4K for 400K training data) and find that 2 epochs are sufficient for reward modeling with LoRA in our setting. (A minimal split sketch appears below the table.) |
| Hardware Specification | Yes | We use an NVIDIA RTX A6000 48GB GPU for our experiments. |
| Software Dependencies | No | The paper mentions software like "transformers" [72] and "trl" [73] in Appendix B but does not provide specific version numbers for these libraries within the paper's text. |
| Experiment Setup | Yes | More details are listed in Table 6. To use the Unified-Feedback dataset, we downsample the training data from the "all" subset and use all 8K test samples for evaluation. For the HHH-Alignment dataset, we adopt the average score of all four subsets as the result. For the main experiments trained with LoRA, we truncate inputs longer than 1024 tokens for all reward models. All reward models are trained for two epochs using a learning rate of 1e-5 and a batch size of 16, with the model loaded in bf16 precision. For full-parameter training, we truncate inputs longer than 4096 tokens and train the reward model for one epoch with a learning rate of 2e-6 and a batch size of 512 (with gradient accumulation). (A hedged training-config sketch follows the table.) |
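
The downsampling and 1% validation split described in the Open Datasets and Dataset Splits rows can be reproduced with the Hugging Face `datasets` library. This is a minimal sketch under stated assumptions: the Hub path `llm-blender/Unified-Feedback`, the subset name `all`, and the random seed are not given in the rows above and are placeholders here.

```python
from datasets import load_dataset

# Assumed Hub path and subset name; the table above only names "Unified-Feedback".
raw = load_dataset("llm-blender/Unified-Feedback", "all", split="train")

# Downsample to 400K training pairs (40K for the smaller setting); the seed is illustrative.
raw = raw.shuffle(seed=42).select(range(400_000))

# Reserve 1% of the training set for validation (4K of 400K), as stated in the paper.
splits = raw.train_test_split(test_size=0.01, seed=42)
train_set, val_set = splits["train"], splits["test"]
```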
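
The LoRA reward-model settings quoted in the Experiment Setup row (two epochs, learning rate 1e-5, batch size 16, bf16, 1024-token truncation) can be expressed with `trl`'s reward-modeling utilities, which the paper cites in Appendix B. The sketch below is illustrative rather than the authors' implementation: the base model, the LoRA rank/alpha, and the exact `RewardTrainer` argument names (which vary across unpinned `trl` versions) are assumptions, and the released repository should be treated as authoritative.

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder; not confirmed by the table above
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model, num_labels=1, torch_dtype=torch.bfloat16
)

# LoRA rank/alpha/dropout are not quoted above; the values here are illustrative.
peft_config = LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32, lora_dropout=0.05)

args = RewardConfig(
    output_dir="rm-lora",
    num_train_epochs=2,              # two epochs
    learning_rate=1e-5,              # LoRA learning rate
    per_device_train_batch_size=16,  # batch size 16 (assumed per device)
    bf16=True,                       # bf16 precision
    max_length=1024,                 # truncate inputs longer than 1024 tokens
)

trainer = RewardTrainer(
    model=model,
    args=args,
    train_dataset=train_set,     # pairwise data with "chosen"/"rejected" columns
    eval_dataset=val_set,        # the 1% validation split from the sketch above
    processing_class=tokenizer,  # named `tokenizer=` in older trl releases
    peft_config=peft_config,
)
trainer.train()
```

Recent `trl` releases tokenize `chosen`/`rejected` text columns internally, so this assumes the Unified-Feedback pairs have already been mapped to those column names.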