Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

Authors: Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, Tong Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and effectively alleviates the over-optimization issue in RLHF, offering a more reliable and robust preference learning paradigm. (A hedged sketch of this regularization objective follows the table.)
Researcher Affiliation | Academia | Rui Yang (1), Ruomeng Ding (2), Yong Lin (3,4), Huan Zhang (1), Tong Zhang (1); (1) University of Illinois Urbana-Champaign, (2) Georgia Institute of Technology, (3) Princeton University, (4) Princeton Language and Intelligence
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as "Pseudocode" or "Algorithm".
Open Source Code | Yes | Code and open-source reward models are available at https://github.com/YangRui2015/Generalizable-Reward-Model
Open Datasets | Yes | For training reward models, we leverage the Unified-Feedback dataset, which stands as one of the largest collections of pairwise feedback datasets. In Section 5.1, we train all reward models on subsets of 400K and 40K samples from the Unified-Feedback dataset and evaluate them on the held-out 8K eval set. In addition, for evaluating model performance on out-of-distribution (OOD) preference data, we utilize datasets such as HHH-Alignment [44], MT-Bench Human Judgements [45], and RewardBench [46]. (A data-loading sketch follows the table.)
Dataset Splits | Yes | Specifically, we reserve 1% of the training set for validation (e.g., 4K for 400K training samples) and find that 2 epochs are sufficient for reward modeling with LoRA in our setting.
Hardware Specification | Yes | We use NVIDIA RTX A6000 48GB GPUs for our experiments.
Software Dependencies | No | The paper mentions software such as "transformers" [72] and "trl" [73] in Appendix B but does not provide specific version numbers for these libraries.
Experiment Setup | Yes | More details are listed in Table 6. To use the Unified-Feedback dataset, we downsample the training data from the "all" subset and use all 8K test samples for evaluation. For the HHH-Alignment dataset, we adopt the average score of all four subsets as the result. For the main experiments trained with LoRA, we truncate inputs longer than 1024 tokens for all reward models. All reward models are trained for two epochs with a learning rate of 1e-5 and a batch size of 16. We load the models in bf16 precision. For full-parameter training, we truncate inputs longer than 4096 tokens and train the reward model for one epoch with a learning rate of 2e-6 and a batch size of 512 (with gradient accumulation). (A configuration sketch follows the table.)
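
The regularization referenced in the Research Type row combines a standard pairwise (Bradley-Terry) reward loss with a text-generation loss computed from the same hidden states through a retained language-model head. The sketch below is a minimal illustration of one such variant (an SFT-style next-token term), not the authors' released implementation; the function and argument names and the mixing weight `alpha` are assumptions.

```python
import torch.nn.functional as F

def grm_style_loss(hidden_chosen, hidden_rejected, reward_head, lm_head,
                   chosen_labels, alpha=0.01):
    """Pairwise reward loss plus a hidden-state text-generation regularizer.

    hidden_chosen / hidden_rejected: (B, T, H) final-layer hidden states.
    reward_head: linear map H -> 1; lm_head: linear map H -> vocab size.
    chosen_labels: (B, T) token ids of the chosen response, -100 on prompt tokens.
    """
    # Scalar reward read off the last hidden state of each sequence.
    r_chosen = reward_head(hidden_chosen[:, -1, :]).squeeze(-1)
    r_rejected = reward_head(hidden_rejected[:, -1, :]).squeeze(-1)

    # Bradley-Terry preference loss: score the chosen response above the rejected one.
    reward_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Regularizer: next-token prediction on the chosen response, computed from the
    # same hidden states through the language-model head.
    logits = lm_head(hidden_chosen)                                # (B, T, V)
    reg_loss = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        chosen_labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return reward_loss + alpha * reg_loss
```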
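
The data pipeline described in the Open Datasets and Dataset Splits rows can be approximated as follows. This is a sketch under assumptions: the Hugging Face hub ID and subset name for Unified-Feedback are not verified against the authors' scripts, and the seed is arbitrary.

```python
from datasets import load_dataset

# Unified-Feedback pairwise preference data; the hub ID and "all" subset are
# assumptions, check the authors' repository for the exact source.
raw = load_dataset("llm-blender/Unified-Feedback", "all", split="train")

# Downsample to the 400K-example training subset used in the main experiments
# (the paper also trains on a 40K subset).
train = raw.shuffle(seed=0).select(range(400_000))

# Reserve 1% of the training set for validation (4K for 400K training samples).
split = train.train_test_split(test_size=0.01, seed=0)
train_set, val_set = split["train"], split["test"]
```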
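
The hyperparameters quoted in the Experiment Setup row map onto a trl/peft configuration along these lines; argument names follow recent trl releases, and the LoRA rank and alpha are assumptions not stated in the excerpt.

```python
from peft import LoraConfig
from trl import RewardConfig

# LoRA adapter for the sequence-classification (reward) head; rank/alpha are assumptions.
peft_config = LoraConfig(task_type="SEQ_CLS", r=32, lora_alpha=64)

# LoRA training setup quoted above: 2 epochs, lr 1e-5, batch size 16, bf16, 1024-token inputs.
training_args = RewardConfig(
    output_dir="grm-lora",
    num_train_epochs=2,
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    bf16=True,
    max_length=1024,
)

# Full-parameter training instead uses one epoch, lr 2e-6, 4096-token inputs, and an
# effective batch size of 512 via gradient accumulation.
```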