Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Information-Theoretic Reward Decomposition for Generalizable RLHF
Authors: Liyuan Mao, Haoran Xu, Amy Zhang, Weinan Zhang, Chenjia Bai
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate the significance of identifying and utilizing prompt-free rewards to guide reward learning from two perspectives. We first illustrate that, for some manually crafted datasets, the extracted prompt-free reward reflects the reward model's preference bias and the prioritization effectively aids in reward learning. Subsequently, based on some commonly used open-source preference datasets, we evaluate the trained reward model using direct metrics (reward model accuracy) and indirect metrics (performance of the induced policy). |
| Researcher Affiliation | Collaboration | Liyuan Mao¹, Haoran Xu², Amy Zhang², Weinan Zhang¹, Chenjia Bai³; ¹Shanghai Jiao Tong University, ²UT Austin, ³Institute of Artificial Intelligence, China Telecom |
| Pseudocode | Yes | Due to limited space, we provide the pseudo-code of the binary search process in Appendix C.1 (Alg. 1). Note that such an algorithm doesn't require any extra parameterized reward models. |
| Open Source Code | No | Code will be released in the future. |
| Open Datasets | Yes | We first illustrate that, for some manually crafted datasets, the extracted prompt-free reward reflects the reward model's preference bias and the prioritization effectively aids in reward learning. Subsequently, based on some commonly used open-source preference datasets, we evaluate the trained reward model using direct metrics (reward model accuracy) and indirect metrics (performance of the induced policy). For example, SHP [14], RLHFlow preference dataset (https://huggingface.co/datasets/RLHFlow/pair_preference_model_dataset), UltraFeedback [9], AlpacaEval-2 [13], MT-Bench [41]. |
| Dataset Splits | Yes | We construct a length-biased dataset D_bias that contains preference pairs with 80% chosen-longer responses (i.e., \|y_w\| > \|y_l\|) and 20% chosen-shorter responses. The details of the construction are given in Appendix E.4. For the training data, we use a randomly sampled subset of 300K examples from the RLHFlow preference dataset (originally 700K). |
| Hardware Specification | Yes | Experiments are run on Nvidia A100(40G) GPUs. For reward training, we use 8*A100(40G) GPUs so that the training can be finished in 12 hours. For DPO training and response generation, we also use 8*A100(40G) GPUs, and the time consumption is similar to the reward training. |
| Software Dependencies | No | Our implementation is based on the OpenRLHF [16] framework, which uses the Apache-2.0 license. We use DeepSpeed [2] as the framework of parallelization. For the implementation of RRM, we reference its official code in RLHFlow. No specific version numbers for these software packages are provided. |
| Experiment Setup | Yes | We use an AdamW optimizer with a learning rate of 5e-7 and 9e-6 for the DPO policy and the reward model, respectively. For other general hyperparameters, we follow the default parameters in OpenRLHF. Our exclusive hyperparameters include clustering methods and EMA weights for r2 thresholding. In practice, we tried Otsu's method and K-means as the clustering method and 0.8, 0.9 for the EMA weight. |
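
The length-biased split quoted in the Dataset Splits row (80% chosen-longer, 20% chosen-shorter pairs) can be illustrated with a minimal sketch. The paper's own construction is given in its Appendix E.4 and is not reproduced here; the function name `build_length_biased`, the record fields `prompt`/`chosen`/`rejected`, and the use of whitespace word count as the length measure are all assumptions made for illustration.

```python
import random


def word_len(text):
    # Assumed length measure: whitespace-separated word count.
    return len(text.split())


def build_length_biased(pairs, longer_pct=80, seed=0):
    """Subsample preference pairs so that `longer_pct` percent of them
    have a chosen response longer than the rejected one.

    `pairs` is a list of dicts with hypothetical keys
    'prompt', 'chosen', and 'rejected'. Equal-length pairs are dropped.
    """
    rng = random.Random(seed)
    longer = [p for p in pairs if word_len(p["chosen"]) > word_len(p["rejected"])]
    shorter = [p for p in pairs if word_len(p["chosen"]) < word_len(p["rejected"])]

    # Largest total size that both pools can supply at the target ratio
    # (integer arithmetic avoids floating-point rounding surprises).
    n = min(len(longer) * 100 // longer_pct,
            len(shorter) * 100 // (100 - longer_pct))
    n_longer = n * longer_pct // 100
    n_shorter = n - n_longer

    subset = rng.sample(longer, n_longer) + rng.sample(shorter, n_shorter)
    rng.shuffle(subset)
    return subset
```

With 20 chosen-longer and 10 chosen-shorter candidate pairs, this sketch keeps all 20 of the former and samples 5 of the latter, giving the 80/20 ratio described in the row above.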