Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Information-Theoretic Reward Decomposition for Generalizable RLHF

Authors: Liyuan Mao, Haoran Xu, Amy Zhang, Weinan Zhang, Chenjia Bai

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we demonstrate the significance of identifying and utilizing prompt-free rewards to guide reward learning from two perspectives. We first illustrate that, for some manually crafted datasets, the extracted prompt-free reward reflects the reward model's preference bias and the prioritization effectively aids in reward learning. Subsequently, based on some commonly used open-source preference datasets, we evaluate the trained reward model using direct metrics (reward model accuracy) and indirect metrics (performance of the induced policy).
Researcher Affiliation	Collaboration	Liyuan Mao1, Haoran Xu2, Amy Zhang2, Weinan Zhang1, Chenjia Bai3; 1Shanghai Jiao Tong University, 2UT Austin, 3Institute of Artificial Intelligence, China Telecom
Pseudocode	Yes	Due to limited space, we provide the pseudo-code of the binary search process in Appendix C.1 (Alg. 1). Note that such an algorithm doesn't require any extra parameterized reward models.
Open Source Code	No	Code will be released in the future.
Open Datasets	Yes	We first illustrate that, for some manually crafted datasets, the extracted prompt-free reward reflects the reward model's preference bias and the prioritization effectively aids in reward learning. Subsequently, based on some commonly used open-source preference datasets, we evaluate the trained reward model using direct metrics (reward model accuracy) and indirect metrics (performance of the induced policy). For example, SHP [14], RLHFlow preference dataset (https://huggingface.co/datasets/RLHFlow/pair_preference_model_dataset), Ultrafeedback [9], AlpacaEval-2 [13], MT-Bench [41].
Dataset Splits	Yes	We construct a length-biased dataset D_bias that contains preference pairs with 80% chosen-longer responses (i.e. |y_w| > |y_l|) and 20% chosen-shorter responses. The details of the construction are given in Appendix E.4. For the training data, we use a randomly sampled subset of 300K examples from the RLHFlow preference dataset (originally 700K).
Hardware Specification	Yes	Experiments are run on Nvidia A100 (40G) GPUs. For reward training, we use 8×A100 (40G) GPUs so that the training can be finished in 12 hours. For DPO training and response generation, we also use 8×A100 (40G) GPUs, and the time consumption is similar to the reward training.
Software Dependencies	No	Our implementation is based on the OpenRLHF [16] framework, which uses Apache-2.0 license. We use Deepspeed [2] as the framework of parallelization. For the implementation of RRM, we reference its official code in RLHFlow. No specific version numbers for these software packages are provided.
Experiment Setup	Yes	We use an AdamW optimizer with a learning rate of 5e-7 and 9e-6 for the DPO policy and the reward model, respectively. For other general hyperparameters, we follow the default parameters in OpenRLHF. Our exclusive hyperparameters include clustering methods and EMA weights for r2 thresholding. In practice, we tried Otsu's method and K-means for clustering method and 0.8, 0.9 for EMA weight.
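
The Experiment Setup row names Otsu's method as a clustering option and 0.8 or 0.9 as the EMA weight for thresholding, but this page does not say how the paper combines them. The sketch below is only an illustration of the two generic techniques, assuming a threshold is computed per batch of scalar values and then smoothed across batches. The function names `otsu_threshold` and `ema_update`, the bin count, and the sample values are hypothetical and are not taken from the paper or from OpenRLHF.

```python
from typing import Optional, Sequence


def otsu_threshold(values: Sequence[float], num_bins: int = 64) -> float:
    """Return the histogram cut that maximizes between-class variance (Otsu's method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo  # all values identical: no meaningful split
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(num_bins)]
    total = len(values)
    total_sum = sum(c * m for c, m in zip(counts, centers))

    best_score, best_cut = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(num_bins - 1):
        w0 += counts[i]
        sum0 += counts[i] * centers[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so this cut is not a split
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score = score
            best_cut = lo + (i + 1) * width  # right edge of bin i
    return best_cut


def ema_update(previous: Optional[float], current: float, weight: float = 0.9) -> float:
    """Exponential moving average; `weight` is the share kept from the previous value."""
    if previous is None:
        return current
    return weight * previous + (1.0 - weight) * current


if __name__ == "__main__":
    # Assumed usage: smooth a per-batch Otsu threshold across batches.
    batches = [
        [0.10, 0.20, 0.15, 0.12, 0.90, 0.95, 0.85, 1.00],
        [0.05, 0.10, 0.20, 0.80, 0.90, 0.95],
    ]
    smoothed = None
    for batch in batches:
        smoothed = ema_update(smoothed, otsu_threshold(batch), weight=0.9)
        print(f"smoothed threshold: {smoothed:.4f}")
```

A larger EMA weight makes the threshold react more slowly to a single unusual batch, which is the usual trade-off between the two reported settings of 0.8 and 0.9.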