Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Rectifying Shortcut Behaviors in Preference-based Reward Learning
Authors: Wenqian Ye, Guangtao Zheng, Aidong Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results in several benchmarks show that our method consistently improves the accuracy of the reward model on diverse out-of-distribution tasks and reduces the dependency on shortcuts in downstream policy models, establishing a robust framework for preference-based alignment. |
| Researcher Affiliation | Collaboration | Wenqian Ye (University of Virginia), Guangtao Zheng (Accenture), Aidong Zhang (University of Virginia) |
| Pseudocode | Yes | Algorithm 1 Fallback Function for LLM-as-a-Judge |
| Open Source Code | Yes | We provide an anonymous link to our code in the footnote of the abstract. |
| Open Datasets | Yes | We use our proposed method to train reward models on a mixture of preference datasets collected by the RLHFlow framework [35]. It combines 8 popular open-source preference datasets, each containing preference triplets in the form of (prompt, chosen response, rejected response) defined in Section 2. These datasets have been widely used to train a series of strong open-source preference language models. Although some of the datasets (e.g., HelpSteer [36]) provide fine-grained attributes of training samples, in our setting, we do not use these attributes during training to reflect a real-world setting where such auxiliary information is not available. More details of the training data are deferred to the Appendix. ... The RLHFlow training dataset [35] used in our experiments integrates multiple open-source preference datasets, each selected to cover diverse preference scenarios and annotation methods. Specifically, the dataset includes general conversational preference data, such as HH-RLHF [64], consisting of human-annotated conversational pairs; SHP [65], containing community-driven Reddit interactions; and HelpSteer [36], featuring prompts evaluated on various human-assessed criteria (e.g., helpfulness, coherence). Additionally, the dataset comprises task-specific data: PKU-SafeRLHF [66] provides expert-annotated safety and helpfulness comparisons; UltraFeedback [43] offers GPT-4 annotations focusing on instruction-following and truthfulness across diverse models; and UltraInteract [67] contributes complex reasoning tasks structured into preference trees with detailed annotations. Finally, multi-turn conversational datasets like Distilabel-Capybara [68] and Distilabel-Orca [69] further enrich the training set with GPT-4 annotated dialogue preferences originating from distinct prompt collections. |
| Dataset Splits | No | The paper mentions using a 'mixture of preference datasets' for training and evaluating on 'out-of-distribution benchmarks' (RewardBench and RM-Bench), but it does not specify explicit training/test/validation splits (e.g., percentages, sample counts, or defined custom splits) for the combined training data or how it was partitioned for the experiments. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A6000 GPUs. |
| Software Dependencies | No | The paper mentions software like Hugging Face, DeepSpeed, GPT-4o models, and LangChain APIs, but it does not provide specific version numbers for any of these components. |
| Experiment Setup | Yes | We do not specifically tune the two regularization hyperparameters λ1 and λ2, and instead adopt a curriculum learning paradigm [37] by linearly increasing them from 0.01 to 0.1 over the first half of the training process and then decreasing them to 0.06 by the end. We use a learning rate of 2 × 10⁻⁶ with a cosine annealing scheduler and a warmup phase covering 3% of the total training steps. |
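
The schedules quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names `reg_weight` and `learning_rate` are hypothetical, and it assumes that the decrease of λ1 and λ2 in the second half is linear, that the warmup is linear from zero, and that the cosine schedule decays to zero. The quoted text does not specify any of these three details.

```python
import math


def reg_weight(step, total_steps, start=0.01, peak=0.1, end=0.06):
    """Curriculum value shared by the regularization weights at a given step.

    Linear increase from `start` to `peak` over the first half of training,
    then a decrease to `end` (assumed linear here) over the second half.
    """
    half = total_steps / 2
    if step <= half:
        return start + (peak - start) * (step / half)
    return peak + (end - peak) * ((step - half) / (total_steps - half))


def learning_rate(step, total_steps, base_lr=2e-6, warmup_frac=0.03):
    """Cosine-annealed learning rate with a warmup over 3% of the steps.

    Assumptions: linear warmup starting at zero, cosine decay ending at zero.
    """
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of training steps
    for s in (0, 15, 30, 500, 750, 1000):
        print(s, reg_weight(s, total), learning_rate(s, total))
```

Under these assumptions, the regularization weight peaks at the midpoint of training while the learning rate is already well into its cosine decay, so the strongest regularization is applied at a moderate step size.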