Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Rectifying Shortcut Behaviors in Preference-based Reward Learning

Authors: Wenqian Ye, Guangtao Zheng, Aidong Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results in several benchmarks show that our method consistently improves the accuracy of the reward model on diverse out-of-distribution tasks and reduces the dependency on shortcuts in downstream policy models, establishing a robust framework for preference-based alignment.
Researcher Affiliation | Collaboration | Wenqian Ye (University of Virginia), Guangtao Zheng (Accenture), Aidong Zhang (University of Virginia)
Pseudocode | Yes | Algorithm 1: Fallback Function for LLM-as-a-Judge
Open Source Code | Yes | We provide an anonymous link to our code in the footnote of the abstract.
Open Datasets | Yes | We use our proposed method to train reward models on a mixture of preference datasets collected by the RLHFlow framework [35]. It combines 8 popular open-source preference datasets, each containing preference triplets in the form of (prompt, chosen response, rejected response) defined in Section 2. These datasets have been widely used to train a series of strong open-source preference language models. Although some of the datasets (e.g., HelpSteer [36]) provide fine-grained attributes of training samples, in our setting, we do not use these attributes during training to reflect a real-world setting where such auxiliary information is not available. More details of the training data are deferred to the Appendix. ... The RLHFlow training dataset [35] used in our experiments integrates multiple open-source preference datasets, each selected to cover diverse preference scenarios and annotation methods. Specifically, the dataset includes general conversational preference data, such as HH-RLHF [64], consisting of human-annotated conversational pairs; SHP [65], containing community-driven Reddit interactions; and HelpSteer [36], featuring prompts evaluated on various human-assessed criteria (e.g., helpfulness, coherence). Additionally, the dataset comprises task-specific data: PKU-SafeRLHF [66] provides expert-annotated safety and helpfulness comparisons; UltraFeedback [43] offers GPT-4 annotations focusing on instruction-following and truthfulness across diverse models; and UltraInteract [67] contributes complex reasoning tasks structured into preference trees with detailed annotations. Finally, multi-turn conversational datasets like Distilabel-Capybara [68] and Distilabel-Orca [69] further enrich the training set with GPT-4 annotated dialogue preferences originating from distinct prompt collections.
Dataset Splits | No | The paper mentions using a 'mixture of preference datasets' for training and evaluating on 'out-of-distribution benchmarks' (RewardBench and RM-Bench), but it does not specify explicit training/test/validation splits (e.g., percentages, sample counts, or defined custom splits) for the combined training data or how it was partitioned for the experiments.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A6000 GPUs.
Software Dependencies | No | The paper mentions software like Hugging Face, DeepSpeed, GPT-4o models, and LangChain APIs, but it does not provide specific version numbers for any of these components.
Experiment Setup | Yes | We do not specifically tune the two regularization hyperparameters λ1 and λ2, and instead adopt a curriculum learning paradigm [37] by linearly increasing them from 0.01 to 0.1 over the first half of the training process and then decreasing them to 0.06 by the end. We use a learning rate of 2 × 10⁻⁶ with a cosine annealing scheduler and a warmup phase covering 3% of the total training steps.
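
The schedules quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative reading of the quoted text, not the authors' code. The function names `lambda_schedule` and `learning_rate` are hypothetical. The quoted passage gives only the endpoints (0.01, 0.1, 0.06), the base learning rate, cosine annealing, and the 3% warmup fraction; the linear shape of the second-half decay, the linear warmup, and the cosine decay to zero are assumptions.

```python
import math


def lambda_schedule(step, total_steps, start=0.01, peak=0.1, end=0.06):
    """Regularization weight (used for both lambda_1 and lambda_2) at a given step.

    First half of training: linear increase from `start` to `peak` (as quoted).
    Second half: decrease from `peak` to `end`; a linear shape is ASSUMED here.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    if progress <= 0.5:
        return start + (peak - start) * (progress / 0.5)
    return peak + (end - peak) * ((progress - 0.5) / 0.5)


def learning_rate(step, total_steps, base_lr=2e-6, warmup_frac=0.03):
    """Cosine-annealed learning rate with a warmup over 3% of the steps.

    ASSUMPTIONS: the warmup is linear and the cosine phase decays to zero;
    the quoted text does not state either detail.
    """
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


if __name__ == "__main__":
    total = 1000  # hypothetical number of training steps
    for s in (0, 250, 500, 750, 1000):
        print(s, lambda_schedule(s, total), learning_rate(s, total))
```

Expressing both schedules as pure functions of the step index makes it easy to check the endpoints against the reported values before wiring them into a training loop.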