Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

Authors: Timo Kaufmann, Yannick Metz, Daniel Keim, Eyke Hüllermeier

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically demonstrate ResponseRank's improved accuracy and generalization in multiple domains. (3) We introduce the Pearson Distance Correlation (PDC), a novel metric designed to quantify how well a model captures preference strength. In this section, we evaluate ResponseRank on synthetic pairwise comparison datasets with known ground-truth utility differences and simulated response times. We compare against baselines on metrics measuring preference strength learning (PDC) and ordinal accuracy under controlled conditions. To validate our method's applicability to real-world language modeling data, we conducted experiments on the MultiPref preference dataset [15]. To assess this, we conducted additional experiments in the control domain using simulated episode returns as strength signals.
Researcher Affiliation	Academia	Timo Kaufmann LMU Munich, MCML EMAIL Yannick Metz University of Konstanz EMAIL Daniel Keim University of Konstanz EMAIL Eyke Hüllermeier LMU Munich, MCML, DFKI EMAIL
Pseudocode	No	The paper describes the ResponseRank method steps conceptually and illustrates them in Figure 1, but it does not contain a formally labeled 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code	Yes	The source code for our experiments is available at https://github.com/timokau/response-rank.
Open Datasets	Yes	To validate our method's applicability to real-world language modeling data, we conducted experiments on the MultiPref preference dataset [15]. We tested three MuJoCo environments (HalfCheetah, Swimmer, Walker2d) and Highway merge-v0, comparing the BT baseline with ResponseRank. MultiPref dataset (version 1.0) [15]: https://huggingface.co/datasets/allenai/multipref, licensed under ODC-BY. MuJoCo v5 environments (HalfCheetah, Swimmer, Walker2d) [37]: https://github.com/google-deepmind/mujoco, licensed under Apache 2.0. highway-env merge-v0 [38]: https://github.com/Farama-Foundation/HighwayEnv, licensed under MIT.
Dataset Splits	Yes	We preprocess the dataset by filtering samples where one of the compared texts exceeds the maximum sequence length of 1,024 tokens resulting in a final dataset of 9,846 samples. We shuffle this filtered dataset and then split 2,000 samples off as a test set, resulting in distinct splits for different random seeds. We generate a synthetic dataset of pairwise comparisons annotated with a choice label and a response time...Each trial uses an independently generated dataset consisting of 50 training examples, 200 test examples, and 20 features per example. We collect a dataset of trajectories...Then we randomly sample 5000 segment pairs of length 50...Out of these 5000 pairs, 4000 are used as the reward model training set, and 1000 samples are used as the validation set.
Hardware Specification	Yes	We conducted the synthetic experiments on a single compute node with 8 CPU cores and 32 GB of RAM. The MultiPref experiments were run on A100 and H100 GPUs...The RL experiments were conducted on a compute cluster. Steps 1, 2, and 4 were conducted on CPU nodes, with each individual run being allocated 2 CPU cores and 16 GB RAM. Reward model training was performed on nodes with access to an A100 and H100 GPU shared across four parallel runs.
Software Dependencies	No	The paper mentions software like Stable-Baselines3, PPO, and AdamW, but it does not specify explicit version numbers for these or other key software components, which is necessary for reproducible software dependencies.
Experiment Setup	Yes	We run 100 trials per condition...Each model is trained for 200 gradient steps (AdamW optimizer, no early stopping, learning rate 0.001, weight decay 0.01). Table 7: Hyperparameters and training configuration for MultiPref experiments. We use the same hyperparameters for BT and all RR variants. Learning Rate 15e-6, LR scheduler Linear with 0.05 warmup ratio, Gradient Clipping 1.0, Weight Decay 0.1, AdamW optimizer settings (β1, β2, ϵ) (0.9, 0.999, 1e-08), On-Device Batch Size 16, Gradient Accumulation Steps 4, Max Sequence Length (tokens) 1024, Dropout no, Precision bfloat16, Epochs 3. We optimize with AdamW [42] with a learning rate of 1e-5, weight decay enabled, batch size of 16, and early stopping on a validation holdout set (patience=5).
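
The MultiPref split quoted under Dataset Splits (drop pairs where either text exceeds 1,024 tokens, shuffle, then hold out 2,000 samples, with a different split per random seed) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function name split_filtered and the field names len_a and len_b are hypothetical, and token lengths are assumed to be precomputed.

```python
import random


def split_filtered(samples, max_len=1024, test_size=2000, seed=0):
    """Filter over-long pairs, shuffle with a seed, and split off a test set.

    `samples` is a list of dicts. The keys "len_a" and "len_b" are
    hypothetical names for the token counts of the two compared texts.
    """
    # Keep a pair only if neither text exceeds the maximum sequence length.
    kept = [s for s in samples if max(s["len_a"], s["len_b"]) <= max_len]

    # A seeded generator makes the split reproducible for a given seed,
    # while different seeds yield different train/test partitions.
    rng = random.Random(seed)
    rng.shuffle(kept)

    test = kept[:test_size]
    train = kept[test_size:]
    return train, test
```

Calling split_filtered(data, seed=s) for several values of s would give one train/test partition per seed, which matches the "distinct splits for different random seeds" wording in the row above.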