Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Pairwise Calibrated Rewards for Pluralistic Alignment

Authors: Daniel Halpern, Evi Micha, Ariel D Procaccia, Itai Shapira

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we introduce and validate a practical training heuristic to learn such ensembles, and demonstrate its effectiveness through improved calibration, implying a more faithful representation of pluralistic values. ... We focus on two questions: (i) Can a small ensemble of weak reward models match the observed vote fractions more accurately than any single reward model? (ii) Do the individual reward models in the ensemble capture distinct preference patterns rather than duplicating one another? For both, we find positive results. ... Our results (Figure 2) show that, in many cases, an ensemble of only 2-4 such rewards already achieves noticeably better calibration on held-out prompts.
Researcher Affiliation	Academia	(1) Harvard University; (2) University of Southern California
Pseudocode	No	Instead, we propose a heuristic approach based on forward stagewise additive modeling (FSAM) (see Hastie et al. [44] for an overview) that decomposes the problem into a sequence of more tractable subproblems. At each step, we fit a new reward model to the current residual error of the ensemble, keeping previously learned models fixed; pick a mixing weight that best reduces that error, and append the new model to the mixture.
Open Source Code	No	Answer: [No] Justification: The code uses standard reward modeling training, with all details included in Appendix F. We'd be happy to share the full code if needed.
Open Datasets	Yes	We use four public datasets that satisfy these requirements and exhibit annotator disagreement: MultiPref [59], PersonalLLM [60], HelpSteer2 [61], and Reddit TL;DR [62].
Dataset Splits	Yes	For datasets without an official validation split, we place 10% of prompts into a test set, ensuring that no prompt appears in both splits.
Hardware Specification	Yes	Training is conducted with BF16 precision on a single NVIDIA H100 GPU, utilizing gradient accumulation to accommodate an effective batch size of up to 512.
Software Dependencies	No	We fine-tune all model parameters, including both the base transformer and the final linear reward head, using the SOAP optimizer [77], which we found to accelerate training compared to AdamW. ... We fit an ensemble of k=8 reward models on each dataset, starting from supervised fine-tuned checkpoints of Meta-Llama3-8B [63].
Experiment Setup	Yes	We train for only a single epoch, generally sufficient to achieve convergence without overfitting, as demonstrated by previous reward-model training studies [62, 64, 54, 78, 79, 53, 56]. Training is conducted with BF16 precision on a single NVIDIA H100 GPU, utilizing gradient accumulation to accommodate an effective batch size of up to 512. We adopt learning rates in the range {1e-5, ..., 5e-5} with a cosine decay schedule, a linear warmup spanning the first 3% of training steps, and weight decay set to 0.1.
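
The learning-rate schedule quoted in the Experiment Setup row (linear warmup over the first 3% of steps, then cosine decay) can be illustrated with a short sketch. This is not the authors' code: the function name `lr_at_step`, its parameters, the 0-indexed step convention, and the decay floor `min_lr=0.0` are all assumptions made for illustration. Only the 3% warmup fraction and the 1e-5 to 5e-5 peak range come from the quoted text.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.03, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step (hypothetical helper).

    Assumes linear warmup to peak_lr, then cosine decay toward min_lr.
    min_lr=0.0 is an assumption; the quoted setup does not state a floor.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")

    # Number of warmup steps: 3% of training by default, at least one step.
    warmup_steps = max(1, round(warmup_frac * total_steps))

    if step < warmup_steps:
        # Linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps

    # Cosine decay over the remaining steps, clamped at the end of training.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print a few points of the schedule for a 1000-step run at the top of the range.
    for s in (0, 29, 30, 500, 999):
        print(s, lr_at_step(s, total_steps=1000, peak_lr=5e-5))
```

In an actual training loop this kind of function would typically be wrapped in the framework's scheduler interface and advanced once per optimizer update, so with gradient accumulation it steps once per effective batch rather than once per micro-batch.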