Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Pairwise Calibrated Rewards for Pluralistic Alignment

Authors: Daniel Halpern, Evi Micha, Ariel D Procaccia, Itai Shapira

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we introduce and validate a practical training heuristic to learn such ensembles, and demonstrate its effectiveness through improved calibration, implying a more faithful representation of pluralistic values. ... We focus on two questions: (i) Can a small ensemble of weak reward models match the observed vote fractions more accurately than any single reward model? (ii) Do the individual reward models in the ensemble capture distinct preference patterns rather than duplicating one another? For both, we find positive results. ... Our results (Figure 2) show that, in many cases, an ensemble of only 2-4 such rewards already achieves noticeably better calibration on held-out prompts.
Researcher Affiliation | Academia | (1) Harvard University, (2) University of Southern California
Pseudocode | No | Instead, we propose a heuristic approach based on forward stagewise additive modeling (FSAM) (see Hastie et al. [44] for an overview) that decomposes the problem into a sequence of more tractable subproblems. At each step, we fit a new reward model to the current residual error of the ensemble, keeping previously learned models fixed; pick a mixing weight that best reduces that error, and append the new model to the mixture.
Open Source Code | No | Answer: [No] Justification: The code uses standard reward modeling training, with all details included in Appendix F. We'd be happy to share the full code if needed.
Open Datasets | Yes | We use four public datasets that satisfy these requirements and exhibit annotator disagreement: MultiPref [59], PersonalLLM [60], HelpSteer2 [61], and Reddit TL;DR [62].
Dataset Splits | Yes | For datasets without an official validation split, we place 10% of prompts into a test set, ensuring that no prompt appears in both splits.
Hardware Specification | Yes | Training is conducted with BF16 precision on a single NVIDIA H100 GPU, utilizing gradient accumulation to accommodate an effective batch size of up to 512.
Software Dependencies | No | We fine-tune all model parameters, including both the base transformer and the final linear reward head, using the SOAP optimizer [77], which we found to accelerate training compared to AdamW. ... We fit an ensemble of k=8 reward models on each dataset, starting from supervised fine-tuned checkpoints of Meta-Llama3-8B [63].
Experiment Setup | Yes | We train for only a single epoch, generally sufficient to achieve convergence without overfitting, as demonstrated by previous reward-model training studies [62, 64, 54, 78, 79, 53, 56]. Training is conducted with BF16 precision on a single NVIDIA H100 GPU, utilizing gradient accumulation to accommodate an effective batch size of up to 512. We adopt learning rates in the range {1e-5 ... 5e-5} with a cosine decay schedule, a linear warmup spanning the first 3% of training steps, and weight decay set to 0.1.
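
The Experiment Setup row describes a learning-rate schedule with a linear warmup over the first 3% of training steps followed by cosine decay, and gradient accumulation up to an effective batch size of 512. The following is a minimal sketch of what such a schedule could look like, not the authors' implementation. The function names `lr_at_step` and `accumulation_steps`, the decay floor `min_lr=0.0`, the choice to start warmup at `peak_lr / warmup_steps`, and the micro-batch size of 8 are all assumptions made for illustration; only the 3% warmup, the cosine shape, the learning-rate range, and the effective batch size come from the quoted text.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.03, min_lr=0.0):
    """Learning rate at optimizer step `step` (0-indexed).

    Linear warmup over the first `warmup_frac` of steps, then cosine decay.
    `min_lr` and the exact warmup start value are assumptions, not from the paper.
    """
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def accumulation_steps(effective_batch=512, micro_batch=8):
    """Number of micro-batches to accumulate per optimizer step.

    `micro_batch=8` is a hypothetical per-GPU batch size chosen for illustration.
    """
    if effective_batch % micro_batch != 0:
        raise ValueError("effective_batch must be a multiple of micro_batch")
    return effective_batch // micro_batch


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps in the single epoch
    for s in (0, 29, 30, 500, 999):
        print(s, lr_at_step(s, total, peak_lr=5e-5))
    print("accumulation steps:", accumulation_steps())
```

With 1000 total steps, the warmup covers the first 30 steps, after which the rate falls monotonically along a half cosine. The weight decay of 0.1 mentioned in the same row would be passed to the optimizer separately and is not modeled here.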