Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Inference-Time Reward Hacking in Large Language Models

Authors: Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio Calmon

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We first demonstrate that hedging mitigates reward hacking on standard verifiable benchmarks such as MMLU Pro and GPQA. Specifically, we use the open-source Preference Proxy Evaluations (PPE) dataset [39], which contains multiple responses per benchmark question from frontier models such as GPT-4o-mini and Claude Haiku 3. Each response is then scored by a suite of reward models. We focus on the benchmark dataset reward model pairs identified in PPE as exhibiting reward hacking under Best-of-n sampling. We then apply HedgeTune to find the optimal operating point for BoN, BoP, and SBoN. ... The observed reward hacking curve in Figure 3 matches our theoretical prediction in Theorem 1. For example, BoN demonstrates hacking on the GPQA dataset, even with the very capable Skywork-Llama 3.1 8B as a proxy reward (currently ranked the 12th best non-generative reward model on RewardBench [37]). HedgeTune recovers the best operating points in all the cases.
Researcher Affiliation	Academia	Hadi Khalaf Harvard University Claudio Mayrink Verdun Harvard University Alex Oesterling Harvard University Himabindu Lakkaraju Harvard University Flavio du Pin Calmon Harvard University
Pseudocode	Yes	Algorithm 1: Best-of-n Sampling (BoN); Algorithm 2: Soft Best-of-n Sampling (SBoN); Algorithm 3: Best-of-Poisson Sampling (BoP); Algorithm 4: HedgeTune: Parameter Optimization for Hedging
Open Source Code	Yes	We provide our code here.
Open Datasets	Yes	Specifically, we use the open-source Preference Proxy Evaluations (PPE) dataset [39]... We consider three datasets: MMLU Pro (complex reasoning) [43], MATH (mathematical problem solving) [44], and GPQA (questions in natural sciences) [45]. ... We use the tlc4418/gold_labelled_gens dataset from Coste et al. [46].
Dataset Splits	Yes	We use the tlc4418/gold_labelled_gens dataset from Coste et al. [46]. This dataset comprises 12,600 responses generated by the Pythia 1.4B base policy for each of 1,000 prompts. The prompts are sourced from the validation split of the AlpacaFarm dataset [48]. ... To train proxy reward models, we construct preference datasets using the AlpacaRM reward scores. For each training instance, we sample a pair of responses to a given prompt and label them based on their given true reward scores. We train proxy RMs on preference pair datasets of varying sizes: 10k, 20k, 46k, and 80k.
Hardware Specification	Yes	The experiments with the toy example and the verifiable rewards were done on a CPU (Apple M3 Chip). The experiments with the human-preferences were done on 1 A100 GPU with 48 GPU hours (including preliminary experiments and testing).
Software Dependencies	No	The paper mentions using a Pythia model [47] fine-tuned on the AlpacaFarm dataset and various reward models (InternLM-2 1.8B [40], Llama3-Offset-Bias 8B [41], Skywork-Llama-3.1 8B [42]), but it does not specify software dependencies with version numbers for its own experimental setup beyond general model names.
Experiment Setup	Yes	We use a 1.4B Pythia model [47] fine-tuned on AlpacaFarm dataset... We use AlpacaRM [48] as our gold reward model... We train proxy RMs on preference pair datasets of varying sizes: 10k, 20k, 46k, and 80k. ... All proxy RM training runs are repeated across 4 random seeds each. ... In line with [46, 51, 52], we simulate disagreements in human annotators by considering two cases: (a) no label noise in the preferences, and (b) 25% label noise. ... using their default hyperparameters (e.g., 10^-5 learning rate and five epochs).
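
The preference-pair construction quoted in the Dataset Splits and Experiment Setup rows (sample two responses to a prompt, label them by gold reward, optionally apply 25% label noise) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function name make_preference_pairs, its arguments, the output field names, and the choice to skip tied pairs are all assumptions, and the toy data stands in for the gold-labelled generations.

```python
import random


def make_preference_pairs(responses, gold_scores, num_pairs, noise_rate=0.0, seed=0):
    """Build (chosen, rejected) pairs labelled by gold reward, with optional label noise.

    Hypothetical helper: `responses` maps prompt -> list of response strings and
    `gold_scores` maps prompt -> list of gold reward scores in the same order.
    Assumes at least one prompt has two responses with different gold scores.
    """
    rng = random.Random(seed)
    prompts = [p for p in responses if len(responses[p]) >= 2]
    pairs = []
    while len(pairs) < num_pairs:
        prompt = rng.choice(prompts)
        # Sample two distinct responses to the same prompt.
        i, j = rng.sample(range(len(responses[prompt])), 2)
        score_i, score_j = gold_scores[prompt][i], gold_scores[prompt][j]
        if score_i == score_j:
            continue  # assumption: ties carry no preference signal, so skip them
        # Label by gold reward: the higher-scoring response is "chosen".
        chosen, rejected = (i, j) if score_i > score_j else (j, i)
        # Simulate annotator disagreement by flipping the label with prob. noise_rate.
        flipped = rng.random() < noise_rate
        if flipped:
            chosen, rejected = rejected, chosen
        pairs.append({
            "prompt": prompt,
            "chosen": responses[prompt][chosen],
            "rejected": responses[prompt][rejected],
            "chosen_score": gold_scores[prompt][chosen],
            "rejected_score": gold_scores[prompt][rejected],
            "flipped": flipped,
        })
    return pairs


# Toy data (made up for illustration).
toy_responses = {"p1": ["a", "b", "c"], "p2": ["d", "e"]}
toy_gold = {"p1": [0.1, 0.9, 0.5], "p2": [0.3, 0.7]}

clean_pairs = make_preference_pairs(toy_responses, toy_gold, num_pairs=200, noise_rate=0.0, seed=1)
noisy_pairs = make_preference_pairs(toy_responses, toy_gold, num_pairs=4000, noise_rate=0.25, seed=2)
```

Under these assumptions, the two cases in the Experiment Setup row correspond to noise_rate=0.0 and noise_rate=0.25, and varying num_pairs and seed would mimic the different dataset sizes and repeated runs.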