Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Inference-Time Reward Hacking in Large Language Models
Authors: Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio Calmon
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first demonstrate that hedging mitigates reward hacking on standard verifiable benchmarks such as MMLU Pro and GPQA. Specifically, we use the open-source Preference Proxy Evaluations (PPE) dataset [39], which contains multiple responses per benchmark question from frontier models such as GPT-4o-mini and Claude Haiku 3. Each response is then scored by a suite of reward models. We focus on the benchmark dataset-reward model pairs identified in PPE as exhibiting reward hacking under Best-of-n sampling. We then apply HedgeTune to find the optimal operating point for BoN, BoP, and SBoN. ... The observed reward hacking curve in Figure 3 matches our theoretical prediction in Theorem 1. For example, BoN demonstrates hacking on the GPQA dataset, even with the very capable Skywork-Llama-3.1 8B as a proxy reward (currently ranked the 12th best non-generative reward model on RewardBench [37]). HedgeTune recovers the best operating points in all the cases. |
| Researcher Affiliation | Academia | Hadi Khalaf Harvard University Claudio Mayrink Verdun Harvard University Alex Oesterling Harvard University Himabindu Lakkaraju Harvard University Flavio du Pin Calmon Harvard University |
| Pseudocode | Yes | Algorithm 1: Best-of-n Sampling (BoN); Algorithm 2: Soft Best-of-n Sampling (SBoN); Algorithm 3: Best-of-Poisson Sampling (BoP); Algorithm 4: HedgeTune: Parameter Optimization for Hedging |
| Open Source Code | Yes | We provide our code here. |
| Open Datasets | Yes | Specifically, we use the open-source Preference Proxy Evaluations (PPE) dataset [39]... We consider three datasets: MMLU Pro (complex reasoning) [43], MATH (mathematical problem solving) [44], and GPQA (questions in natural sciences) [45]. ... We use the tlc4418/gold_labelled_gens dataset from Coste et al. [46]. |
| Dataset Splits | Yes | We use the tlc4418/gold_labelled_gens dataset from Coste et al. [46]. This dataset comprises 12,600 responses generated by the Pythia 1.4B base policy for each of 1,000 prompts. The prompts are sourced from the validation split of the AlpacaFarm dataset [48]. ... To train proxy reward models, we construct preference datasets using the AlpacaRM reward scores. For each training instance, we sample a pair of responses to a given prompt and label them based on their given true reward scores. We train proxy RMs on preference pair datasets of varying sizes: 10k, 20k, 46k, and 80k. |
| Hardware Specification | Yes | The experiments with the toy example and the verifiable rewards were done on a CPU (Apple M3 Chip). The experiments with human preferences were done on 1 A100 GPU with 48 GPU hours (including preliminary experiments and testing). |
| Software Dependencies | No | The paper mentions using a Pythia model [47] fine-tuned on the AlpacaFarm dataset and various reward models (InternLM-2 1.8B [40], Llama3-Offset-Bias 8B [41], Skywork-Llama-3.1 8B [42]), but it does not list the software dependencies or version numbers used in its own experimental setup. |
| Experiment Setup | Yes | We use a 1.4B Pythia model [47] fine-tuned on the AlpacaFarm dataset... We use AlpacaRM [48] as our gold reward model... We train proxy RMs on preference pair datasets of varying sizes: 10k, 20k, 46k, and 80k. ... All proxy RM training runs are repeated across 4 random seeds each. ... In line with [46, 51, 52], we simulate disagreements in human annotators by considering two cases: (a) no label noise in the preferences, and (b) 25% label noise. ... using their default hyperparameters (e.g., 10^-5 learning rate and five epochs). |
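
The Pseudocode row above names three inference-time sampling schemes but does not reproduce them. As a rough illustration only, the sketch below shows one common way such schemes are written. It is not the authors' code: the function names, the `temperature` parameterization of the soft variant, and the `1 + Poisson(mu)` draw for the number of candidates are all assumptions made for this example, and the paper's Algorithms 1 to 3 may differ in detail.

```python
import math
import random
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Return the candidate with the highest proxy reward."""
    return max(candidates, key=reward)


def soft_best_of_n(
    candidates: Sequence[str],
    reward: Callable[[str], float],
    temperature: float,
    rng: random.Random,
) -> str:
    """Sample one candidate with probability proportional to exp(reward / temperature).

    Assumed parameterization: a small temperature approaches plain best-of-n,
    a large temperature approaches uniform sampling over the candidates.
    """
    scores = [reward(c) / temperature for c in candidates]
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def sample_poisson(mu: float, rng: random.Random) -> int:
    """Draw from Poisson(mu) with Knuth's multiplication method (fine for small mu)."""
    threshold = math.exp(-mu)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def best_of_poisson(
    generate: Callable[[], str],
    reward: Callable[[str], float],
    mu: float,
    rng: random.Random,
) -> str:
    """Best-of-n where n is random: n = 1 + Poisson(mu) (assumed form)."""
    n = 1 + sample_poisson(mu, rng)
    candidates = [generate() for _ in range(n)]
    return best_of_n(candidates, reward)
```

In this sketch the hedging knob is `temperature` for the soft variant and `mu` for the Poisson variant; a tuning procedure in the spirit of HedgeTune would pick these values so that the true reward, rather than the proxy reward, is highest.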