Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Strategyproof Reinforcement Learning from Human Feedback

Authors: Thomas Kleine Buening, Jiarui Gan, Debmalya Mandal, Marta Kwiatkowska

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | C Experiments: Simulating Strategic Preference Labeling. We here conduct small-scale synthetic experiments that simulate strategic preference learning and serve as a preliminary empirical evaluation of the proposed methodology. ... All results are averaged over 5 random seeds, and we report standard errors. ... Table 1: Suboptimality SubOpt(π̂) under truthful and strategic labeling across dataset sizes n.
Researcher Affiliation | Academia | Thomas Kleine Buening (ETH Zurich, ETH AI Center); Jiarui Gan (Department of Computer Science, University of Oxford); Debmalya Mandal (Department of Computer Science, University of Warwick); Marta Kwiatkowska (Department of Computer Science, University of Oxford)
Pseudocode | Yes | Algorithm 1 Pessimistic Median of MLEs (Pessimistic MoMLEs) 1: input offline preference data D = (D_1, ..., D_k) 2: for every labeler i ∈ [k] do 3: compute the MLE θ̂_i^MLE from D_i 4: construct confidence set C_i := {θ ∈ ℝ^d : ‖θ̂_i^MLE − θ‖_{Σ_{D_i}} ≤ f(d, n, δ)} 5: end for 6: get the median confidence set C := {med(θ_1, ..., θ_k) : θ_i ∈ C_i for i ∈ [k]} 7: compute the pessimistic median return w.r.t. C given by W(π) := min_{θ ∈ C} E_{s∼ρ}[⟨θ, ϕ(s, π(s))⟩] 8: return π̂(D) = argmax_{π ∈ Π} W(π)
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA]
Open Datasets | No | We simulate strategic labeling behavior by performing approximate gradient ascent (i.e., simultaneous perturbation stochastic approximation) on each labeler's utility J_i(π̂) w.r.t. the labelers' internal reward parameters θ̂_i, which govern their preference distribution P_{θ̂_i}. We adopt this simulation approach from prior work on strategic contextual bandits [23]. Each labeler is initialized at their ground-truth reward vector θ*_i, which is sampled from a multivariate Gaussian.
Dataset Splits | No | We focus on small problem settings in a contextual bandit formulation. All results are averaged over 5 random seeds, and we report standard errors. The results below are for embedding dimension d = 16, number of labelers k = 5, and offline samples n = 20, 50, 100, 200.
Hardware Specification | No | This paper does not include experiments.
Software Dependencies | No | The experimental setup in Appendix C describes the simulation methodology but does not specify any particular software, libraries, or their version numbers.
Experiment Setup | Yes | Experimental Setup. We simulate strategic labeling behavior by performing approximate gradient ascent (i.e., simultaneous perturbation stochastic approximation) on each labeler's utility J_i(π̂) w.r.t. the labelers' internal reward parameters θ̂_i, which govern their preference distribution P_{θ̂_i}. ... Labeler strategies are optimized for 200 steps. ... All results are averaged over 5 random seeds, and we report standard errors. The results below are for embedding dimension d = 16, number of labelers k = 5, and offline samples n = 20, 50, 100, 200. We compare the following approaches: (a) Naive MLEs... (b) Pessimistic Social Welfare [37]... (c) Median of MLEs... (d) Pessimistic MoMLEs...
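
The Experiment Setup entry mentions approximate gradient ascent via simultaneous perturbation stochastic approximation (SPSA) run for 200 steps. Since no code is released, the following is only a minimal sketch of a generic SPSA ascent loop, not the authors' implementation. The function name spsa_ascent, the constant step size, the perturbation scale, and the toy quadratic utility standing in for a labeler's utility J_i are all illustrative assumptions; only the 200 steps and d = 16 come from the reported setup.

```python
import numpy as np


def spsa_ascent(utility, theta0, steps=200, step_size=0.02, perturb=0.05, seed=0):
    """Approximate gradient ascent on `utility` via SPSA.

    Hypothetical helper: the step size, perturbation scale, and constant
    (non-decaying) gains are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        # Perturb all coordinates at once with a random +/-1 direction.
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        # Two utility evaluations give a finite-difference estimate along delta.
        diff = utility(theta + perturb * delta) - utility(theta - perturb * delta)
        # For +/-1 entries, dividing by delta equals multiplying by delta.
        grad_estimate = diff / (2.0 * perturb) * delta
        theta = theta + step_size * grad_estimate
    return theta


# Toy stand-in for a labeler's utility J_i (assumption): a concave quadratic
# peaked at `target`. The real utility depends on the learned policy.
d = 16  # embedding dimension used in the reported experiments
rng = np.random.default_rng(1)
target = rng.normal(size=d)
toy_utility = lambda theta: -float(np.sum((theta - target) ** 2))

theta_start = np.zeros(d)
theta_end = spsa_ascent(toy_utility, theta_start, steps=200)
```

Each step needs only two utility evaluations regardless of the dimension d, which is why SPSA suits a setting where the utility is available only through simulation. In the reported experiments each labeler starts from their ground-truth reward vector rather than from zeros as in this toy run.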