Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Capturing Individual Human Preferences with Reward Features

Authors: Andre Barreto, Vincent Dumoulin, Yiran Mao, Mark Rowland, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present experiments with large language models illustrating our theoretical results and comparing the proposed architecture with a non-adaptive baseline. Consistent with our analysis, the benefits provided by our model increase with the number of raters and the heterogeneity of their preferences. We also show that our model compares favourably to adaptive counterparts, including those performing in-context personalisation.
Researcher Affiliation | Industry | André Barreto (Google DeepMind), Vincent Dumoulin (Google DeepMind), Yiran Mao (Google DeepMind), Mark Rowland (Google DeepMind), Nicolas Perez-Nieves (Google DeepMind), Bobak Shahriari (Google DeepMind), Yann Dauphin (Google DeepMind), Doina Precup (Google DeepMind), Hugo Larochelle (Google DeepMind)
Pseudocode | No | The paper describes methods and processes in prose but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | No | While the Gemma 1.1 2B [26] model used in our work is open-sourced, the specific training scripts used for our data preparation and model training are not included in this release because they are deeply integrated with the proprietary infrastructure used to carry out experiments, making them unsuitable for public release.
Open Datasets | Yes | We adopted UltraFeedback, a dataset carefully curated to ensure the quality and diversity of the responses [18]. This is a relatively large dataset for this type of study: the version we adopted has a training set with 60,829 examples and a test set with 985 examples. We performed experiments using two datasets: UltraFeedback, as before, and also Zollo et al.'s [65] PersonalLLM.
Dataset Splits | Yes | The version we adopted has a training set with 60,829 examples and a test set with 985 examples. We performed a random 90%/10% split of the training set and used the error in the smaller subset (a validation set) as a criterion to select the model to undergo adaptation.
Hardware Specification | No | We focus on the statistical (rather than computational) properties of the proposed approach. That is, both in our theoretical results and in our experiments we are mostly concerned with the method's sample complexity. All the techniques we use have well-understood demands in terms of compute and memory.
Software Dependencies | No | We used Google DeepMind's [26] Gemma 1.1 2B model to implement both a baseline and RFM. ... The responses were generated by Google DeepMind's [27] Gemma 2 9B and Gemma 2 27B (20 responses each).
Experiment Setup | Yes | Training was carried out for 6,000 parameter updates with a batch size of 32. This means that the training procedure went over the entire UltraFeedback training set approximately three times. Each time the example (x_i, y_i, y'_i, ·) was encountered, a new rater ĥ was sampled uniformly at random from Ĥ and the preference z_i was determined through (20) with the corresponding ω_ĥ. ... Training and adaptation were carried out using gradient descent with a learning rate of 10^-5. ... Unless otherwise noted, the default values for the parameters used in the experiments were: m = 60 raters, preference homogeneity level p = 0.7, and n̂ = 30 examples used for adaptation.
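
The figures quoted in the Dataset Splits and Experiment Setup rows can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code (which, per the Open Source Code row, was not released). The function names, the random seed, and the rounding used for the validation subset are assumptions made for illustration. Only the numeric constants come from the table.

```python
import random

# Constants quoted in the table above.
TRAIN_SET_SIZE = 60_829   # UltraFeedback training examples
NUM_UPDATES = 6_000       # parameter updates
BATCH_SIZE = 32
NUM_RATERS = 60           # default m


def split_indices(n, val_fraction=0.1, seed=0):
    """Randomly split range(n) into (train, validation) index lists.

    The seed and the use of round() are illustrative assumptions.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = round(n * val_fraction)
    return indices[n_val:], indices[:n_val]


def passes_over_data(num_updates, batch_size, dataset_size):
    """Number of times training cycles through a dataset of the given size."""
    return num_updates * batch_size / dataset_size


def sample_rater(rng, num_raters=NUM_RATERS):
    """Draw a rater index uniformly at random, as the setup describes."""
    return rng.randrange(num_raters)


train_idx, val_idx = split_indices(TRAIN_SET_SIZE)
passes = passes_over_data(NUM_UPDATES, BATCH_SIZE, TRAIN_SET_SIZE)
rater = sample_rater(random.Random(0))

print(len(train_idx), len(val_idx))  # sizes of the 90%/10% subsets
print(round(passes, 2))              # passes over the full training set
```

Under these assumptions, 6,000 updates of 32 examples amount to 192,000 examples seen, or roughly 3.2 passes over the 60,829-example training set, which agrees with the "approximately three times" stated in the Experiment Setup row. How the preference z_i is computed from the sampled rater depends on the paper's equation (20), which is not reproduced in this report and is therefore left out of the sketch.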