Fine-tuning language models to find agreement among humans with diverse preferences

Authors: Michiel Bakker, Martin Chadwick, Hannah Sheahan, Michael Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matt Botvinick, Christopher Summerfield

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., 'should we raise taxes on the rich?'), and rate the LLM's generated candidate consensus statements for agreement and quality. A reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. The model produces consensus statements that are preferred by human users over those from prompted LLMs (>70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. (Illustrative, unofficial sketches of a preference-model objective and of this ranking step follow the table.)
Researcher Affiliation | Collaboration | Michiel A. Bakker (DeepMind) miba@deepmind.com; Martin J. Chadwick (DeepMind) martin@deepmind.com; Hannah R. Sheahan (DeepMind) hsheahan@deepmind.com; Michael Henry Tessler (DeepMind) tesslerm@deepmind.com; Lucy Campbell-Gillingham (DeepMind) lcgillingham@deepmind.com; Jan Balaguer (DeepMind) jua@deepmind.com; Nat McAleese (DeepMind) nmca@deepmind.com; Amelia Glaese (DeepMind) glamia@deepmind.com; John Aslanides (DeepMind) jaslanides@deepmind.com; Matthew M. Botvinick (DeepMind, University College London) botvinick@deepmind.com; Christopher Summerfield (DeepMind, University of Oxford) csummerfield@deepmind.com
Pseudocode | No | The paper outlines a multi-step training process (Step 1, Step 2, Step 3) in Section 3.4 and describes the procedures in detail, but the steps are presented in natural-language paragraphs rather than as formal pseudocode blocks or algorithm figures.
Open Source Code | No | The paper explicitly states: 'We are not releasing the code or data.'
Open Datasets | No | The paper describes generating and collecting its own dataset of debate questions and human opinions: 'We created a large data set of debate questions and built a customized environment and pipeline that allowed us to collect human opinions and fine-tune our models in an iterative loop (Figure 1).' However, the paper explicitly states 'We are not releasing the code or data.', and no public access information (URL, DOI, repository, or formal citation to a publicly available version of the collected dataset) is provided.
Dataset Splits | No | The paper specifies training, within-distribution hold-out, and out-of-distribution hold-out sets of questions, but it does not describe a separate validation split (with specific counts or percentages) used for hyperparameter tuning or model selection. The collected human-rating data is used both to train the reward model and for evaluation, but no distinct validation set is detailed.
Hardware Specification | Yes | We trained our models using Tensor Processing Units (TPUv3). The supervised fine-tuning models were fine-tuned using 64 TPU cores for 200 steps. The reward models were trained using 32 TPU cores for 1500 steps.
Software Dependencies | No | The paper does not provide version numbers for any ancillary software dependencies (e.g., a specific Python release, or library versions such as PyTorch or TensorFlow). It mentions using 'Chinchilla [17]' as the base LLM, but this is a model, not a versioned software package dependency.
Experiment Setup | Yes | SFT training details, including the prompt template and hyperparameters, can be found in Appendix C.1.2.
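
The Research Type row notes that a reward model is trained to predict individual participants' preferences. Because the code and data are not released, the paper's exact objective is not reproduced here; the minimal sketch below shows a standard Bradley-Terry pairwise preference loss of the kind commonly used for such reward models. It is an assumption, not the authors' implementation.

```python
# Minimal sketch, assuming a Bradley-Terry pairwise preference objective;
# NOT the paper's (unreleased) training code.
import math


def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood that the participant's preferred statement beats
    the rejected one: -log(sigmoid(r_preferred - r_rejected)), computed stably."""
    diff = r_preferred - r_rejected
    # softplus(-diff), split by sign to avoid overflow in exp()
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Example: a preference the reward model already gets right incurs a small loss.
print(pairwise_preference_loss(1.2, 0.3))  # ~0.341
```

Minimizing this loss over a participant's pairwise comparisons pushes the model to assign higher reward to the statements that participant preferred, which is what the final ranking step relies on.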
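The abstract quoted in the Research Type row also describes the final ranking step: candidate consensus statements from the fine-tuned generator are scored per participant by the reward model and ranked under an aggregation (social welfare) function. The sketch below illustrates that selection logic only; `generate_candidates`, `reward_model`, and the two example welfare functions are hypothetical stand-ins, since no official code is available.

```python
# Minimal sketch of welfare-based reranking of candidate consensus statements.
# All callables are hypothetical stand-ins supplied by the caller.
from typing import Callable, List
import statistics


def utilitarian(scores: List[float]) -> float:
    """Mean predicted reward across the group."""
    return statistics.mean(scores)


def rawlsian(scores: List[float]) -> float:
    """Predicted reward of the least-satisfied participant (max-min)."""
    return min(scores)


def select_consensus(
    question: str,
    opinions: List[str],
    generate_candidates: Callable[[str, List[str]], List[str]],
    reward_model: Callable[[str, str, str], float],
    welfare: Callable[[List[float]], float] = utilitarian,
) -> str:
    """Sample candidate consensus statements, score each one per participant
    with the reward model, aggregate the per-participant scores with a social
    welfare function, and return the highest-ranked candidate."""
    candidates = generate_candidates(question, opinions)

    def group_score(candidate: str) -> float:
        per_person = [reward_model(question, opinion, candidate) for opinion in opinions]
        return welfare(per_person)

    return max(candidates, key=group_score)
```

The 'tight fine-tuned baseline that lacks the final ranking step' mentioned in the abstract corresponds to returning a single SFT sample without this reranking; the abstract also notes that the aggregation function itself can be varied.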