Fine-tuning language models to find agreement among humans with diverse preferences
Authors: Michiel Bakker, Martin Chadwick, Hannah Sheahan, Michael Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matt Botvinick, Christopher Summerfield
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., should we raise taxes on the rich?), and rate the LLM's generated candidate consensus statements for agreement and quality. A reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. The model produces consensus statements that are preferred by human users over those from prompted LLMs (> 70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. (See the illustrative ranking sketch after this table.) |
| Researcher Affiliation | Collaboration | Michiel A. Bakker, DeepMind, miba@deepmind.com; Martin J. Chadwick, DeepMind, martin@deepmind.com; Hannah R. Sheahan, DeepMind, hsheahan@deepmind.com; Michael Henry Tessler, DeepMind, tesslerm@deepmind.com; Lucy Campbell-Gillingham, DeepMind, lcgillingham@deepmind.com; Jan Balaguer, DeepMind, jua@deepmind.com; Nat McAleese, DeepMind, nmca@deepmind.com; Amelia Glaese, DeepMind, glamia@deepmind.com; John Aslanides, DeepMind, jaslanides@deepmind.com; Matthew M. Botvinick, DeepMind and University College London, botvinick@deepmind.com; Christopher Summerfield, DeepMind and University of Oxford, csummerfield@deepmind.com |
| Pseudocode | No | The paper outlines a multi-step training process (Step 1, Step 2, Step 3) in Section 3.4, describing the procedures in detail. However, these steps are presented in natural language within paragraphs and do not constitute formal pseudocode blocks or algorithm figures. |
| Open Source Code | No | We are not releasing the code or data. |
| Open Datasets | No | The paper describes generating and collecting its own dataset of debate questions and human opinions, stating: 'We created a large data set of debate questions and built a customized environment and pipeline that allowed us to collect human opinions and fine-tune our models in an iterative loop (Figure 1).' However, the paper explicitly states, 'We are not releasing the code or data,' and no public access information (URL, DOI, repository, or formal citation to a publicly available version of the collected dataset) is provided. |
| Dataset Splits | No | The paper specifies training, within-distribution hold-out, and out-of-distribution hold-out sets for questions. However, it does not explicitly describe a separate 'validation' split, with specific percentages or counts, for hyperparameter tuning or model selection. The data collected from human ratings is used both for training the reward model and for evaluation, but a distinct validation set is not detailed. |
| Hardware Specification | Yes | We trained our models using Tensor Processing Units (TPUv3). The supervised fine-tuning models were fine-tuned using 64 TPU cores for 200 steps. The reward models were trained using 32 TPU cores for 1500 steps. |
| Software Dependencies | No | The paper does not provide specific version numbers for any ancillary software dependencies (e.g., programming languages like Python with version, or libraries like PyTorch/TensorFlow with versions). It mentions using 'Chinchilla [17]' as the base LLM, but this is a model, not a software package dependency with a version. |
| Experiment Setup | Yes | SFT training details including the prompt template and hyperparameters can be found in Appendix C.1.2. |
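
The consensus-ranking step summarized in the Research Type row lends itself to a short illustration. The Python sketch below is not the authors' implementation (no code is released); it assumes a hypothetical `predict_reward(opinion, candidate)` call standing in for the trained reward model, and uses two illustrative social welfare functions (a utilitarian mean and a max-min) to rank candidate consensus statements for a group of participants.

```python
# Minimal sketch (assumed, not the authors' released code) of the final
# ranking step described in the abstract: a reward model predicts each
# participant's approval of a candidate consensus statement, and candidates
# are ranked by an aggregation (social welfare) function over those
# predictions. `predict_reward` and both welfare functions are hypothetical
# placeholders.
from typing import Callable, List


def utilitarian(rewards: List[float]) -> float:
    # Group welfare as the mean predicted approval across participants.
    return sum(rewards) / len(rewards)


def max_min(rewards: List[float]) -> float:
    # Group welfare as the predicted approval of the worst-off participant.
    return min(rewards)


def rank_candidates(
    candidates: List[str],
    opinions: List[str],
    predict_reward: Callable[[str, str], float],  # (opinion, candidate) -> approval
    welfare: Callable[[List[float]], float] = utilitarian,
) -> List[str]:
    """Return candidate consensus statements sorted by group welfare, best first."""
    scored = [
        (welfare([predict_reward(op, cand) for op in opinions]), cand)
        for cand in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored]


if __name__ == "__main__":
    # Toy usage with a dummy reward model that scores word overlap with an opinion.
    def dummy_reward(opinion: str, candidate: str) -> float:
        return len(set(opinion.lower().split()) & set(candidate.lower().split()))

    opinions = ["raise taxes on the rich", "keep taxes low for everyone"]
    candidates = ["taxes should be fair for everyone", "the rich should pay more"]
    print(rank_candidates(candidates, opinions, dummy_reward))
```

The welfare function is the only knob in this sketch: swapping the utilitarian mean for the max-min changes how the group's diverse preferences are traded off, which mirrors the paper's point that the ranking depends on the chosen aggregation (social welfare) function.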