Human Feedback is not Gold Standard
Authors: Tom Hosking, Phil Blunsom, Max Bartolo
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we analyse human annotation of model outputs, both for overall preference scores and for specific error criteria. In Section 2 we establish a set of error types that are task independent and act as minimum requirements for model outputs. We analyse the error coverage of overall preference scores. We ask two sets of annotators to rate a range of LLM outputs, the first according to these error types and the second according to their own judgements of overall quality, and find that overall preference scores under-represent factuality and faithfulness. In Section 3, we consider two possible sources of bias when annotating for specific error types by generating outputs with varying assertiveness and complexity, and find that assertiveness strongly biases human factuality judgements. |
| Researcher Affiliation | Collaboration | Tom Hosking (University of Edinburgh, tom.hosking@ed.ac.uk); Phil Blunsom (Cohere, phil@cohere.com); Max Bartolo (Cohere and UCL, max@cohere.com) |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Our code and data are available at https://github.com/cohere-ai/human-feedback-paper. |
| Open Datasets | Yes | Datasets To cover a range of different tasks for which evaluation is challenging, we construct input prompts from three datasets: Curation Corpus (Curation, 2020) is a summarization dataset composed of 40,000 news articles and professionally written summaries; Amazon Product Descriptions (Ni et al., 2019) gives a product title and specification as input and requires generating a compelling product description; and Wikihow (Koupaee & Wang, 2018) consists of 'how to' questions and step-by-step guides. |
| Dataset Splits | No | We annotate a total of 900 distinct outputs, with a total of 4,440 annotations including quality checks. |
| Hardware Specification | No | The paper does not provide specific details on the hardware (e.g., GPU/CPU models, memory) used for running the experiments or generating model outputs. |
| Software Dependencies | No | The paper mentions specific models like “Llama 2 13B Chat” and the annotation interface “Potato (Pei et al., 2022)”, but it does not provide specific version numbers for key software components or libraries like programming languages, frameworks, or other dependencies required for reproduction. |
| Experiment Setup | Yes | Outputs were sampled using a temperature of 0.7. For Llama 2, the full Hugging Face model ID used was meta-llama/Llama-2-13b-chat-hf. The prompt template used was: [INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. <</SYS>> {instruction} [/INST] |
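
The Experiment Setup row quotes the sampling temperature, the Hugging Face model ID, and the Llama 2 chat prompt template. Below is a minimal sketch of how that generation setup might be reproduced with the Hugging Face transformers library; it is not the authors' code. The exact newline placement in the template, the `max_new_tokens` limit, and the `generate_output` helper are assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of the generation setup quoted above:
# meta-llama/Llama-2-13b-chat-hf sampled at temperature 0.7 with the stated
# Llama 2 chat prompt template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # model ID quoted in the paper

# Prompt template quoted in the paper; the newline placement here is assumed.
PROMPT_TEMPLATE = (
    "[INST] <<SYS>>\n"
    "You are a helpful, respectful and honest assistant. Always answer as "
    "helpfully as possible, while being safe. Your answers should not include "
    "any harmful, unethical, racist, sexist, toxic, dangerous, or illegal "
    "content.\n"
    "<</SYS>>\n\n"
    "{instruction} [/INST]"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def generate_output(instruction: str, max_new_tokens: int = 512) -> str:
    """Sample one output at temperature 0.7, as reported in the paper."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,              # sampling temperature stated in the paper
        max_new_tokens=max_new_tokens,  # assumed limit, not given in the paper
    )
    # Decode only the continuation, dropping the prompt tokens.
    continuation = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(continuation, skip_special_tokens=True)

# Example usage with a hypothetical instruction:
# print(generate_output("Write a compelling product description for ..."))
```

The instruction string passed to `generate_output` would be built from the three datasets listed in the Open Datasets row (Curation Corpus, Amazon Product Descriptions, Wikihow); the exact prompt construction is not specified in the excerpts quoted here.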