Human Feedback is not Gold Standard

Authors: Tom Hosking, Phil Blunsom, Max Bartolo

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we analyse human annotation of model outputs, both for overall preference scores and for specific error criteria. In Section 2 we establish a set of error types that are task independent and act as minimum requirements for model outputs. We analyse the error coverage of overall preference scores. We ask two sets of annotators to rate a range of LLM outputs, the first according to these error types and the second according to their own judgements of overall quality, and find that overall preference scores under-represent factuality and faithfulness. In Section 3, we consider two possible sources of bias when annotating for specific error types by generating outputs with varying assertiveness and complexity, and find that assertiveness strongly biases human factuality judgements.
Researcher Affiliation | Collaboration | Tom Hosking (University of Edinburgh) tom.hosking@ed.ac.uk; Phil Blunsom (Cohere) phil@cohere.com; Max Bartolo (Cohere, UCL) max@cohere.com
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | Our code and data are available at https://github.com/cohere-ai/human-feedback-paper.
Open Datasets | Yes | To cover a range of different tasks for which evaluation is challenging, we construct input prompts from three datasets: Curation Corpus (Curation, 2020) is a summarization dataset composed of 40,000 news articles and professionally written summaries; Amazon Product Descriptions (Ni et al., 2019) gives a product title and specification as input and requires generating a compelling product description; and WikiHow (Koupaee & Wang, 2018) consists of how-to questions and step-by-step guides.
Dataset Splits | No | We annotate a total of 900 distinct outputs, with a total of 4,440 annotations including quality checks.
Hardware Specification | No | The paper does not provide specific details on the hardware (e.g., GPU/CPU models, memory) used for running the experiments or generating model outputs.
Software Dependencies | No | The paper mentions specific models such as “Llama 2 13B Chat” and the annotation interface Potato (Pei et al., 2022), but it does not give version numbers for the software components needed for reproduction, such as programming languages, frameworks, or other library dependencies.
Experiment Setup | Yes | Outputs were sampled using a temperature of 0.7. For Llama 2, the full Hugging Face model ID used was meta-llama/Llama-2-13b-chat-hf. The prompt template used was: [INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. <</SYS>> {instruction} [/INST]
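
The Experiment Setup row above pins down the generation configuration: the Hugging Face model ID meta-llama/Llama-2-13b-chat-hf, a sampling temperature of 0.7, and the [INST]/<<SYS>> prompt template. The following is a minimal sketch of regenerating outputs under those settings with the transformers library; it is not the authors' pipeline, and max_new_tokens, top_p, and the exact whitespace inside the template are assumptions, since the paper does not report them.

# Minimal sketch: sample Llama 2 outputs with the settings quoted above.
# Assumptions (not stated in the paper): max_new_tokens, device placement,
# and the newline layout of the standard Llama 2 chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"

SYSTEM_PROMPT = (
    "You are a helpful, respectful and honest assistant. Always answer as "
    "helpfully as possible, while being safe. Your answers should not include "
    "any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content."
)

# Prompt template quoted in the Experiment Setup row (whitespace assumed).
TEMPLATE = "[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{instruction} [/INST]"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate(instruction: str) -> str:
    prompt = TEMPLATE.format(system=SYSTEM_PROMPT, instruction=instruction)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,    # sampling temperature reported in the paper
        max_new_tokens=512, # assumption: not reported
    )
    # Keep only the newly generated continuation, not the prompt tokens.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

Each call such as generate("Summarise the following article: ...") draws one sampled output; the authors' released repository (linked in the Open Source Code row) should be treated as the authoritative implementation.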