Fine-Tuning Language Models for Factuality

Authors: Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we fine-tune language models to be more factual, without human labeling and targeting more open-ended generation settings than past work. We leverage two key recent innovations in NLP to do so. ... We show that learning from automatically generated factuality preference rankings, generated either through existing retrieval systems or our novel retrieval-free approach, significantly improves the factuality (percent of generated claims that are correct) of Llama-2 on held-out topics compared with RLHF or decoding strategies targeted at factuality. At 7B scale, compared to Llama-2-Chat, we observe 53% and 50% reduction in factual error rate when generating biographies and answering medical questions, respectively. Our experiments evaluate the extent to which factuality can be learned through preference-based reinforcement learning, using the fully automated preference-generation pipeline described in Section 3. (A sketch of the preference-pair construction appears after the table.)
Researcher Affiliation | Academia | Katherine Tian*, Eric Mitchell*, Huaxiu Yao, Christopher D. Manning, Chelsea Finn; Stanford University; UNC Chapel Hill; {kattian,eric.mitchell}@cs.stanford.edu
Pseudocode | No | The paper describes the method in text and with diagrams (Figure 1 and Figure 2) but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | A reference implementation can be found at https://github.com/kttian/llm_factuality_tuning.
Open Datasets | No | For biographies, we generated a dataset consisting of 463 diverse well-known individuals (288 train, 50 val, 125 test) with 10 short-paragraph biographies each. For medical question answering, we used a dataset of 295 diverse common medical conditions (150 train, 45 val, 100 test) with 6 questions about each condition and 6 short-paragraph answers per question. ... Because FactScore uses retrieval against a given Wikipedia article, we generate data based on individuals and medical conditions that have Wikipedia pages.
Dataset Splits | Yes | For biographies, we generated a dataset consisting of 463 diverse well-known individuals (288 train, 50 val, 125 test) with 10 short-paragraph biographies each. For medical question answering, we used a dataset of 295 diverse common medical conditions (150 train, 45 val, 100 test) with 6 questions about each condition and 6 short-paragraph answers per question.
Hardware Specification | No | The paper mentions using Llama-1-7b and Llama-2-7b models but does not specify any hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions the use of GPT-3.5 and Llama models but does not provide specific version numbers for any software dependencies such as programming languages, frameworks, or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | First, we sample n multiple candidate responses for each prompt from the model with simple temperature sampling with temperature 1.0 (using few-shot prompting for models that have not been fine-tuned). (A sketch of this sampling step appears after the table.)
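
The Experiment Setup row quotes the candidate-sampling step, but the paper does not include code for it. Below is a minimal sketch, assuming a Hugging Face transformers causal LM; the model name, prompt, number of samples, and generation length are illustrative assumptions rather than values taken from the paper or its reference implementation.

```python
# Sketch of the sampling step: draw n candidate responses per prompt with
# plain temperature sampling at temperature 1.0 (the setting quoted above).
# Model name, prompt, n, and max_new_tokens are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def sample_candidates(prompt: str, n: int = 10, max_new_tokens: int = 300):
    """Return n independently sampled responses for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,          # temperature sampling, no beam search
        temperature=1.0,
        num_return_sequences=n,  # n candidates for the same prompt
        max_new_tokens=max_new_tokens,
    )
    # Keep only the generated continuation, dropping the prompt tokens.
    continuations = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(continuations, skip_special_tokens=True)

candidates = sample_candidates("Write a short biography of Marie Curie.")
```

For an un-fine-tuned base model, the prompt string would carry the few-shot examples mentioned in the quote.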
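
The Research Type row describes learning from automatically generated factuality preference rankings. As a minimal sketch of how scored candidates could be turned into preference data, assume each candidate has already received a scalar factuality score (e.g., the fraction of its extracted claims judged correct); the function and field names below are hypothetical and are not taken from the paper's reference implementation.

```python
# Sketch: convert per-response factuality scores into preference pairs in the
# prompt/chosen/rejected form that preference-based fine-tuning (e.g., DPO)
# typically consumes. The scores themselves are assumed to come from a
# retrieval-based or retrieval-free factuality estimator, as in the paper.
from itertools import combinations

def build_preference_pairs(prompt, candidates, scores, margin=0.0):
    """candidates: sampled responses; scores: matching factuality scores."""
    pairs = []
    for i, j in combinations(range(len(candidates)), 2):
        if abs(scores[i] - scores[j]) <= margin:
            continue  # skip ties: no preference signal
        hi, lo = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append({
            "prompt": prompt,
            "chosen": candidates[hi],    # more factual response preferred
            "rejected": candidates[lo],  # less factual response dispreferred
        })
    return pairs

# Toy usage with made-up scores (fraction of correct claims):
pairs = build_preference_pairs(
    "Write a short biography of Marie Curie.",
    ["candidate A", "candidate B", "candidate C"],
    [0.9, 0.6, 0.4],
)
```

Records in this form match what common DPO implementations (for example, trl's DPOTrainer) expect; the repository linked in the Open Source Code row remains the authoritative implementation.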