Fine-Tuning Language Models for Factuality
Authors: Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we fine-tune language models to be more factual, without human labeling and targeting more open-ended generation settings than past work. We leverage two key recent innovations in NLP to do so. ... We show that learning from automatically generated factuality preference rankings, generated either through existing retrieval systems or our novel retrieval-free approach, significantly improves the factuality (percent of generated claims that are correct) of Llama-2 on held-out topics compared with RLHF or decoding strategies targeted at factuality. At 7B scale, compared to Llama-2-Chat, we observe 53% and 50% reductions in factual error rate when generating biographies and answering medical questions, respectively. Our experiments evaluate the extent to which factuality can be learned through preference-based reinforcement learning, using the fully automated preference-generation pipeline described in Section 3. (A sketch of this preference-pair construction appears after the table.) |
| Researcher Affiliation | Academia | Katherine Tian*, Eric Mitchell*, Huaxiu Yao, Christopher D. Manning, Chelsea Finn. Stanford University; UNC Chapel Hill. {kattian,eric.mitchell}@cs.stanford.edu |
| Pseudocode | No | The paper describes the method in text and with diagrams (Figure 1 and Figure 2) but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | A reference implementation can be found at https://github.com/kttian/llm_factuality_tuning. |
| Open Datasets | No | For biographies, we generated a dataset consisting of 463 diverse well-known individuals (288 train, 50 val, 125 test) with 10 short-paragraph biographies each. For medical question answering, we used a dataset of 295 diverse common medical conditions (150 train, 45 val, 100 test) with 6 questions about each condition and 6 short-paragraph answers per question. ... Because FactScore uses retrieval against a given Wikipedia article, we generate data based on individuals and medical conditions that have Wikipedia pages. |
| Dataset Splits | Yes | For biographies, we generated a dataset consisting of 463 diverse well-known individuals (288 train, 50 val, 125 test) with 10 short-paragraph biographies each. For medical question answering, we used a dataset of 295 diverse common medical conditions (150 train, 45 val, 100 test) with 6 questions about each condition and 6 short-paragraph answers per question. |
| Hardware Specification | No | The paper mentions using Llama-1-7b and Llama-2-7b models but does not specify any hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions the use of GPT-3.5 and Llama models, but does not provide specific version numbers for any software dependencies like programming languages, frameworks, or libraries (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | First, we sample n candidate responses for each prompt from the model with simple temperature sampling at temperature 1.0 (using few-shot prompting for models that have not been fine-tuned). (A sketch of this sampling step follows the table.) |
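
The Experiment Setup row quotes the first step of the pipeline: sampling multiple candidate responses per prompt at temperature 1.0. Below is a minimal sketch of that step, assuming a Hugging Face `transformers` stack; the paper does not name its software, and the checkpoint, n = 10, and generation length are all assumptions.

```python
# Hypothetical sketch of the candidate-sampling step. The model checkpoint,
# n = 10, max_new_tokens, and the use of Hugging Face transformers are all
# assumptions; the paper does not specify its software stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# For a base (non-fine-tuned) model, the paper prepends few-shot examples;
# a single placeholder prompt is used here for brevity.
prompt = "Write a short biography of Marie Curie."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Simple temperature sampling at T = 1.0, n return sequences per prompt.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    num_return_sequences=10,  # n = 10 is an assumption
    max_new_tokens=256,
)
candidates = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```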
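
The Research Type row describes learning from automatically generated factuality preference rankings. The sketch below shows one plausible way to turn per-response factuality scores into chosen/rejected pairs for preference-based fine-tuning (e.g., DPO); `factuality_score` is a hypothetical stand-in for the paper's retrieval-based or retrieval-free scorer, not the authors' implementation.

```python
# Minimal sketch: convert factuality scores into preference pairs for
# DPO-style training. All names here are illustrative placeholders.
import itertools
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response judged more factual
    rejected: str  # response judged less factual

def build_preference_pairs(
    prompt: str,
    candidates: list[str],
    factuality_score: Callable[[str], float],
) -> list[PreferencePair]:
    """Score each sampled candidate (e.g., the fraction of extracted claims
    judged correct, as in a FactScore-style evaluator) and emit one
    preference pair for every candidate pair with differing scores."""
    scored = [(factuality_score(c), c) for c in candidates]
    pairs = []
    for (s_a, a), (s_b, b) in itertools.combinations(scored, 2):
        if s_a == s_b:
            continue  # ties carry no preference signal
        chosen, rejected = (a, b) if s_a > s_b else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs

# Toy usage with a dummy scorer (a real scorer would check claims against
# a reference corpus such as Wikipedia):
pairs = build_preference_pairs(
    "Write a short biography of Marie Curie.",
    ["Response A ...", "Response B ...", "Response C ..."],
    factuality_score=lambda text: float(len(text) % 3),  # placeholder only
)
```

The resulting pairs would then feed a preference-optimization trainer; the authors' released code at the GitHub link above is the authoritative reference for the actual pipeline.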