Teaching Language Models to Hallucinate Less with Synthetic Tasks

Authors: Erik Jones, Hamid Palangi, Clarisse Simões Ribeiro, Varun Chandrasekaran, Subhabrata Mukherjee, Arindam Mitra, Ahmed Hassan Awadallah, Ece Kamar

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across three realistic abstractive summarization tasks, SYNTRA reduces hallucination for two 13B-parameter LLMs using only a synthetic retrieval task for supervision. We evaluate Vicuna v1.1 (Chiang et al., 2023) and Orca (Mukherjee et al., 2023) on three realistic tasks: search-and-retrieve, meeting summarization, and clinical report generation.
Researcher Affiliation | Collaboration | (1) UC Berkeley, (2) Microsoft Research, (3) UIUC, (4) Hippocratic AI
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format.
Open Source Code | No | The paper does not contain any statement about releasing open-source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We use MS MARCO as a source of examples (Nguyen et al., 2016)... We use the QMSum dataset as a source of examples (Zhong et al., 2021)... We use ACI-Bench (Yim et al., 2023) as a source of examples... For the reference data D_ref, we use SQuAD (Rajpurkar et al., 2016) as a source of 50000 prompts.
Dataset Splits | Yes | For computational tractability, we select 1000 random queries from the MS MARCO validation set that require a long-form response (as labeled in the original dataset). ... We combine the train, validation, and three test splits for a total of 207 examples... We generate a dataset of 100,000 examples and test for hallucination... When optimizing on synthetic data mixed with the reference data, we use 50000 examples on the names task and 50000 examples on the reference task.
Hardware Specification | Yes | We perform all of our experiments on a single NVIDIA A100-PCIE-80GB GPU, except for fine-tuning, for which we use four A100s.
Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2015)' and 'default Hugging Face parameters (Wolf et al., 2019)' but does not provide specific version numbers for software libraries such as Hugging Face Transformers, PyTorch, or Python.
Experiment Setup | Yes | We sample with temperature 0.7 when generating, and have a max sequence length of 1024 tokens. ... We optimize the postfix with Adam (Kingma & Ba, 2015), using learning rate 1e-4, no weight decay, epsilon 1e-7... Specifically, we use a learning rate of 5e-5, warm up ratio of 0.03, weight decay is 0, and we run fine-tuning for one epoch with batch size of 12 per device...
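The quoted experiment setup maps directly onto standard PyTorch and Hugging Face Transformers configuration. The sketch below is a hedged illustration rather than the authors' released code: the output path and the postfix embedding shape are placeholders, and only the numeric settings (temperature 0.7, max length 1024 tokens; postfix Adam with learning rate 1e-4, epsilon 1e-7, no weight decay; fine-tuning with learning rate 5e-5, warmup ratio 0.03, weight decay 0, one epoch, batch size 12 per device) are taken from the reported values.

```python
# Hedged sketch: the reported hyperparameters expressed as standard
# PyTorch / Hugging Face Transformers configuration objects.
# Placeholder values are marked; numeric settings mirror the quotes above.
import torch
from transformers import GenerationConfig, TrainingArguments

# Generation: sample with temperature 0.7, max sequence length 1024 tokens.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    max_length=1024,
)

# Postfix optimization: Adam with learning rate 1e-4, no weight decay,
# epsilon 1e-7. The embedding tensor shape is a placeholder, not a value
# reported in the paper.
postfix_embeddings = torch.nn.Parameter(torch.zeros(16, 5120))
postfix_optimizer = torch.optim.Adam(
    [postfix_embeddings], lr=1e-4, weight_decay=0.0, eps=1e-7
)

# Fine-tuning: learning rate 5e-5, warmup ratio 0.03, weight decay 0,
# one epoch, batch size 12 per device (otherwise default Trainer settings).
training_args = TrainingArguments(
    output_dir="syntra-finetune",  # placeholder path
    learning_rate=5e-5,
    warmup_ratio=0.03,
    weight_decay=0.0,
    num_train_epochs=1,
    per_device_train_batch_size=12,
)

print(generation_config.temperature, training_args.learning_rate)
```

In this reading, the fine-tuning run would consume the mixture of 50000 synthetic names-task examples and 50000 SQuAD reference prompts described under Dataset Splits; constructing that mixture is outside the scope of the quoted settings and is omitted here.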