Teaching Language Models to Hallucinate Less with Synthetic Tasks
Authors: Erik Jones, Hamid Palangi, Clarisse Simões Ribeiro, Varun Chandrasekaran, Subhabrata Mukherjee, Arindam Mitra, Ahmed Hassan Awadallah, Ece Kamar
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across three realistic abstractive summarization tasks, SYNTRA reduces hallucination for two 13B-parameter LLMs using only a synthetic retrieval task for supervision. We evaluate Vicuna v1.1 (Chiang et al., 2023) and Orca (Mukherjee et al., 2023) on three realistic tasks: search-and-retrieve, meeting summarization, and clinical report generation. |
| Researcher Affiliation | Collaboration | UC Berkeley, Microsoft Research, UIUC, Hippocratic AI |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format. |
| Open Source Code | No | The paper does not contain any statement about releasing open-source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use MS MARCO as a source of examples (Nguyen et al., 2016)... We use the QMSum dataset as a source of examples (Zhong et al., 2021)... We use ACI-Bench (Yim et al., 2023) as a source of examples... For the reference data Dref, we use SQuAD (Rajpurkar et al., 2016) as a source of 50000 prompts. |
| Dataset Splits | Yes | For computational tractability, we select 1000 random queries from the MS MARCO validation set that require a long-form response (as labeled in the original dataset). ... We combine the train, validation, and three test splits for a total of 207 examples... We generate a dataset of 100,000 examples and test for hallucination... When optimizing on synthetic data mixed with the reference data, we use 50000 examples on the names task and 50000 examples on the reference task. (See the data-selection sketch after the table.) |
| Hardware Specification | Yes | We perform all of our experiments on a single NVIDIA A100-PCIE-80GB GPU, except for fine-tuning, for which we use four A100s. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2015)' and 'default Hugging Face parameters (Wolf et al., 2019)' but does not provide version numbers for software such as Hugging Face Transformers, PyTorch, or the Python runtime. |
| Experiment Setup | Yes | We sample with temperature 0.7 when generating, and have a max sequence length of 1024 tokens. ... We optimize the postfix with Adam (Kingma & Ba, 2015), using learning rate 1e-4, no weight decay, epsilon 1e-7... Specifically, we use a learning rate of 5e-5, warmup ratio of 0.03, weight decay of 0, and we run fine-tuning for one epoch with batch size of 12 per device... (See the optimization and fine-tuning sketches after the table.) |
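
The MS MARCO selection described in the Dataset Splits row can be mirrored with a short script. The sketch below is an illustration, not the authors' code: it assumes the `ms_marco` v2.1 release on the Hugging Face hub, and the `is_long_form` predicate and random seed are hypothetical stand-ins for the "long-form response" label and sampling procedure the excerpt does not spell out.

```python
# Sketch of selecting 1000 random long-form validation queries from MS MARCO.
# The is_long_form predicate and the seed are assumptions, not from the paper.
import random

from datasets import load_dataset

random.seed(0)  # assumed seed; the paper does not report one

marco_val = load_dataset("ms_marco", "v2.1", split="validation")

def is_long_form(example) -> bool:
    """Hypothetical stand-in for the dataset's 'long-form response' label."""
    return example["query_type"] == "DESCRIPTION"  # assumption

long_form = [ex for ex in marco_val if is_long_form(ex)]
queries = random.sample(long_form, k=1000)  # 1000 random validation queries
```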
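
The quoted optimizer settings for the postfix (Adam, learning rate 1e-4, no weight decay, epsilon 1e-7) would look roughly like the following. This is a minimal sketch that assumes the postfix is a trainable block of soft embeddings appended to each synthetic-task prompt; the model identifier, postfix length, and loss wiring are assumptions, and `loss_on_batch` is left as a placeholder rather than the authors' implementation.

```python
# Sketch of optimizing a soft postfix with the quoted Adam settings
# (lr 1e-4, weight decay 0, epsilon 1e-7). Everything beyond those three
# hyperparameters is an assumption, not the authors' implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-13b-v1.1"  # hypothetical identifier for Vicuna 13B v1.1
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.requires_grad_(False)  # only the postfix parameters are updated here

embed_dim = model.get_input_embeddings().embedding_dim
postfix_len = 16  # assumed length; the excerpt does not report one
postfix = torch.nn.Parameter(torch.randn(postfix_len, embed_dim) * 0.01)

optimizer = torch.optim.Adam([postfix], lr=1e-4, weight_decay=0.0, eps=1e-7)

def loss_on_batch(batch):
    """Placeholder: embed the synthetic-task prompt, append `postfix` to the
    input embeddings, and return the LM loss on the retrieval target."""
    raise NotImplementedError

synthetic_task_batches = []  # placeholder: batches from the synthetic retrieval task

for batch in synthetic_task_batches:
    optimizer.zero_grad()
    loss = loss_on_batch(batch)
    loss.backward()
    optimizer.step()
```

At evaluation time, the quoted sampling settings correspond to calling `model.generate` with `do_sample=True`, `temperature=0.7`, and a maximum length of 1024 tokens.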
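
The fine-tuning hyperparameters in the Experiment Setup row map onto a Hugging Face `TrainingArguments` object as sketched below. Only the five quoted values come from the paper; the output directory and precision flag are assumptions.

```python
# Sketch of the quoted fine-tuning configuration (lr 5e-5, warmup ratio 0.03,
# weight decay 0, one epoch, per-device batch size 12). Other fields are assumed.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./syntra-finetune",   # hypothetical output path
    learning_rate=5e-5,
    warmup_ratio=0.03,
    weight_decay=0.0,
    num_train_epochs=1,
    per_device_train_batch_size=12,
    bf16=True,                        # assumption: mixed precision on the four A100s
)
```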