Teaching Language Models to Hallucinate Less with Synthetic Tasks
Authors: Erik Jones, Hamid Palangi, Clarisse Simões Ribeiro, Varun Chandrasekaran, Subhabrata Mukherjee, Arindam Mitra, Ahmed Hassan Awadallah, Ece Kamar
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across three realistic abstractive summarization tasks, SYNTRA reduces hallucination for two 13B-parameter LLMs using only a synthetic retrieval task for supervision. We evaluate Vicuna v1.1 (Chiang et al., 2023) and Orca (Mukherjee et al., 2023) on three realistic tasks: search-and-retrieve, meeting summarization, and clinical report generation. |
| Researcher Affiliation | Collaboration | UC Berkeley, Microsoft Research, UIUC, Hippocratic AI |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format. |
| Open Source Code | No | The paper does not contain any statement about releasing open-source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use MS MARCO as a source of examples (Nguyen et al., 2016)... We use the QMSum dataset as a source of examples (Zhong et al., 2021)... We use ACI-Bench (Yim et al., 2023) as a source of examples... For the reference data Dref, we use SQuAD (Rajpurkar et al., 2016) as a source of 50000 prompts. |
| Dataset Splits | Yes | For computational tractability, we select 1000 random queries from the MS MARCO validation set that require a long-form response (as labeled in the original dataset). ... We combine the train, validation, and three test splits for a total of 207 examples... We generate a dataset of 100,000 examples and test for hallucination... When optimizing on synthetic data mixed with the reference data, we use 50000 examples on the names task and 50000 examples on the reference task. (See the data-selection sketch after the table.) |
| Hardware Specification | Yes | We perform all of our experiments on a single NVIDIA A100-PCIE-80GB GPU, except for fine-tuning, for which we use four A100s. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2015)' and 'default Hugging Face parameters (Wolf et al., 2019)' but does not provide version numbers for software such as Hugging Face Transformers, PyTorch, or the Python runtime. |
| Experiment Setup | Yes | We sample with temperature 0.7 when generating, and have a max sequence length of 1024 tokens. ... We optimize the postfix with Adam (Kingma & Ba, 2015), using learning rate 1e-4, no weight decay, epsilon 1e-7... Specifically, we use a learning rate of 5e-5, warmup ratio of 0.03, weight decay of 0, and we run fine-tuning for one epoch with batch size of 12 per device... (See the optimization and fine-tuning sketches after the table.) |
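
The MS MARCO selection described in the Dataset Splits row can be mirrored with a short script. The sketch below is an illustration, not the authors' code: it assumes the `ms_marco` v2.1 release on the Hugging Face hub, and the `is_long_form` predicate and random seed are hypothetical stand-ins for the "long-form response" label and sampling procedure the excerpt does not spell out.

```python
# Sketch of selecting 1000 random long-form validation queries from MS MARCO.
# The is_long_form predicate and the seed are assumptions, not from the paper.
import random

from datasets import load_dataset

random.seed(0)  # assumed seed; the paper does not report one

marco_val = load_dataset("ms_marco", "v2.1", split="validation")

def is_long_form(example) -> bool:
    """Hypothetical stand-in for the dataset's 'long-form response' label."""
    return example["query_type"] == "DESCRIPTION"  # assumption

long_form = [ex for ex in marco_val if is_long_form(ex)]
queries = random.sample(long_form, k=1000)  # 1000 random validation queries
```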
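
The quoted optimizer settings for the postfix (Adam, learning rate 1e-4, no weight decay, epsilon 1e-7) would look roughly like the following. This is a minimal sketch that assumes the postfix is a trainable block of soft embeddings appended to each synthetic-task prompt; the model identifier, postfix length, and loss wiring are assumptions, and `loss_on_batch` is left as a placeholder rather than the authors' implementation.

```python
# Sketch of optimizing a soft postfix with the quoted Adam settings
# (lr 1e-4, weight decay 0, epsilon 1e-7). Everything beyond those three
# hyperparameters is an assumption, not the authors' implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-13b-v1.1"  # hypothetical identifier for Vicuna 13B v1.1
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.requires_grad_(False)  # only the postfix parameters are updated here

embed_dim = model.get_input_embeddings().embedding_dim
postfix_len = 16  # assumed length; the excerpt does not report one
postfix = torch.nn.Parameter(torch.randn(postfix_len, embed_dim) * 0.01)

optimizer = torch.optim.Adam([postfix], lr=1e-4, weight_decay=0.0, eps=1e-7)

def loss_on_batch(batch):
    """Placeholder: embed the synthetic-task prompt, append `postfix` to the
    input embeddings, and return the LM loss on the retrieval target."""
    raise NotImplementedError

synthetic_task_batches = []  # placeholder: batches from the synthetic retrieval task

for batch in synthetic_task_batches:
    optimizer.zero_grad()
    loss = loss_on_batch(batch)
    loss.backward()
    optimizer.step()
```

At evaluation time, the quoted sampling settings correspond to calling `model.generate` with `do_sample=True`, `temperature=0.7`, and a maximum length of 1024 tokens.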
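
The fine-tuning hyperparameters in the Experiment Setup row map onto a Hugging Face `TrainingArguments` object as sketched below. Only the five quoted values come from the paper; the output directory and precision flag are assumptions.

```python
# Sketch of the quoted fine-tuning configuration (lr 5e-5, warmup ratio 0.03,
# weight decay 0, one epoch, per-device batch size 12). Other fields are assumed.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./syntra-finetune",   # hypothetical output path
    learning_rate=5e-5,
    warmup_ratio=0.03,
    weight_decay=0.0,
    num_train_epochs=1,
    per_device_train_batch_size=12,
    bf16=True,                        # assumption: mixed precision on the four A100s
)
```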