Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models

Authors: Qitan Lv, Jie Wang, Hanzhu Chen, Bin Li, Yongdong Zhang, Feng Wu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the knowledge hallucination benchmark demonstrate the effectiveness of COFT, leading to a superior performance over 30% in the F1 score metric. Moreover, COFT also exhibits remarkable versatility across various long-form tasks, such as reading comprehension and question answering.
Researcher Affiliation | Academia | CAS Key Laboratory of Technology in GIPAS & MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China.
Pseudocode | Yes | Algorithm 1: Pseudo code for entity-level iterative algorithm
Open Source Code | No | The paper mentions implementing their approach based on PyTorch and Huggingface's Transformers (third-party libraries), but does not state that their own COFT implementation is open-source or provide a link to it.
Open Datasets | Yes | For knowledge hallucination, we use FELM (Chen et al., 2023c) as the benchmark... For reading comprehension, we use RACE-H (high school level reading comprehension) and RACE-M (middle school level reading comprehension) (Lai et al., 2017)... For question answering, we use Natural Questions (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), and WebQ (Berant et al., 2013) as our benchmarks.
Dataset Splits | Yes | Table 5. Statistics of the reading comprehension benchmarks, RACE-H and RACE-M. The values below the Training/Valid/Testing Set are the number of passages and questions in each dataset, respectively.
Hardware Specification | Yes | All experiments were performed on four Nvidia A100 GPUs (80GB).
Software Dependencies | Yes | We implement our approach based on PyTorch 1.13.0 and Huggingface's Transformers.
Experiment Setup | Yes | To guarantee stable and reproducible results, we utilize greedy decoding and set the temperature parameter as 0 in all experiments. ... For the small language models used for calculating self-information, we apply LLaMA-7B.
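
The experiment-setup row above mentions greedy decoding with temperature 0 and a small language model (LLaMA-7B) for computing self-information. The sketch below is a rough illustration of that setup under Huggingface Transformers, not the authors' released COFT code: it scores per-token self-information with a small causal LM and decodes greedily. The checkpoint id huggyllama/llama-7b and the helper name token_self_information are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id; the paper only says "LLaMA-7B", so any small causal LM can stand in.
MODEL_NAME = "huggyllama/llama-7b"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)
model.eval()

@torch.no_grad()
def token_self_information(text: str):
    """Per-token self-information -log2 p(token | prefix) under the small LM, in bits."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    logits = model(**enc).logits                        # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], -1)   # prefix up to t predicts token t+1
    targets = enc["input_ids"][:, 1:]
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    bits = (-token_logp / torch.log(torch.tensor(2.0))).squeeze(0)
    return list(zip(tokenizer.convert_ids_to_tokens(targets[0]), bits.tolist()))

# Deterministic (greedy) decoding, i.e. the "temperature = 0" setting in sampling-API terms.
prompt = "Question: Who wrote Hamlet?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
print(token_self_information("William Shakespeare wrote Hamlet."))
```

Tokens with high self-information (low probability under the small LM) are the kind of informative spans a highlighting scheme would prioritize, while greedy decoding keeps the generation deterministic across reruns, matching the reproducibility claim quoted above.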