FactCHD: Benchmarking Fact-Conflicting Hallucination Detection

Authors: Xiang Chen, Duanzheng Song, Honghao Gui, Chenxi Wang, Ningyu Zhang, Yong Jiang, Fei Huang, Chengfei Lyu, Dan Zhang, Huajun Chen

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on different LLMs expose the shortcomings of current approaches in detecting factual errors accurately.
Researcher Affiliation | Collaboration | Xiang Chen (1,4), Duanzheng Song (2), Honghao Gui (1,4), Chenxi Wang (2,4), Ningyu Zhang (2,4), Yong Jiang (3), Fei Huang (3), Chengfei Lyu (3), Dan Zhang (2), Huajun Chen (1,4). 1 College of Computer Science and Technology, Zhejiang University; 2 School of Software Technology, Zhejiang University; 3 Alibaba Group; 4 Zhejiang University - Ant Group Joint Research Center for Knowledge Graphs.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Data is available at https://github.com/zjunlp/FactCHD.
Open Datasets | Yes | Data is available at https://github.com/zjunlp/FactCHD.
Dataset Splits | No | Table 2 (data statistics of FACTCHD): Train 51,383 total (VAN. 31,986, MULTI. 8,209, COMP. 5,691, SET-OP. 5,497); Test 6,960 total (VAN. 4,451, MULTI. 1,013, COMP. 706, SET-OP. 790).
Hardware Specification | No | Using Azure's OpenAI ChatGPT API, we generate samples with a temperature of 1.0 to control the diversity of generated samples, while limiting the maximum number of tokens to 2048 to ensure concise responses.
Software Dependencies | No | We evaluate various leading LLMs on the FACTCHD benchmark, focusing on OpenAI API models, including text-davinci-003 (InstructGPT) and GPT-3.5-turbo (ChatGPT). Additionally, we explore the adoption of open-source models such as Llama2-chat, Alpaca [Taori et al., 2023] and Vicuna [Chiang et al., 2023], which are fine-tuned variants of LLaMA [2023].
Experiment Setup | Yes | Using Azure's OpenAI ChatGPT API, we generate samples with a temperature of 1.0 to control the diversity of generated samples, while limiting the maximum number of tokens to 2048 to ensure concise responses. We use a frequency penalty of zero and a top-p of 1.0 to ensure unrestricted token selection during generation. For evaluations, we standardize the temperature at 0.2 to minimize randomness.
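The decoding parameters quoted above (generation: temperature 1.0, max 2048 tokens, top-p 1.0, frequency penalty 0; evaluation: temperature 0.2) can be pinned down in a short sketch. This is a minimal illustration using the OpenAI Python client, not the authors' actual code: the model name, prompts, and client setup are assumptions (the paper uses Azure's ChatGPT API, whose deployment configuration is not specified); only the sampling parameters come from the paper.

```python
# Sketch of the reported sampling settings with the OpenAI Python client (>= 1.0).
# Model name and prompts are placeholders; only the decoding parameters are from the paper.
from openai import OpenAI

client = OpenAI()  # the paper uses Azure's API; an AzureOpenAI client would be configured here instead


def generate_sample(prompt: str) -> str:
    """Data generation: temperature 1.0, max 2048 tokens, top-p 1.0, frequency penalty 0."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder for the ChatGPT deployment used by the authors
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        max_tokens=2048,
        top_p=1.0,
        frequency_penalty=0.0,
    )
    return resp.choices[0].message.content


def evaluate_claim(prompt: str) -> str:
    """Evaluation: temperature fixed at 0.2 to minimize randomness."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content
```

The two functions only differ in temperature, mirroring the paper's split between diverse sample generation and low-randomness evaluation runs.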