FactCHD: Benchmarking Fact-Conflicting Hallucination Detection

Authors: Xiang Chen, Duanzheng Song, Honghao Gui, Chenxi Wang, Ningyu Zhang, Yong Jiang, Fei Huang, Chengfei Lyu, Dan Zhang, Huajun Chen

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on different LLMs expose the shortcomings of current approaches in detecting factual errors accurately.
Researcher Affiliation | Collaboration | Xiang Chen (1,4), Duanzheng Song (2), Honghao Gui (1,4), Chenxi Wang (2,4), Ningyu Zhang (2,4), Yong Jiang (3), Fei Huang (3), Chengfei Lyu (3), Dan Zhang (2), Huajun Chen (1,4). 1 College of Computer Science and Technology, Zhejiang University; 2 School of Software Technology, Zhejiang University; 3 Alibaba Group; 4 Zhejiang University - Ant Group Joint Research Center for Knowledge Graphs.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Data is available at https://github.com/zjunlp/FactCHD.
Open Datasets | Yes | Data is available at https://github.com/zjunlp/FactCHD.
Dataset Splits | No | Table 2 (data statistics of FACTCHD): Train 51,383 total (VAN. 31,986, MULTI. 8,209, COMP. 5,691, SET-OP. 5,497); Test 6,960 total (VAN. 4,451, MULTI. 1,013, COMP. 706, SET-OP. 790).
Hardware Specification | No | Using Azure's OpenAI ChatGPT API, we generate samples with a temperature of 1.0 to control the diversity of generated samples, while limiting the maximum number of tokens to 2048 to ensure concise responses.
Software Dependencies | No | We evaluate various leading LLMs on the FACTCHD benchmark, focusing on OpenAI API models, including text-davinci-003 (InstructGPT) and GPT-3.5-turbo (ChatGPT). Additionally, we explore the adoption of open-source models such as Llama2-chat, Alpaca [Taori et al., 2023] and Vicuna [Chiang et al., 2023], which are fine-tuned variants of LLaMA [2023].
Experiment Setup | Yes | Using Azure's OpenAI ChatGPT API, we generate samples with a temperature of 1.0 to control the diversity of generated samples, while limiting the maximum number of tokens to 2048 to ensure concise responses. We use a frequency penalty of zero and a top-p of 1.0 to ensure unrestricted token selection during generation. For evaluations, we standardize the temperature at 0.2 to minimize randomness.
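The decoding parameters quoted above (generation: temperature 1.0, max 2048 tokens, top-p 1.0, frequency penalty 0; evaluation: temperature 0.2) can be pinned down in a short sketch. This is a minimal illustration using the OpenAI Python client, not the authors' actual code: the model name, prompts, and client setup are assumptions (the paper uses Azure's ChatGPT API, whose deployment configuration is not specified); only the sampling parameters come from the paper.

```python
# Sketch of the reported sampling settings with the OpenAI Python client (>= 1.0).
# Model name and prompts are placeholders; only the decoding parameters are from the paper.
from openai import OpenAI

client = OpenAI()  # the paper uses Azure's API; an AzureOpenAI client would be configured here instead


def generate_sample(prompt: str) -> str:
    """Data generation: temperature 1.0, max 2048 tokens, top-p 1.0, frequency penalty 0."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder for the ChatGPT deployment used by the authors
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        max_tokens=2048,
        top_p=1.0,
        frequency_penalty=0.0,
    )
    return resp.choices[0].message.content


def evaluate_claim(prompt: str) -> str:
    """Evaluation: temperature fixed at 0.2 to minimize randomness."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content
```

The two functions only differ in temperature, mirroring the paper's split between diverse sample generation and low-randomness evaluation runs.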