FactCHD: Benchmarking Fact-Conflicting Hallucination Detection
Authors: Xiang Chen, Duanzheng Song, Honghao Gui, Chenxi Wang, Ningyu Zhang, Yong Jiang, Fei Huang, Chengfei Lyu, Dan Zhang, Huajun Chen
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on different LLMs expose the shortcomings of current approaches in detecting factual errors accurately. |
| Researcher Affiliation | Collaboration | Xiang Chen1,4, Duanzheng Song2, Honghao Gui1,4, Chenxi Wang2,4, Ningyu Zhang2,4, Yong Jiang3, Fei Huang3, Chengfei Lyu3, Dan Zhang2, Huajun Chen1,4; 1College of Computer Science and Technology, Zhejiang University; 2School of Software Technology, Zhejiang University; 3Alibaba Group; 4Zhejiang University - Ant Group Joint Research Center for Knowledge Graphs |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data is available at https://github.com/zjunlp/FactCHD. |
| Open Datasets | Yes | Data is available at https://github.com/zjunlp/FactCHD. |
| Dataset Splits | No | Table 2 (Data Statistics of our FACTCHD) reports, per split, #Sample with the VAN./MULTI./COMP./SET-OP. breakdown: Train 51,383 (31,986 / 8,209 / 5,691 / 5,497); Test 6,960 (4,451 / 1,013 / 706 / 790). |
| Hardware Specification | No | Using Azure's OpenAI ChatGPT API, we generate samples with a temperature of 1.0 to control the diversity of generated samples, while limiting the maximum number of tokens to 2048 to ensure concise responses. |
| Software Dependencies | No | We evaluate various leading LLMs on FACTCHD benchmark, focusing on OpenAI API models, including text-davinci-003 (InstructGPT) and GPT-3.5-turbo (ChatGPT). Additionally, we explore the adoption of open-source models such as Llama2-chat, Alpaca [Taori et al., 2023] and Vicuna [Chiang et al., 2023], which are fine-tuned variants of the LLaMA [2023]. |
| Experiment Setup | Yes | Using Azure's OpenAI ChatGPT API, we generate samples with a temperature of 1.0 to control the diversity of generated samples, while limiting the maximum number of tokens to 2048 to ensure concise responses. We use a frequency penalty of zero and a Top-p of 1.0 to ensure unrestricted token selection during generation. For evaluations, we standardize the temperature at 0.2 to minimize randomness. (A sketch of these sampling settings follows the table.) |
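
The quoted setup maps directly onto standard OpenAI chat-completion parameters. The sketch below is not the authors' code: the client choice, model name (`gpt-3.5-turbo`), helper function, and placeholder prompts are assumptions; only the sampling values (temperature, max tokens, frequency penalty, top-p) come from the paper.

```python
# Minimal sketch of the reported sampling settings (not the authors' code).
# Assumes the `openai` Python SDK (v1.x); the paper uses Azure's OpenAI
# endpoint, for which `openai.AzureOpenAI` would replace `OpenAI` below.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Generation settings reported for building FACTCHD samples.
GEN_KWARGS = dict(
    temperature=1.0,        # diversity of generated samples
    max_tokens=2048,        # cap response length
    frequency_penalty=0.0,  # no repetition penalty
    top_p=1.0,              # unrestricted nucleus sampling
)

# Evaluation setting: lower temperature to minimize randomness.
EVAL_KWARGS = dict(temperature=0.2, max_tokens=2048)

def chat(prompt: str, model: str = "gpt-3.5-turbo", **kwargs) -> str:
    """Send a single-turn chat request and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return response.choices[0].message.content

# Example calls (placeholder prompts, not from the paper):
# sample  = chat("Generate a fact-conflicting answer about ...", **GEN_KWARGS)
# verdict = chat("Does the following answer contain a factual error? ...", **EVAL_KWARGS)
```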