Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-Based Retrofitting

Authors: Xinyan Guan, Yanjiang Liu, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that KGR can significantly improve the performance of LLMs on factual QA benchmarks especially when involving complex reasoning processes, which demonstrates the necessity and effectiveness of KGR in mitigating hallucination and enhancing the reliability of LLMs. "We evaluate our KGR framework on three datasets with different levels of reasoning difficulty, including Simple Question (Bordes et al. 2015), Mintaka (Sen, Aji, and Saffari 2022), and Hotpot QA (Yang et al. 2018)."
Researcher Affiliation | Academia | Xinyan Guan1,2*, Yanjiang Liu1,2*, Hongyu Lin1, Yaojie Lu1, Ben He1,2, Xianpei Han1,3, Le Sun1,2; 1Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China; 2University of Chinese Academy of Sciences, Beijing, China; 3State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
Pseudocode | No | The paper describes the steps of the KGR framework in detail, supported by diagrams, but does not include formal pseudocode or algorithm blocks (see the hedged pipeline sketch after the table).
Open Source Code | No | The paper does not contain any statement about releasing the source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | We conduct experiments on three representative factual QA benchmarks, including: Simple Question (Bordes et al. 2015) is a simple QA dataset... Mintaka (Sen, Aji, and Saffari 2022) is a complex, natural and multilingual dataset... Hotpot QA (Yang et al. 2018) is a Wikipedia-based dataset... We choose Wikidata (Vrandečić and Krötzsch 2014) as our knowledge base... (A sketch of querying Wikidata appears after the table.)
Dataset Splits | Yes | We reported the results in terms of EM and F1 scores respectively on 50 samples from the validation set of each dataset. (A sketch of the standard EM/F1 computation appears after the table.)
Hardware Specification | No | The paper mentions evaluating on text-davinci-003, ChatGPT (gpt-3.5-turbo-0301), and Vicuna 13B, but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper refers to specific LLMs (e.g., text-davinci-003, ChatGPT, Vicuna 13B) and a knowledge base (Wikidata) but does not list any specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup | No | The 'Experiment Settings' section details the datasets and LLMs used, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, epochs, optimizer settings) or other detailed system-level training configurations.
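
Regarding the Pseudocode row: since the paper describes the KGR stages only in prose and figures, the following is a minimal sketch of what such pseudocode might look like. It assumes the draft, extract, select, verify, and retrofit loop over model-generated responses that the paper describes; every name below (kgr_answer, llm, kg.retrieve) is hypothetical and not taken from any released implementation.

```python
# Illustrative sketch of a KGR-style pipeline, not the authors' code.
# `llm` is any callable that maps a prompt string to a response string;
# `kg` is any object exposing a retrieve(claim) -> facts method.

def kgr_answer(question: str, llm, kg) -> str:
    """Return a knowledge-graph-retrofitted answer to `question`."""
    # 1. Draft: let the LLM answer the question directly.
    draft = llm(f"Answer the question: {question}")

    # 2. Claim extraction: ask the LLM to list factual claims in the draft.
    claims = llm(f"List the factual claims made in: {draft}").splitlines()

    # 3. Fact selection + 4. fact verification: retrieve candidate triples
    #    for each claim and ask the LLM whether the claim agrees with them.
    verdicts = []
    for claim in claims:
        triples = kg.retrieve(claim)  # e.g. Wikidata triples for detected entities
        verdict = llm(
            f"Claim: {claim}\nKG facts: {triples}\n"
            "Is the claim supported? Answer 'supported' or give a correction."
        )
        verdicts.append((claim, verdict))

    # 5. Retrofitting: ask the LLM to revise the draft using the verdicts.
    feedback = "\n".join(f"{c} -> {v}" for c, v in verdicts)
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification results:\n{feedback}\n"
        "Rewrite the answer so it is consistent with the verified facts."
    )
```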
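Regarding the Open Datasets row: the paper uses Wikidata as its knowledge base but does not show how facts are retrieved. As an illustration only, here is a sketch of pulling one-hop facts for an entity from the public Wikidata SPARQL endpoint; the helper name one_hop_facts and the user-agent string are hypothetical, and the authors' actual retrieval procedure may differ.

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

def one_hop_facts(qid: str, limit: int = 50):
    """Fetch (property label, value label) pairs for an entity's direct claims."""
    query = f"""
    SELECT ?pLabel ?oLabel WHERE {{
      wd:{qid} ?prop ?o .
      ?p wikibase:directClaim ?prop .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
    }} LIMIT {limit}
    """
    resp = requests.get(
        WDQS,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kgr-repro-check/0.1"},  # WDQS expects a user agent
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [(r["pLabel"]["value"], r["oLabel"]["value"]) for r in rows]

# Usage: one_hop_facts("Q42") returns facts about Douglas Adams.
```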
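Regarding the Dataset Splits row: the paper reports EM and F1 on 50 validation samples per dataset but does not restate the metric definitions. Below is a sketch of the standard SQuAD-style exact match and token-level F1, which we assume, but cannot confirm, matches the paper's scoring.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```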