OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs via Ontology-Driven Reinforcement Learning

Authors: Ziyu Shang, Wenjun Ke, Nana Xiu, Peng Wang, Jiajun Liu, Yanhui Li, Zhizhao Luo, Ke Ji

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on 5 datasets, using 32 representative LLMs, reveal a general lack of factual knowledge in current LLMs. Notably, ChatGPT exhibits fact error rates of 51.6% on DBpedia and 64.7% on YAGO, respectively. Additionally, the ORL mechanism demonstrates promising error prediction scores, with F1 scores ranging from 70% to 90% across most LLMs.
Researcher Affiliation | Academia | Ziyu Shang¹*, Wenjun Ke¹,²*, Nana Xiu³, Peng Wang¹,², Jiajun Liu¹, Yanhui Li⁴, Zhizhao Luo⁵, Ke Ji¹. Affiliations: ¹School of Computer Science and Engineering, Southeast University; ²Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; ³School of Cyber Science and Engineering, Southeast University; ⁴State Key Laboratory for Novel Software Technology, Nanjing University; ⁵Beijing Institute of Computer Technology and Application
Pseudocode | Yes | Algorithm 1: Ontology-Driven Reinforcement Learning
Open Source Code | Yes | We open source 5 large-scale and wide-ranging fact-detection benchmarks to facilitate future research, and offer feasible insights for tackling LLM hallucination. Code and benchmarks: https://github.com/seukgcode/OntoFact
Open Datasets | Yes | We open source 5 large-scale and wide-ranging fact-detection benchmarks to facilitate future research (https://github.com/seukgcode/OntoFact). To investigate the factuality of LLMs on general knowledge, we employ three large-scale KGs, where DBpedia (Lehmann et al. 2015) and YAGO 4.5 (Pellissier Tanon, Weikum, and Suchanek 2020) are in English (ENG) while CN-DBpedia (Xu et al. 2017) is in Chinese (CHS). For specific domains, we adopt a bilingual biomedical KG (Yu et al. 2022), i.e., BIOS 2.2 (ENG) and BIOS 2.2 (CHS).
Dataset Splits | Yes | Specifically, we randomly select one-third of the datasets for training, and calculate the above metric of ORL on the remaining two-thirds (a split sketch is given below the table).
Hardware Specification | Yes | All experiments are implemented on the NVIDIA A100 (80GB) GPU.
Software Dependencies | Yes | For the English and Chinese datasets, we utilize the t5_xxl_true_nli_mixture and Erlangshen-MegatronBert-1.3B-NLI models, respectively (an entailment-scoring sketch follows the table).
Experiment Setup | Yes | In all experiments, for the embedding of instance graphs, the embedding size of entities and relations is 300 and 100, respectively. For the embedding of ontology graphs supplemented with ontology-level triples, the embedding size of concepts and properties is 100. The instance-view agent is a two-layer MLP whose hidden layer uses the ReLU activation, with the number of hidden units kept consistent with the input dimension. The value of γ in the total reward R(τ) for each ontology-level triple is 0.95, the value of α is 12.0, and the threshold c is set to 0.5. In the ontology-view agent, both the actor network and the critic network are two-layer MLPs whose hidden layers use ReLU, with the number of hidden units kept consistent with the input dimension. The size of M in the optimized ontology-view agent is 2. The value of γ used in the optimized critic network is 0.95. The value of β used in the soft update of the target actor-critic network is 0.001. Moreover, three Adam optimizers with a learning rate of 1e-4 are used in ORL to optimize the actor network, the critic network in the ontology-view agent, and the instance-view agent, respectively (a configuration sketch follows the table).
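
For the Dataset Splits row, the one-third training / two-thirds evaluation partition can be reproduced with a minimal sketch like the one below. The function name, random seed, and the assumption that each benchmark is a list of fact-detection triples are illustrative and not taken from the released code.

```python
import random

def split_for_orl(triples, train_frac=1/3, seed=42):
    """Shuffle a benchmark and split it into training and evaluation parts."""
    rng = random.Random(seed)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    # One-third used for training ORL, the remaining two-thirds for evaluation.
    return shuffled[:cut], shuffled[cut:]
```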
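For the Software Dependencies row, the sketch below shows one plausible way to score entailment between an LLM response and a reference fact with the T5-based NLI model named in the paper. The Hugging Face checkpoint ID and the "premise: ... hypothesis: ..." input format (with a "1"/"0" output) are assumptions about the TRUE NLI mixture checkpoints, not details confirmed by the paper.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint name; the paper only cites "t5_xxl_true_nli_mixture".
MODEL_ID = "google/t5_xxl_true_nli_mixture"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

def entails(premise: str, hypothesis: str) -> bool:
    """Return True if the model judges that the premise entails the hypothesis."""
    text = f"premise: {premise} hypothesis: {hypothesis}"
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=2)
    # TRUE-style NLI checkpoints are trained to emit "1" for entailment.
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip() == "1"
```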
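For the Experiment Setup row, the configuration sketch below collects the stated hyper-parameters and the two-layer MLP agents in one place. Only the dimensions, activation, γ, α, c, M, β, and learning rate come from the text; class names, output sizes, and the wiring of the agents are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Embedding sizes and hyper-parameters quoted in the Experiment Setup row.
ENTITY_DIM, RELATION_DIM = 300, 100      # instance-graph embeddings
CONCEPT_DIM = PROPERTY_DIM = 100         # ontology-graph embeddings
GAMMA, ALPHA, THRESHOLD_C = 0.95, 12.0, 0.5
M, BETA, LR = 2, 0.001, 1e-4

class TwoLayerMLP(nn.Module):
    """Two-layer MLP whose hidden width matches the input and uses ReLU."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim), nn.ReLU(), nn.Linear(in_dim, out_dim)
        )
    def forward(self, x):
        return self.net(x)

# Output sizes are placeholders; the paper does not state them in this row.
instance_agent = TwoLayerMLP(ENTITY_DIM + RELATION_DIM, 1)
actor = TwoLayerMLP(CONCEPT_DIM + PROPERTY_DIM, 1)
critic = TwoLayerMLP(CONCEPT_DIM + PROPERTY_DIM, 1)

# Three Adam optimizers with lr = 1e-4, one per network, as described.
optimizers = [torch.optim.Adam(m.parameters(), lr=LR)
              for m in (actor, critic, instance_agent)]

def soft_update(target: nn.Module, source: nn.Module, beta: float = BETA):
    """Soft update of the target network with coefficient beta = 0.001."""
    for t, s in zip(target.parameters(), source.parameters()):
        t.data.mul_(1 - beta).add_(beta * s.data)
```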