Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph

Authors: Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, Jian Guo

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We use a number of well-designed experiments to examine and illustrate the following advantages of ToG: 1) compared with LLMs, ToG has better deep reasoning power; 2) ToG has the ability of knowledge traceability and knowledge correctability by leveraging LLMs' reasoning and expert feedback; 3) ToG provides a flexible plug-and-play framework for different LLMs, KGs and prompting strategies without any additional training cost; 4) the performance of ToG with small LLM models could exceed large LLMs such as GPT-4 in certain scenarios, and this reduces the cost of LLM deployment and application. As a training-free method with lower computational cost and better generality, ToG achieves overall SOTA in 6 out of 9 datasets where most previous SOTAs rely on additional training."
Researcher Affiliation | Collaboration | Jiashuo Sun (2,1), Chengjin Xu (1), Lumingyuan Tang (3,1), Saizhuo Wang (4,1), Chen Lin (2), Yeyun Gong (6), Lionel M. Ni (5), Heung-Yeung Shum (1,4), Jian Guo (1,5). 1: IDEA Research, International Digital Economy Academy; 2: Xiamen University; 3: University of Southern California; 4: The Hong Kong University of Science and Technology; 5: The Hong Kong University of Science and Technology (Guangzhou); 6: Microsoft Research Asia.
Pseudocode | Yes | "Algorithm 1 and 2 show the implementation details of the ToG and ToG-R." (A hedged sketch of this exploration-and-reasoning loop appears after the table.)
Open Source Code | Yes | "Our code is publicly available at https://github.com/IDEA-FinAI/ToG."
Open Datasets | Yes | "In order to test ToG's ability on multi-hop knowledge-intensive reasoning tasks, we evaluate ToG on five KBQA datasets (4 Multi-hop and 1 Single-hop): CWQ (Talmor & Berant, 2018), WebQSP (Yih et al., 2016), GrailQA (Gu et al., 2021), QALD10-en (Perevalov et al., 2022), SimpleQuestions (Bordes et al., 2015)."
Dataset Splits | No | "For two big datasets GrailQA and SimpleQuestions, we only randomly selected 1,000 samples each for testing in order to save computational cost." (A sketch of one reproducible way to draw such a subset follows the table.)
Hardware Specification | Yes | "Llama2-70B-Chat (Touvron et al., 2023) runs with 8 A100-40G without quantization, where the temperature parameter is set to 0.4 for exploration process (increasing diversity) and set to 0 for reasoning process (guaranteeing reproducibility)."
Software Dependencies | No | The paper names the LLMs (ChatGPT, GPT-4, Llama2-70B-Chat) and KGs (Freebase, Wikidata) it uses, but does not list software dependencies with specific version numbers (e.g., Python or PyTorch versions).
Experiment Setup | Yes | "Llama2-70B-Chat (Touvron et al., 2023) runs with 8 A100-40G without quantization, where the temperature parameter is set to 0.4 for exploration process (increasing diversity) and set to 0 for reasoning process (guaranteeing reproducibility). The maximum token length for the generation is set to 256. In all experiments, we set both width N and depth Dmax to 3 for beam search. Freebase (Bollacker et al., 2008) is used as KG for CWQ, WebQSP, GrailQA, SimpleQuestions, and WebQuestions, and Wikidata (Vrandečić & Krötzsch, 2014) is used as KG for QALD10-en, T-REx, Zero-Shot RE and Creak. We use 5 shots in ToG-reasoning prompts for all the datasets." (A sketch of how these decoding settings could be wired up follows the table.)
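
The paper gives Algorithms 1 and 2 only as pseudocode. Below is a minimal Python sketch of the ToG exploration-and-reasoning loop as described in the rows above: beam search of width N over KG triples, with prompted LLM calls for pruning, sufficiency checking, and answering. All helper callables (kg_neighbors, llm_prune, llm_sufficient, llm_answer) are hypothetical stand-ins for the paper's prompted LLM calls and KG interface, not the authors' released code.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]   # (head entity, relation, tail entity)
Path = List[Triple]             # one reasoning path on the KG


def think_on_graph(
    question: str,
    topic_entities: List[str],
    kg_neighbors: Callable[[str], List[Triple]],         # hypothetical KG lookup
    llm_prune: Callable[[str, List[Path]], List[Path]],  # prompted LLM scoring
    llm_sufficient: Callable[[str, List[Path]], bool],   # prompted LLM check
    llm_answer: Callable[[str, List[Path]], str],        # prompted LLM answer
    width: int = 3,       # beam width N (the paper uses 3)
    max_depth: int = 3,   # maximum depth Dmax (the paper uses 3)
) -> str:
    """Alternate LLM-guided exploration and reasoning on the KG until the
    retrieved paths suffice to answer, or the depth limit is reached."""
    # Start one trivial path per topic entity identified in the question.
    beams: List[Path] = [[("", "", e)] for e in topic_entities]

    for _ in range(max_depth):
        # Exploration: expand each beam by the outgoing triples of its
        # frontier entity, then let the LLM keep the top-`width` paths.
        candidates: List[Path] = []
        for path in beams:
            frontier = path[-1][2]
            for triple in kg_neighbors(frontier):
                candidates.append(path + [triple])
        beams = llm_prune(question, candidates)[:width]

        # Reasoning: ask the LLM whether the current paths already
        # contain enough knowledge to answer the question.
        if llm_sufficient(question, beams):
            return llm_answer(question, beams)

    # Depth limit reached: answer from what was found plus the LLM's
    # own knowledge, as the paper's fallback describes.
    return llm_answer(question, beams)
```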
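Because no fixed split is published for the two sampled datasets, a reproducer must draw their own 1,000-example subsets. A minimal sketch of one reproducible way to do so, assuming a JSON-lines test file; the seed value is our choice, not the paper's:

```python
import json
import random

# Draw a fixed 1,000-example subset of a large test file. The seed is
# our assumption for reproducibility; the paper does not report one.
def sample_test_subset(path: str, k: int = 1000, seed: int = 0) -> list:
    with open(path) as f:
        examples = [json.loads(line) for line in f]  # one JSON object per line
    rng = random.Random(seed)                        # fixed seed -> same subset
    return rng.sample(examples, k)
```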
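Finally, the decoding hyperparameters quoted in the Experiment Setup row can be collected in one place. The sketch below packages them and builds a Hugging Face transformers GenerationConfig per phase; the dataclass and its field names are our own illustration, not the released code. At call time the result would be passed as model.generate(..., generation_config=...).

```python
from dataclasses import dataclass
from transformers import GenerationConfig  # assumes the transformers library


@dataclass
class ToGHyperparams:
    """Hyperparameters quoted in the paper's experiment setup."""
    width: int = 3                     # beam width N
    max_depth: int = 3                 # maximum exploration depth Dmax
    num_shots: int = 5                 # in-context examples in reasoning prompts
    explore_temperature: float = 0.4   # diversity during exploration
    reason_temperature: float = 0.0    # determinism during reasoning
    max_new_tokens: int = 256


def generation_config(hp: ToGHyperparams, exploring: bool) -> GenerationConfig:
    temp = hp.explore_temperature if exploring else hp.reason_temperature
    if temp > 0:
        # Exploration: sample at temperature 0.4 to diversify candidate paths.
        return GenerationConfig(do_sample=True, temperature=temp,
                                max_new_tokens=hp.max_new_tokens)
    # Reasoning: greedy decoding (temperature 0) for reproducibility.
    return GenerationConfig(do_sample=False, max_new_tokens=hp.max_new_tokens)
```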