Code-Style In-Context Learning for Knowledge-Based Question Answering
Authors: Zhijie Nie, Richong Zhang, Zhongyuan Wang, Xudong Liu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on three mainstream datasets show that our method dramatically mitigated the formatting error problem in generating logic forms while realizing a new SOTA on WebQSP, GrailQA, and GraphQ under the few-shot setting. |
| Researcher Affiliation | Academia | Zhijie Nie¹,³, Richong Zhang¹,²*, Zhongyuan Wang¹, Xudong Liu¹. ¹SKLSDE, School of Computer Science and Engineering, Beihang University, Beijing, China; ²Zhongguancun Laboratory, Beijing, China; ³Shen Yuan Honors College, Beihang University, Beijing, China. {niezj,zhangrc,wangzy23,liuxd}@act.buaa.edu.cn |
| Pseudocode | Yes | Finally, the Python implementation of the meta-functions, all (question, function call sequence) pairs, and the test question are reformatted in code form as input to the LLM, and the LLM is expected to complete the function call sequence for the new question to obtain the correct logic form. ... the code implementation of seven meta-functions. ... the complete contents of I are shown in the code from line 1 to line 26 in Figure 4. ... Finally, the demo example corresponding to Figure 2 is reformulated as the code from line 28 to line 35 in Figure 4. (A hedged prompt-construction sketch follows the table.) |
| Open Source Code | Yes | The code and supplementary files are released at https://github.com/Arthurizijar/KB-Coder. |
| Open Datasets | Yes | We use three mainstream datasets in KBQA, WebQSP (Yih et al. 2016), GraphQ (Su et al. 2016), and GrailQA (Gu et al. 2021), which represent the three generalization capabilities of i.i.d., compositional, and zero-shot, respectively, to evaluate the effect of KB-Coder. |
| Dataset Splits | Yes | Consistent with previous works (Yu et al. 2022; Li et al. 2023c), we report F1 score on WebQSP and GraphQ, and Exact Match (EM) and F1 score on GrailQA as performance metrics. ... Consistent with KB-BINDER (Li et al. 2023b), we conduct 100-shot for WebQSP and GraphQ, and 40-shot for GrailQA. (A metric sketch follows the table.) |
| Hardware Specification | No | Due to the deprecation of the Codex family of models, we select gpt-3.5-turbo from OpenAI for our experiments. In all experiments, we used the official API to obtain model results, where temperature is set to 0.7, max_tokens is set to 300, and other parameters are kept at default values. (An API-call sketch follows the table.) |
| Software Dependencies | No | In practice, we use the S-Expression defined by Gu et al. (2021) as the logical form l due to its simplicity. ... Due to the successful practice of Codex (Chen et al. 2021) in Python, we select Python to implement these functions. ... for entity linking, we first convert all surface names of all entities in the KB into representations with the off-the-shelf embedding model SimCSE (Gao, Yao, and Chen 2021), and build the entity index with Faiss (Johnson, Douze, and Jégou 2019). (An entity-index sketch follows the table.) |
| Experiment Setup | Yes | In all experiments, we used the official API to obtain model results, where temperature is set to 0.7, max_tokens is set to 300, and other parameters are kept at default values. ... Without special instructions, we report the experiment results with M_e = 15 and M_r = 100. |
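
The Pseudocode row describes prompts that prepend the Python source of the meta-functions, then list demos as (question, function call sequence) pairs, then leave the test question open for the LLM to complete. Below is a minimal sketch of that construction. The three meta-functions here (`START`, `JOIN`, `AND`) are illustrative assumptions, not the paper's exact set of seven, and `build_prompt` is a hypothetical helper.

```python
import inspect

# Hypothetical meta-functions in the spirit of KB-Coder's S-Expression
# operators; the paper defines seven, these three are illustrative only.
def START(entity: str) -> str:
    """Begin a logical form from a topic entity."""
    return f"(START {entity})"

def JOIN(relation: str, expression: str) -> str:
    """Extend an expression along a KB relation."""
    return f"(JOIN {relation} {expression})"

def AND(expr_a: str, expr_b: str) -> str:
    """Intersect two sub-expressions."""
    return f"(AND {expr_a} {expr_b})"

# The instruction block: the Python source of the meta-functions themselves.
META_FUNCTIONS = "\n".join(inspect.getsource(f) for f in (START, JOIN, AND))

def build_prompt(demos: list[tuple[str, str]], question: str) -> str:
    """Format demos as (question, function-call sequence) pairs in code form,
    then append the test question for the LLM to complete."""
    parts = [META_FUNCTIONS]
    for demo_question, call_sequence in demos:
        parts.append(f"# question: {demo_question}")
        parts.append(call_sequence)
    parts.append(f"# question: {question}")
    parts.append("expression = ")  # left open for the LLM to fill in
    return "\n".join(parts)

demo = ("what team did michael jordan play for",
        'expression = JOIN("sports.pro_athlete.teams", START("m.jordan"))')
print(build_prompt([demo], "who wrote the harry potter books"))
```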
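
For the Dataset Splits row's metrics, here is a hedged sketch of answer-set F1 and exact match. The F1 definition over predicted and gold answer sets is the common KBQA convention; note that GrailQA's official EM compares logical forms rather than answer sets, so this set-based EM is only an approximation.

```python
def answer_f1(pred: set[str], gold: set[str]) -> float:
    """F1 between predicted and gold answer sets (common KBQA convention)."""
    if not pred or not gold:
        return float(pred == gold)  # both empty counts as a perfect match
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def exact_match(pred: set[str], gold: set[str]) -> float:
    """1.0 iff the answer sets coincide (approximation: GrailQA's official
    EM compares logical forms, not answer sets)."""
    return float(pred == gold)

assert answer_f1({"m.a", "m.b"}, {"m.b", "m.c"}) == 0.5
assert exact_match({"m.a"}, {"m.a"}) == 1.0
```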
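
The decoding settings quoted in the Hardware Specification and Experiment Setup rows translate directly into an API call. A minimal sketch using the current openai Python client follows; only the model name, temperature, and max_tokens are taken from the paper, and the client interface here may differ from the official API as it existed at the time of the experiments.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(prompt: str) -> str:
    """Query gpt-3.5-turbo with the paper's reported decoding settings."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # reported in the paper
        max_tokens=300,   # reported in the paper
    )
    return response.choices[0].message.content
```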
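
For the entity-linking pipeline in the Software Dependencies row (SimCSE embeddings of entity surface names indexed with Faiss), here is a hedged sketch. The checkpoint name is an assumption, since the paper only says "off-the-shelf" SimCSE, and the tiny surface-name list stands in for the full KB vocabulary.

```python
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; the paper only says "off-the-shelf" SimCSE.
CKPT = "princeton-nlp/sup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
encoder = AutoModel.from_pretrained(CKPT)

def embed(texts: list[str]):
    """SimCSE-encode texts and L2-normalize so inner product = cosine."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        vectors = encoder(**batch).pooler_output
    return torch.nn.functional.normalize(vectors, dim=-1).numpy()

# Illustrative surface names standing in for all entity names in the KB.
surface_names = ["Michael Jordan", "Michael B. Jordan", "Chicago Bulls"]
index = faiss.IndexFlatIP(768)  # BERT-base hidden size
index.add(embed(surface_names))

# Link a mention by nearest-neighbor search over the entity index.
scores, ids = index.search(embed(["M. Jordan"]), k=2)
print([surface_names[i] for i in ids[0]])
```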