Statistical Knowledge Assessment for Large Language Models
Authors: Qingxiu Dong, Jingjing Xu, Lingpeng Kong, Zhifang Sui, Lei Li
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our assessment suite contains a comprehensive set of 994,123 entities and 600 relations, with 1,395,905 text aliases. We use our method to evaluate 20 LLMs of various sizes, including LLaMA, Alpaca, OPT, etc. Experiments show that our results have a strong correlation (0.43 Kendall's τ) with the results of human assessment on LLMs. |
| Researcher Affiliation | Academia | National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Shanghai AI Lab; The University of Hong Kong; Carnegie Mellon University |
| Pseudocode | No | The paper provides mathematical derivations and a graphical model, but no pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our code and data are available at https://github.com/dqxiu/KAssess. |
| Open Datasets | Yes | We utilize T-REx knowledge graph [Elsahar et al., 2018] as our primary source of symbolic knowledge. (...) For the text forms of subjects and objects involved in calculating KaRR_s and KaRR_r, we search the entity aliases from Wikidata with Wikidata Integrator. |
| Dataset Splits | Yes | In our main experiments, we consider all 600 English relations available in T-REx and sample a maximum of 20 facts per relation, resulting in a total of 10,691 facts for knowledge assessment. (...) The methods evaluated through human evaluation are assessed using the same set of 410 randomly sampled facts. (...) We set 22 as the threshold of KaRR, which is chosen by aligning the proportion of GPT2-XL known facts distinguished by humans on a sampled set of 200 facts. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions software such as Flan-T5, Wikidata Integrator, Huggingface Models, and Metaseq, but does not provide version numbers for these dependencies, which limits reproducibility. |
| Experiment Setup | Yes | Our default sampling parameter, K, is set to 4. (...) We set 22 as the threshold for KaRR, and the threshold is chosen by aligning the proportion of GPT2-XL known facts distinguished by humans on a sampled set of facts. Also as a result of alignment with the human-recognized proportion, the threshold for the K-Prompts baseline implemented in this paper is set to 0.13. (...) In total, we obtained 4,140 valid manually annotated prompts for 410 facts in T-REx. Illustrative sketches of these procedures follow the table. |
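
The Research Type row quotes a 0.43 Kendall's τ between the automatic assessment and human judgments of the 20 evaluated LLMs. As a minimal sketch of how such a rank correlation is computed, the snippet below applies `scipy.stats.kendalltau` to two hypothetical per-model score lists; the numbers are placeholders, not the paper's data.

```python
# Minimal sketch: rank correlation between automatic and human scores.
# The score lists are hypothetical placeholders, not values from the paper.
from scipy.stats import kendalltau

karr_scores = [19.5, 13.4, 7.0, 5.5, 3.2]      # hypothetical automatic scores, one per LLM
human_scores = [0.62, 0.47, 0.31, 0.35, 0.12]  # hypothetical human-assessed knowledge scores

tau, p_value = kendalltau(karr_scores, human_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```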
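
The Open Datasets row notes that entity aliases are pulled from Wikidata with Wikidata Integrator. The sketch below illustrates the same alias-lookup step via the public Wikidata SPARQL endpoint rather than the Wikidata Integrator library; the helper name and query are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch: fetch English labels and aliases for a Wikidata entity
# via the public SPARQL endpoint (an alternative to Wikidata Integrator).
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def get_english_aliases(qid: str) -> list:
    """Return English label and aliases for a Wikidata entity id such as 'Q76'."""
    query = f"""
    SELECT ?alias WHERE {{
      {{ wd:{qid} rdfs:label ?alias . }} UNION {{ wd:{qid} skos:altLabel ?alias . }}
      FILTER (LANG(?alias) = "en")
    }}
    """
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "alias-lookup-sketch/0.1"},
    )
    resp.raise_for_status()
    return [b["alias"]["value"] for b in resp.json()["results"]["bindings"]]

# Example: aliases for Barack Obama (Q76).
print(get_english_aliases("Q76"))
```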
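
The Dataset Splits row describes sampling at most 20 facts per relation across the 600 T-REx relations. Below is a minimal sketch of that per-relation subsampling, assuming facts are (subject, relation, object) triples; the function name and fact format are hypothetical.

```python
# Minimal sketch: sample at most 20 facts per relation from a fact list.
import random
from collections import defaultdict

MAX_FACTS_PER_RELATION = 20

def sample_assessment_facts(facts, seed=0):
    """facts: iterable of (subject_id, relation_id, object_id) triples."""
    rng = random.Random(seed)
    by_relation = defaultdict(list)
    for fact in facts:
        by_relation[fact[1]].append(fact)

    sampled = []
    for relation, rel_facts in by_relation.items():
        k = min(MAX_FACTS_PER_RELATION, len(rel_facts))
        sampled.extend(rng.sample(rel_facts, k))
    return sampled
```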
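
The Experiment Setup row states that the KaRR threshold of 22 (and 0.13 for the K-Prompts baseline) is chosen by aligning the proportion of facts counted as known with the proportion humans judged GPT2-XL to know on a sampled fact set. The sketch below shows one way such proportion alignment could be implemented; the procedure is inferred from the quoted text, and the example fraction is made up.

```python
# Minimal sketch: pick a score threshold so that the share of facts scored
# above it matches the fraction of facts humans judged the model to know.

def align_threshold(scores, human_known_fraction):
    """Return a threshold t such that the share of scores >= t approximates
    the human-estimated fraction of known facts."""
    ranked = sorted(scores, reverse=True)
    n_known = round(human_known_fraction * len(ranked))
    if n_known == 0:
        return ranked[0] + 1  # no fact counted as known
    return ranked[n_known - 1]

# Hypothetical usage: KaRR scores for 200 sampled facts, with a made-up 35%
# of facts judged known by human annotators.
# threshold = align_threshold(gpt2_xl_scores, human_known_fraction=0.35)
```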