Statistical Knowledge Assessment for Large Language Models
Authors: Qingxiu Dong, Jingjing Xu, Lingpeng Kong, Zhifang Sui, Lei Li
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our assessment suite contains a comprehensive set of 994,123 entities and 600 relations, with 1,395,905 text aliases. We use our method to evaluate 20 LLMs of various sizes, including LLaMA, Alpaca, OPT, etc. Experiments show that our results have a strong correlation (0.43 Kendall's τ) with the results of human assessment on LLMs. |
| Researcher Affiliation | Academia | National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Shanghai AI Lab; The University of Hong Kong; Carnegie Mellon University |
| Pseudocode | No | The paper provides mathematical derivations and a graphical model, but no pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our code and data are available at https://github.com/dqxiu/KAssess. |
| Open Datasets | Yes | We utilize T-REx knowledge graph [Elsahar et al., 2018] as our primary source of symbolic knowledge. (...) For the text forms of subjects and objects involved in calculating KaRR_s and KaRR_r, we search the entity aliases from Wikidata with Wikidata Integrator. |
| Dataset Splits | Yes | In our main experiments, we consider all 600 English relations available in T-REx and sample a maximum of 20 facts per relation, resulting in a total of 10,691 facts for knowledge assessment. (...) The methods evaluated through human evaluation are assessed using the same set of 410 randomly sampled facts. (...) We set 22 as the threshold of KaRR, which is chosen by aligning the proportion of GPT2-XL known facts distinguished by humans on a sampled set of 200 facts. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions software such as Flan-T5, Wikidata Integrator, Huggingface Models, and Metaseq, but does not provide version numbers for these dependencies, which limits reproducibility. |
| Experiment Setup | Yes | Our default sampling parameter, K, is set to 4. (...) We set 22 as the threshold for KaRR, and the threshold is chosen by aligning the proportion of GPT2-XL known facts distinguished by humans on a sampled set of facts. Also as a result of alignment with the human-recognized proportion, the threshold for the K-Prompts baseline implemented in this paper is set to 0.13. (...) In total, we obtained 4,140 valid manually annotated prompts for 410 facts in T-REx. Illustrative sketches of these procedures follow the table. |
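
The Research Type row quotes a 0.43 Kendall's τ between the automatic assessment and human judgments of the 20 evaluated LLMs. As a minimal sketch of how such a rank correlation is computed, the snippet below applies `scipy.stats.kendalltau` to two hypothetical per-model score lists; the numbers are placeholders, not the paper's data.

```python
# Minimal sketch: rank correlation between automatic and human scores.
# The score lists are hypothetical placeholders, not values from the paper.
from scipy.stats import kendalltau

karr_scores = [19.5, 13.4, 7.0, 5.5, 3.2]      # hypothetical automatic scores, one per LLM
human_scores = [0.62, 0.47, 0.31, 0.35, 0.12]  # hypothetical human-assessed knowledge scores

tau, p_value = kendalltau(karr_scores, human_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```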
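
The Open Datasets row notes that entity aliases are pulled from Wikidata with Wikidata Integrator. The sketch below illustrates the same alias-lookup step via the public Wikidata SPARQL endpoint rather than the Wikidata Integrator library; the helper name and query are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch: fetch English labels and aliases for a Wikidata entity
# via the public SPARQL endpoint (an alternative to Wikidata Integrator).
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def get_english_aliases(qid: str) -> list:
    """Return English label and aliases for a Wikidata entity id such as 'Q76'."""
    query = f"""
    SELECT ?alias WHERE {{
      {{ wd:{qid} rdfs:label ?alias . }} UNION {{ wd:{qid} skos:altLabel ?alias . }}
      FILTER (LANG(?alias) = "en")
    }}
    """
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "alias-lookup-sketch/0.1"},
    )
    resp.raise_for_status()
    return [b["alias"]["value"] for b in resp.json()["results"]["bindings"]]

# Example: aliases for Barack Obama (Q76).
print(get_english_aliases("Q76"))
```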
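
The Dataset Splits row describes sampling at most 20 facts per relation across the 600 T-REx relations. Below is a minimal sketch of that per-relation subsampling, assuming facts are (subject, relation, object) triples; the function name and fact format are hypothetical.

```python
# Minimal sketch: sample at most 20 facts per relation from a fact list.
import random
from collections import defaultdict

MAX_FACTS_PER_RELATION = 20

def sample_assessment_facts(facts, seed=0):
    """facts: iterable of (subject_id, relation_id, object_id) triples."""
    rng = random.Random(seed)
    by_relation = defaultdict(list)
    for fact in facts:
        by_relation[fact[1]].append(fact)

    sampled = []
    for relation, rel_facts in by_relation.items():
        k = min(MAX_FACTS_PER_RELATION, len(rel_facts))
        sampled.extend(rng.sample(rel_facts, k))
    return sampled
```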
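
The Experiment Setup row states that the KaRR threshold of 22 (and 0.13 for the K-Prompts baseline) is chosen by aligning the proportion of facts counted as known with the proportion humans judged GPT2-XL to know on a sampled fact set. The sketch below shows one way such proportion alignment could be implemented; the procedure is inferred from the quoted text, and the example fraction is made up.

```python
# Minimal sketch: pick a score threshold so that the share of facts scored
# above it matches the fraction of facts humans judged the model to know.

def align_threshold(scores, human_known_fraction):
    """Return a threshold t such that the share of scores >= t approximates
    the human-estimated fraction of known facts."""
    ranked = sorted(scores, reverse=True)
    n_known = round(human_known_fraction * len(ranked))
    if n_known == 0:
        return ranked[0] + 1  # no fact counted as known
    return ranked[n_known - 1]

# Hypothetical usage: KaRR scores for 200 sampled facts, with a made-up 35%
# of facts judged known by human annotators.
# threshold = align_threshold(gpt2_xl_scores, human_known_fraction=0.35)
```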