Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation

Authors: Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu, Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei Feng, Yanghua Xiao

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an evaluation of 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed the average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management.
Researcher Affiliation | Collaboration | (1) Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China; (2) School of Information Science and Engineering, East China University of Science and Technology; (3) Xiaohongshu Inc.; (4) School of Data Science, Fudan University; (5) Fudan-Aishu Cognitive Intelligence Joint Research Center
Pseudocode | No | The paper describes methods like 'Auto Updating' and 'Auto Annotation' but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | All the evaluation code and data are open sourced at https://github.com/MikeGu721/XiezhiBenchmark
Open Datasets | Yes | Xiezhi consists of 249,587 questions drawn mainly from two sources. The first category includes nearly 170k multiple-choice questions collected from six different examinations in China: elementary school exams, middle school entrance exams, college entrance exams, undergraduate exams, graduate entrance exams, and adult education exams. These questions are all open sourced, and many Chinese knowledge evaluation datasets have employed them (Huang et al. 2023; Liu et al. 2023).
Dataset Splits | No | Although previous studies use a 5-shot setting, our questions have far more options each; taking the maximum input length of each LLM into consideration, we use at most 3 examples in our few-shot learning experiments. The demonstration examples were obtained from Xiezhi-Train, a set of 2,555 questions absent from Xiezhi-Speciality and Xiezhi-Interdiscipline, each sharing at least two labels with the test question; an illustration is depicted in Fig. 4. (A selection sketch follows this table.)
Hardware Specification | Yes | Our experiment was carried out on a DGX Station with 8 Tesla A100 GPUs (80 GB memory each).
Software Dependencies | No | To reduce the effect of randomness on our experiment, we set the random seed of the Python libraries used in our experiment, namely Numpy, Random, and Torch, to 42. (A seeding sketch follows this table.)
Experiment Setup | Yes | To give more precise evaluation results, we propose a new evaluation setting in this paper. We set 50 options for each multiple-choice question, whereas previous researchers used only 4, which significantly reduces the accuracy of random guessing and thus better reveals the model's real capabilities. Rather than using instructions to query the choice made by each model, as previous researchers did, we rank all options by each model's generation probability, which avoids inaccurate evaluations caused by a model's inability to answer multiple-choice questions or by errors in extracting the generated content. ... The experiments are conducted under 0-shot, 1-shot, and 3-shot demonstration settings. (A ranking sketch follows this table.)
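The Dataset Splits row describes drawing up to 3 demonstrations from Xiezhi-Train, requiring at least two discipline labels shared with the test question. Below is a minimal sketch of that selection step; the dict field name `labels` and the function name are illustrative assumptions, not the released schema.

```python
import random

def select_demonstrations(test_question, train_pool, k=3, min_shared_labels=2, seed=42):
    """Pick up to k demonstrations whose discipline labels overlap the test question's.

    `test_question` and the items of `train_pool` are assumed to be dicts with a
    `labels` field listing discipline labels (a hypothetical schema for illustration).
    """
    test_labels = set(test_question["labels"])
    # Keep only training questions sharing at least `min_shared_labels` labels.
    candidates = [
        q for q in train_pool
        if len(test_labels & set(q["labels"])) >= min_shared_labels
    ]
    rng = random.Random(seed)
    rng.shuffle(candidates)
    # 0-shot, 1-shot, and 3-shot settings simply truncate this list further.
    return candidates[:k]
```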
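The Software Dependencies row reports that the random seeds of Numpy, Random, and Torch were fixed to 42. A minimal seeding sketch, assuming PyTorch-based inference (the paper does not list library versions):

```python
import random

import numpy as np
import torch

SEED = 42  # value reported in the paper

# Fix the seeds of the three libraries named in the paper so that shuffling,
# sampling, and any stochastic tensor ops are repeatable across runs.
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # assumption: multi-GPU inference on the A100 station
```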
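The Experiment Setup row states that each question carries 50 options and that options are ranked by generation probability instead of parsing a generated answer. The sketch below shows one way to compute such a ranking with a Hugging Face causal LM; the model name, prompt handling, and function names are placeholder assumptions, not the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates 47 different LLMs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_log_likelihood(question: str, option: str) -> float:
    """Sum of token log-probabilities of `option` conditioned on `question`."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    option_ids = tokenizer(option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict the token at position i + 1, so the slice
    # below covers exactly the positions that generate the option tokens.
    option_len = option_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, -option_len - 1 : -1], dim=-1)
    token_log_probs = log_probs.gather(1, option_ids[0].unsqueeze(1)).squeeze(1)
    return token_log_probs.sum().item()

def rank_options(question: str, options: list[str]) -> list[str]:
    """Return the options sorted from most to least likely continuation."""
    scores = {opt: option_log_likelihood(question, opt) for opt in options}
    return sorted(options, key=scores.get, reverse=True)
```

With 50 options per question, this ranking reduces the chance of a correct random guess to 2%, which is the motivation the paper gives for the enlarged option set.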