KoLA: Carefully Benchmarking World Knowledge of Large Language Models
Authors: Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Kaifeng Yun, Linlu Gong, Nianyi Lin, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou, Zhiyuan Liu, Xu Bin, Jie Tang, Juanzi Li
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate 28 open-source and commercial LLMs and obtain some intriguing findings. |
| Researcher Affiliation | Academia | Tsinghua University, Beijing, China, 100084 |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The evaluation source codes and data samples for all the tasks are submitted as supplementary material. We release a toolkit to support KoLA-related functions on GitHub, including (1) easy submission, (2) result reproduction, and (3) data acquisition. |
| Open Datasets | Yes | In KoLA, we host a new competition season every three months. For each season, we crawl and annotate 500 recently published articles as the evolving data. We chose Wikipedia as our known data source due to its common use. The news data is from The Guardian, and we access it strictly following its terms and conditions. The fiction data is from Archive of Our Own (AO3). We select Wikidata5M (Wang et al., 2021a), a high-quality subset of Wikidata, as the basis. |
| Dataset Splits | No | The paper specifies the size of its test sets (e.g., 'randomly select 300 examples' for Few-NERD, 'preserve 100 triplets' for ETM) and the overall pool of data available, but it does not provide details on training or validation splits for the benchmark datasets themselves, as it focuses on evaluating pre-trained models. (A hedged sampling sketch follows this table.) |
| Hardware Specification | Yes | The evaluation experiments are conducted on an Ubuntu 20.04.4 server equipped with 112 Intel Xeon(R) Platinum 8336C CPU cores, and graphic cards that contained 8 NVIDIA A100 SXM 80GB GPUs. |
| Software Dependencies | Yes | The CUDA version is 11.4, the Python version is 3.10.0, the PyTorch version is 2.0.0, and the transformers version is 4.28.1. (A hedged environment-check sketch follows this table.) |
| Experiment Setup | Yes | Table 25: The Task Adaptation Parameters in KoLA (Season 1). Max Train and Max Eval correspond to the maximum number of training examples and test cases. Output corresponds to the output format. We include both the necessary and the additional parameters required to fully specify evaluation (i.e., the number of evaluation instances and runs, which influence the statistical validity and reliability of the results), even though they are not strictly part of the adaptation process as HELM (Liang et al., 2022) presents it. (A hedged sketch of these fields follows this table.) |
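
The Dataset Splits row notes that test sets are built by fixed-size subsampling (e.g., 300 examples for Few-NERD, 100 triplets for ETM). Below is a minimal sketch of that kind of reproducible subsampling; the function name, seed, and placeholder pool are assumptions, not the authors' code.

```python
# Hypothetical sketch of fixed-size test-set subsampling, as in "randomly
# select 300 examples" for Few-NERD. Seed, names, and the placeholder pool
# are illustrative assumptions, not the authors' code.
import random

def sample_test_set(pool, n, seed=0):
    """Draw a reproducible fixed-size evaluation subset from a data pool."""
    rng = random.Random(seed)
    return rng.sample(pool, n)

few_nerd_subset = sample_test_set(list(range(10_000)), 300)  # placeholder pool
print(len(few_nerd_subset))  # -> 300
```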
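
The Software Dependencies row pins exact versions (Python 3.10.0, PyTorch 2.0.0, transformers 4.28.1, CUDA 11.4). A minimal environment check against those reported versions is sketched below; the check itself is illustrative and is not part of the authors' toolkit.

```python
# Minimal environment check mirroring the software stack reported in the
# paper. Version numbers come from the paper; the check is illustrative.
import sys

import torch
import transformers

assert sys.version_info[:2] == (3, 10), "paper reports Python 3.10.0"
assert torch.__version__.startswith("2.0.0"), "paper reports PyTorch 2.0.0"
assert transformers.__version__ == "4.28.1", "paper reports transformers 4.28.1"
# torch.version.cuda is the CUDA version PyTorch was built against; the paper
# reports system CUDA 11.4, so an 11.x build is the closest match.
assert torch.version.cuda and torch.version.cuda.startswith("11"), \
    "paper reports CUDA 11.4"
print("Environment matches the versions reported in the paper.")
```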
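
The Experiment Setup row references Table 25's adaptation parameters (Max Train, Max Eval, Output). A hedged record type mirroring those field names is sketched below; the class, task name, and values are placeholders, not taken from the paper.

```python
# Hypothetical record mirroring the fields of Table 25 (Max Train, Max Eval,
# Output). The concrete values below are placeholders, not the paper's.
from dataclasses import dataclass

@dataclass
class TaskAdaptation:
    task: str        # benchmark task name
    max_train: int   # maximum number of training examples
    max_eval: int    # maximum number of test cases
    output: str      # expected output format

example = TaskAdaptation(task="example-task", max_train=5, max_eval=300,
                         output="text")
print(example)
```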