SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research
Authors: Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, Kai Yu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on the most advanced LLMs show that, although GPT-4 achieves SOTA performance compared to other LLMs, there is still substantial room for improvement, especially for dynamic questions. |
| Researcher Affiliation | Academia | X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China. {slt19990817, csyanghan, zhaomengxin, mada123}@sjtu.edu.cn; {ieee-szn, 15368493547, chenlusz, kai.yu}@sjtu.edu.cn |
| Pseudocode | No | The paper describes methods and data collection processes but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The codes and data are publicly available on https://github.com/OpenDFM/SciEval. |
| Open Datasets | Yes | The primary source of Static Data is Socratic Q&A, a community-driven website that covers a wide range of subjects such as science and literature. Specifically, we collect data from the fields of biology, chemistry, and physics. ... To make the dataset more diverse and comprehensive, we further integrate data from some publicly available datasets: MedQA (Jin et al. 2021) is a free-form multiple-choice OpenQA dataset... PubMedQA (Jin et al. 2019) is a biomedical question-answering dataset... Reagent Selection (Guo et al. 2023)... For chemistry data, we use the basic information and properties of molecules crawled from PubChem to create data. |
| Dataset Splits | Yes | For Static Data, we further split the data into dev, valid, and test sets. For each data source, each knowledge domain, and each discipline, we randomly select 5 data points to form the dev set, which can be used for few-shot learning, and we split the remaining data with a ratio of 1:9 to construct the valid set and test set respectively (see the split sketch after this table). |
| Hardware Specification | No | The paper lists the models evaluated but does not provide specific hardware details (GPU/CPU models, memory, etc.) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers needed to replicate the experimental environment. |
| Experiment Setup | No | The paper describes the prompt settings and the types of evaluation settings (Answer-Only, Chain-Of-Thought, 3-Shot; see the prompt sketch after this table) but does not provide specific experimental setup details such as hyperparameters (e.g., learning rates, batch sizes, number of epochs) for the evaluations or for the data generation process. |
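To make the Static Data split concrete, here is a minimal Python sketch of the procedure quoted in the Dataset Splits row: 5 examples per (source, domain, discipline) group form the dev set, and the remainder is split 1:9 into valid and test. The record field names (`source`, `domain`, `discipline`) and the per-group shuffling are assumptions for illustration, not taken from the released code.

```python
import random
from collections import defaultdict

def split_static_data(records, seed=0):
    """Sketch of the SciEval Static Data split: 5 examples per
    (source, domain, discipline) group go to dev; the rest is
    split 1:9 into valid and test. Field names are assumed."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        groups[(r["source"], r["domain"], r["discipline"])].append(r)

    dev, valid, test = [], [], []
    for items in groups.values():
        rng.shuffle(items)
        dev.extend(items[:5])        # 5 per group, used for few-shot dev
        rest = items[5:]
        cut = len(rest) // 10        # 1:9 valid:test ratio on the remainder
        valid.extend(rest[:cut])
        test.extend(rest[cut:])
    return dev, valid, test
```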
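The three evaluation settings named in the Experiment Setup row (Answer-Only, Chain-Of-Thought, 3-Shot) can be pictured with a hypothetical prompt builder. The paper does not give the exact prompt wording, so the templates below are assumptions; only the setting names come from the source.

```python
def build_prompt(question, setting="answer-only", few_shot_examples=None):
    """Hypothetical prompt builder for the three SciEval evaluation
    settings; the template wording is an assumption."""
    if setting == "answer-only":
        # Answer-Only: ask for the answer directly, no reasoning.
        return f"{question}\nAnswer:"
    if setting == "chain-of-thought":
        # Chain-Of-Thought: elicit step-by-step reasoning first.
        return f"{question}\nLet's think step by step."
    if setting == "3-shot":
        # 3-Shot: prepend three solved examples (e.g., from the dev set).
        shots = "\n\n".join(
            f"{ex['question']}\nAnswer: {ex['answer']}"
            for ex in (few_shot_examples or [])[:3]
        )
        return f"{shots}\n\n{question}\nAnswer:"
    raise ValueError(f"unknown setting: {setting}")
```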