SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research
Authors: Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, Kai Yu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on the most advanced LLMs show that, although GPT-4 achieves SOTA performance compared to other LLMs, there is still substantial room for improvement, especially for dynamic questions. |
| Researcher Affiliation | Academia | X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China. {slt19990817, csyanghan, zhaomengxin, mada123}@sjtu.edu.cn; {ieee-szn, 15368493547, chenlusz, kai.yu}@sjtu.edu.cn |
| Pseudocode | No | The paper describes methods and data collection processes but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The codes and data are publicly available on https://github.com/OpenDFM/SciEval. |
| Open Datasets | Yes | The primary source of Static Data is Socratic Q&A, a community-driven website that covers a wide range of subjects such as science and literature. Specifically, we collect data from the fields of biology, chemistry, and physics. ... To make the dataset more diverse and comprehensive, we further integrate data from some publicly available datasets: MedQA (Jin et al. 2021) is a free-form multiple-choice OpenQA dataset... PubMedQA (Jin et al. 2019) is a biomedical question-answering dataset... Reagent Selection (Guo et al. 2023)... For chemistry data, we use the basic information and properties of molecules crawled from PubChem to create data. |
| Dataset Splits | Yes | For Static Data, we further split the data into dev, valid, and test sets. For each data source, each knowledge domain, and each discipline, we randomly select 5 data points to form the dev set, which can be used for few-shot learning, and we split the remaining data with a ratio of 1:9 to construct the valid set and test set respectively (see the split sketch after this table). |
| Hardware Specification | No | The paper lists the models evaluated but does not provide specific hardware details (GPU/CPU models, memory, etc.) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers needed to replicate the experimental environment. |
| Experiment Setup | No | The paper describes the prompt settings and the types of evaluation settings (Answer-Only, Chain-Of-Thought, 3-Shot; see the prompt sketch after this table) but does not provide specific experimental setup details such as hyperparameters (e.g., learning rates, batch sizes, number of epochs) for the evaluations or for the data generation process. |
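To make the Static Data split concrete, here is a minimal Python sketch of the procedure quoted in the Dataset Splits row: 5 examples per (source, domain, discipline) group form the dev set, and the remainder is split 1:9 into valid and test. The record field names (`source`, `domain`, `discipline`) and the per-group shuffling are assumptions for illustration, not taken from the released code.

```python
import random
from collections import defaultdict

def split_static_data(records, seed=0):
    """Sketch of the SciEval Static Data split: 5 examples per
    (source, domain, discipline) group go to dev; the rest is
    split 1:9 into valid and test. Field names are assumed."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        groups[(r["source"], r["domain"], r["discipline"])].append(r)

    dev, valid, test = [], [], []
    for items in groups.values():
        rng.shuffle(items)
        dev.extend(items[:5])        # 5 per group, used for few-shot dev
        rest = items[5:]
        cut = len(rest) // 10        # 1:9 valid:test ratio on the remainder
        valid.extend(rest[:cut])
        test.extend(rest[cut:])
    return dev, valid, test
```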
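The three evaluation settings named in the Experiment Setup row (Answer-Only, Chain-Of-Thought, 3-Shot) can be pictured with a hypothetical prompt builder. The paper does not give the exact prompt wording, so the templates below are assumptions; only the setting names come from the source.

```python
def build_prompt(question, setting="answer-only", few_shot_examples=None):
    """Hypothetical prompt builder for the three SciEval evaluation
    settings; the template wording is an assumption."""
    if setting == "answer-only":
        # Answer-Only: ask for the answer directly, no reasoning.
        return f"{question}\nAnswer:"
    if setting == "chain-of-thought":
        # Chain-Of-Thought: elicit step-by-step reasoning first.
        return f"{question}\nLet's think step by step."
    if setting == "3-shot":
        # 3-Shot: prepend three solved examples (e.g., from the dev set).
        shots = "\n\n".join(
            f"{ex['question']}\nAnswer: {ex['answer']}"
            for ex in (few_shot_examples or [])[:3]
        )
        return f"{shots}\n\n{question}\nAnswer:"
    raise ValueError(f"unknown setting: {setting}")
```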