SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
Authors: Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, Wei Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we introduce an expansive benchmark suite SCIBENCH for LLMs. SCIBENCH contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. |
| Researcher Affiliation | Academia | 1University of California, Los Angeles, Los Angeles, CA, USA 2California Institute of Technology, Pasadena, CA, USA 3University of Washington, Seattle, WA, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Finally, we make our dataset and code publicly available at this repository. |
| Open Datasets | Yes | SCIBENCH contains a carefully curated dataset of college-level scientific problems, including 869 problems collected from widely-used textbooks in college-level Chemistry, Physics, and Mathematics courses. ... Finally, we make our dataset and code publicly available at this repository. |
| Dataset Splits | Yes | In the few-shot setting, a few examples are given to the models before the test example. This aims to assess their capability to learn new information from the demonstrations and incorporate it into their problem-solving processes. ... Few-shot examples, including solutions, are randomly selected from problems within each textbook. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using "Python code" and "Wolfram language" and refers to "Easy OCR (Jaided AI, 2022)", but it does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | We set temperature to zero for all models to reduce the randomness of the predictions. Few-shot examples, including solutions, are randomly selected from problems within each textbook. When external tools are used, we add a code snippet that translates the solution into specific programming languages in all few-shot examples. The code snippets are verified by human annotators that will produce the correct output. In terms of evaluation metrics, we compare the model outputs with the correct answers, allowing a relative tolerance of 5%. |
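The evaluation criterion quoted above (comparing model outputs to correct answers "allowing a relative tolerance of 5%") can be sketched as a small checking function. This is a minimal illustration, not the paper's actual code; the function name `within_tolerance` and the choice to measure the tolerance relative to the ground-truth answer are assumptions.

```python
def within_tolerance(predicted: float, correct: float, rel_tol: float = 0.05) -> bool:
    """Check whether a predicted numeric answer matches the ground truth
    within a relative tolerance (5% by default, per the paper's metric).

    Hypothetical sketch: the tolerance is taken relative to the correct
    answer's magnitude, which the paper does not specify explicitly.
    """
    if correct == 0:
        # Guard the zero-answer case to avoid division by zero.
        return predicted == 0
    return abs(predicted - correct) / abs(correct) <= rel_tol


# A prediction of 9.6 against a true answer of 10.0 differs by 4%,
# so it would count as correct under the 5% tolerance.
```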