SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
Authors: Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, Wei Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we introduce an expansive benchmark suite SCIBENCH for LLMs. SCIBENCH contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. |
| Researcher Affiliation | Academia | 1University of California, Los Angeles, Los Angeles, CA, USA 2California Institute of Technology, Pasadena, CA, USA 3University of Washington, Seattle, WA, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Finally, we make our dataset and code publicly available at this repository. |
| Open Datasets | Yes | SCIBENCH contains a carefully curated dataset of college-level scientific problems, including 869 problems collected from widely-used textbooks in college-level Chemistry, Physics, and Mathematics courses. ... Finally, we make our dataset and code publicly available at this repository. |
| Dataset Splits | Yes | In the few-shot setting, a few examples are given to the models before the test example. This aims to assess their capability to learn new information from the demonstrations and incorporate it into their problem-solving processes. ... Few-shot examples, including solutions, are randomly selected from problems within each textbook. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using "Python code" and "Wolfram language" and refers to "Easy OCR (Jaided AI, 2022)", but it does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | We set temperature to zero for all models to reduce the randomness of the predictions. Few-shot examples, including solutions, are randomly selected from problems within each textbook. When external tools are used, we add a code snippet that translates the solution into specific programming languages in all few-shot examples. The code snippets are verified by human annotators that will produce the correct output. In terms of evaluation metrics, we compare the model outputs with the correct answers, allowing a relative tolerance of 5%. |
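The evaluation criterion quoted above (comparing model outputs to correct answers "allowing a relative tolerance of 5%") can be sketched as a small checking function. This is a minimal illustration, not the paper's actual code; the function name `within_tolerance` and the choice to measure the tolerance relative to the ground-truth answer are assumptions.

```python
def within_tolerance(predicted: float, correct: float, rel_tol: float = 0.05) -> bool:
    """Check whether a predicted numeric answer matches the ground truth
    within a relative tolerance (5% by default, per the paper's metric).

    Hypothetical sketch: the tolerance is taken relative to the correct
    answer's magnitude, which the paper does not specify explicitly.
    """
    if correct == 0:
        # Guard the zero-answer case to avoid division by zero.
        return predicted == 0
    return abs(predicted - correct) / abs(correct) <= rel_tol


# A prediction of 9.6 against a true answer of 10.0 differs by 4%,
# so it would count as correct under the 5% tolerance.
```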