MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

Authors: Yan Cai, Linlin Wang, Ye Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, Liang He

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. We perform extensive experiments and offer detailed analyses to provide diverse perspectives for evaluating clinical knowledge recall and reasoning capabilities of LLMs across a range of branches of medicine.
Researcher Affiliation | Academia | Yan Cai (1), Linlin Wang (1,2), Ye Wang (1), Gerard de Melo (3,4), Ya Zhang (2,5), Yanfeng Wang (2,5), Liang He (1); affiliations: (1) East China Normal University, (2) Shanghai Artificial Intelligence Laboratory, (3) Hasso Plattner Institute, (4) University of Potsdam, (5) Shanghai Jiao Tong University.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides links to third-party models and resources (e.g., 'https://github.com/baichuan-inc/Baichuan-13B', 'https://github.com/michael-wzhu/ChatMed') but does not state that the code for its own methodology or benchmark is open-source, nor does it provide a link to such code.
Open Datasets | No | The paper introduces a new benchmark, MedBench, stating it comprises '40,041 questions sourced from authentic examination exercises and medical reports' and is 'exclusively sourced from the latest validated exams and expert-annotated EHRs'. It also mentions collecting 'representative exercises from the Chinese Medical Licensing Exam (CNMLE), Resident Standardization Training Exam, and Doctor in-charge Qualification Exam'. However, it does not provide a specific link, DOI, or repository for public access to the MedBench dataset itself.
Dataset Splits | No | The paper states, 'We conduct extensive experiments to evaluate the five-shot performance of LLMs, ensuring their capability to respond in a multiple-choice format.' and 'Furthermore, we partition the MedBench dataset based on exams, medical subdiscipline, and question types, and perform independent testing on each subset to enable a comprehensive analysis.' While it describes partitioning for analysis, it does not specify explicit training/validation/test dataset splits for reproducing its own experiments.
Hardware Specification | No | The paper states, 'The computations were performed in the ECNU Multifunctional Platform for Innovation (001)', but does not provide specific hardware details such as GPU or CPU models.
Software Dependencies | No | The paper mentions evaluating various LLMs (e.g., 'ChatGPT', 'ChatGLM', 'Baichuan-13B', 'HuaTuo', and 'ChatMed') and using their APIs or local deployments. However, it does not provide specific version numbers for underlying software dependencies or libraries.
Experiment Setup | Yes | We conduct extensive experiments to evaluate the five-shot performance of LLMs, ensuring their capability to respond in a multiple-choice format. We leverage the API for ChatGPT and opt for local deployment to facilitate evaluations for other LLMs.
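
The Experiment Setup row describes a five-shot, multiple-choice evaluation protocol, but the paper's harness is not released. The listing below is a minimal sketch of how such an evaluation could be wired, assuming a generic model_fn callable that wraps either an API client (e.g., for ChatGPT) or a locally deployed model; the exemplar content, prompt template, answer-extraction heuristic, and all function names are illustrative assumptions, not the authors' implementation.

    import re

    # Hypothetical five-shot exemplars. In the paper each prompt carries five solved
    # exemplars; a single placeholder exemplar is shown here to keep the sketch short.
    FEW_SHOT_EXEMPLARS = [
        {
            "question": "A deficiency of which vitamin causes scurvy?",
            "options": {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
            "answer": "B",
        },
    ]


    def build_prompt(exemplars, item):
        """Concatenate the solved exemplars followed by the unsolved test question."""
        blocks = []
        for ex in exemplars:
            options = "\n".join(f"{key}. {text}" for key, text in ex["options"].items())
            blocks.append(f"Question: {ex['question']}\n{options}\nAnswer: {ex['answer']}")
        options = "\n".join(f"{key}. {text}" for key, text in item["options"].items())
        blocks.append(f"Question: {item['question']}\n{options}\nAnswer:")
        return "\n\n".join(blocks)


    def extract_choice(completion):
        """Read the first option letter (A-E) in the model output as the prediction."""
        match = re.search(r"[A-E]", completion)
        return match.group(0) if match else None


    def evaluate(model_fn, exemplars, items):
        """Accuracy over multiple-choice items; model_fn maps a prompt string to a
        completion string and may wrap an API client or a locally deployed model."""
        correct = sum(
            extract_choice(model_fn(build_prompt(exemplars, item))) == item["answer"]
            for item in items
        )
        return correct / len(items)


    if __name__ == "__main__":
        def always_b(prompt: str) -> str:
            # Dummy stand-in for an API call or local model, purely to show the call pattern.
            return "B"

        print(evaluate(always_b, FEW_SHOT_EXEMPLARS, list(FEW_SHOT_EXEMPLARS)))  # -> 1.0

Subset-level accuracy (per exam, medical subdiscipline, or question type, as in the paper's analysis) would follow from calling evaluate separately on each partition of the test items.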